Java Code Examples for org.apache.commons.lang.StringUtils#getLevenshteinDistance()

The following examples show how to use org.apache.commons.lang.StringUtils#getLevenshteinDistance() . You can vote up the ones you like or vote down the ones you don't like, and go to the original project or source file by following the links above each example. You may check out the related API usage on the sidebar.

Example 1

Source File: ValueDataUtil.java From pentaho-kettle with Apache License 2.0

5 votes

/**
 * Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source
 * string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions
 * required to transform s into t.
 */
public static Long getLevenshtein_Distance( ValueMetaInterface metaA, Object dataA, ValueMetaInterface metaB,
  Object dataB ) {
  if ( dataA == null || dataB == null ) {
    return null;
  }
  return new Long( StringUtils.getLevenshteinDistance( dataA.toString(), dataB.toString() ) );
}

Example 2

Source File: DuplicateDataDetector.java From rya with Apache License 2.0

5 votes

@Override
public boolean areObjectsApproxEquals(final IRI lhs, final IRI rhs) {
    if (isOnlyOneNull(lhs, rhs)) {
        return false;
    }
    if (Objects.equals(lhs, rhs)) {
        return true;
    }
    final String uriString1 = lhs.stringValue();
    final String uriString2 = rhs.stringValue();
    if (StringUtils.equalsIgnoreCase(uriString1, uriString2)) {
        // They're exactly equals so get out
        return true;
    } else if (tolerance.getValue() == 0) {
        // If they're not exactly equals with zero tolerance then get out
        return false;
    }
    final int distance = StringUtils.getLevenshteinDistance(uriString1, uriString2);
    // Check based on tolerance
    switch (tolerance.getToleranceType()) {
        case PERCENTAGE:
            if (uriString1.length() == 0) {
                return uriString1.length() == uriString2.length();
            }
            if (tolerance.getValue() >= 1) {
                return true;
            }
            return ((double)distance / uriString1.length()) <= tolerance.getValue();
        case DIFFERENCE:
        default:
            return distance <= tolerance.getValue();
    }
}

Example 3

Source File: DuplicateDataDetector.java From rya with Apache License 2.0

5 votes

@Override
public boolean areObjectsApproxEquals(final String lhs, final String rhs) {
    if (isOnlyOneNull(lhs, rhs)) {
        return false;
    }
    if (StringUtils.equalsIgnoreCase(lhs, rhs)) {
        // They're exactly equals so get out
        return true;
    } else if (tolerance.getValue() == 0) {
        // If they're not exactly equals with zero tolerance then get out
        return false;
    }

    // Only check one-way. Terms are not bi-directionally equivalent
    // unless specified.
    final List<String> lhsTermEquivalents = equivalentTermsMap.get(lhs);
    if (lhsTermEquivalents != null && lhsTermEquivalents.contains(rhs)) {
        return true;
    }
    final int distance = StringUtils.getLevenshteinDistance(lhs, rhs);
    // Check based on tolerance
    switch (tolerance.getToleranceType()) {
        case PERCENTAGE:
            if (lhs.length() == 0) {
                return lhs.length() == rhs.length();
            }
            if (tolerance.getValue() >= 1) {
                return true;
            }
            return ((double)distance / lhs.length()) <= tolerance.getValue();
        case DIFFERENCE:
        default:
            return distance <= tolerance.getValue();
    }
}

Example 4

Source File: MCRAbstractMerger.java From mycore with GNU General Public License v3.0

5 votes

/**
 *  Two abstracts are regarded probably same
 *  if their levenshtein distance is less than a configured percentage of the text length.
 */
@Override
public boolean isProbablySameAs(MCRMerger other) {
    if (!(other instanceof MCRAbstractMerger)) {
        return false;
    }

    String textOther = ((MCRAbstractMerger) other).text;
    int length = Math.min(text.length(), textOther.length());
    int distance = StringUtils.getLevenshteinDistance(text, textOther);
    System.out.println(distance);
    return (distance * 100 / length) < MAX_DISTANCE_PERCENT;
}

Example 5

Source File: ISimilarityMatcher.java From xtext-eclipse with Eclipse Public License 2.0

5 votes

@Override
public boolean isSimilar(String s0, String s1) {
	if(Strings.isEmpty(s0) || Strings.isEmpty(s1)) {
		return false;
	}
	double levenshteinDistance = StringUtils.getLevenshteinDistance(s0, s1);
	return levenshteinDistance <= 1;
}

Example 6

Source File: ValueDataUtil.java From hop with Apache License 2.0

5 votes

/**
 * Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source
 * string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions
 * required to transform s into t.
 */
public static Long getLevenshtein_Distance( IValueMeta metaA, Object dataA, IValueMeta metaB,
                                            Object dataB ) {
  if ( dataA == null || dataB == null ) {
    return null;
  }
  return new Long( StringUtils.getLevenshteinDistance( dataA.toString(), dataB.toString() ) );
}

Example 7

Source File: NameMatcher.java From Pushjet-Android with BSD 2-Clause "Simplified" License

4 votes

/**
 * Locates the best match for the given pattern in the given set of candidate items.
 *
 * @return The match if exactly 1 match found, null if no matches or multiple matches.
 */
public String find(String pattern, Collection<String> items) {
    this.pattern = pattern;
    matches.clear();
    candidates.clear();

    if (items.contains(pattern)) {
        matches.add(pattern);
        return pattern;
    }

    if (pattern.length() == 0) {
        return null;
    }

    Pattern camelCasePattern = getPatternForName(pattern);
    Pattern normalisedCamelCasePattern = Pattern.compile(camelCasePattern.pattern(), Pattern.CASE_INSENSITIVE);
    String normalisedPattern = pattern.toUpperCase();

    Set<String> caseInsensitiveMatches = new TreeSet<String>();
    Set<String> caseSensitiveCamelCaseMatches = new TreeSet<String>();
    Set<String> caseInsensitiveCamelCaseMatches = new TreeSet<String>();

    for (String candidate : items) {
        if (candidate.equalsIgnoreCase(pattern)) {
            caseInsensitiveMatches.add(candidate);
        }
        if (camelCasePattern.matcher(candidate).matches()) {
            caseSensitiveCamelCaseMatches.add(candidate);
            continue;
        }
        if (normalisedCamelCasePattern.matcher(candidate).lookingAt()) {
            caseInsensitiveCamelCaseMatches.add(candidate);
            continue;
        }
        if (StringUtils.getLevenshteinDistance(normalisedPattern, candidate.toUpperCase()) <= Math.min(3, pattern.length() / 2)) {
            candidates.add(candidate);
        }
    }

    if (!caseInsensitiveMatches.isEmpty()) {
        matches.addAll(caseInsensitiveMatches);
    } else if (!caseSensitiveCamelCaseMatches.isEmpty()) {
        matches.addAll(caseSensitiveCamelCaseMatches);
    } else {
        matches.addAll(caseInsensitiveCamelCaseMatches);
    }

    if (matches.size() == 1) {
        return matches.first();
    }

    return null;
}

Example 8

Source File: FuzzyMatch.java From pentaho-kettle with Apache License 2.0

4 votes

private Object[] doDistance( Object[] row ) throws KettleValueException {
  // Reserve room
  Object[] rowData = buildEmptyRow();

  Iterator<Object[]> it = data.look.iterator();

  long distance = -1;

  // Object o=row[data.indexOfMainField];
  String lookupvalue = getInputRowMeta().getString( row, data.indexOfMainField );

  while ( it.hasNext() ) {
    // Get cached row data
    Object[] cachedData = it.next();
    // Key value is the first value
    String cacheValue = (String) cachedData[0];

    int cdistance = -1;
    String usecacheValue = cacheValue;
    String uselookupvalue = lookupvalue;
    if ( !meta.isCaseSensitive() ) {
      usecacheValue = cacheValue.toLowerCase();
      uselookupvalue = lookupvalue.toLowerCase();
    }

    switch ( meta.getAlgorithmType() ) {
      case FuzzyMatchMeta.OPERATION_TYPE_DAMERAU_LEVENSHTEIN:
        cdistance = Utils.getDamerauLevenshteinDistance( usecacheValue, uselookupvalue );
        break;
      case FuzzyMatchMeta.OPERATION_TYPE_NEEDLEMAN_WUNSH:
        cdistance = Math.abs( (int) new NeedlemanWunsch().score( usecacheValue, uselookupvalue ) );
        break;
      default:
        cdistance = StringUtils.getLevenshteinDistance( usecacheValue, uselookupvalue );
        break;
    }

    if ( data.minimalDistance <= cdistance && cdistance <= data.maximalDistance ) {
      if ( meta.isGetCloserValue() ) {
        if ( cdistance < distance || distance == -1 ) {
          // Get closer value
          // minimal distance
          distance = cdistance;
          int index = 0;
          rowData[index++] = cacheValue;
          // Add metric value?
          if ( data.addValueFieldName ) {
            rowData[index++] = distance;
          }
          // Add additional return values?
          if ( data.addAdditionalFields ) {
            for ( int i = 0; i < meta.getValue().length; i++ ) {
              int nr = i + 1;
              int nf = i + index;
              rowData[nf] = cachedData[nr];
            }
          }
        }
      } else {
        // get all values separated by values separator
        if ( rowData[0] == null ) {
          rowData[0] = cacheValue;
        } else {
          rowData[0] = (String) rowData[0] + data.valueSeparator + cacheValue;
        }
      }
    }
  }

  return rowData;
}

Example 9

Source File: RevisedLesk.java From lesk-wsd-dsm with GNU General Public License v3.0

4 votes

private float computeLDscore(String s1, String s2) {
    float maxLength = (float) Math.max(s1.length(), s2.length());
    float ld = (float) StringUtils.getLevenshteinDistance(s1, s2);
    return 1 - ld / maxLength;
}

Example 10

Source File: NameMatcher.java From Pushjet-Android with BSD 2-Clause "Simplified" License

4 votes

/**
 * Locates the best match for the given pattern in the given set of candidate items.
 *
 * @return The match if exactly 1 match found, null if no matches or multiple matches.
 */
public String find(String pattern, Collection<String> items) {
    this.pattern = pattern;
    matches.clear();
    candidates.clear();

    if (items.contains(pattern)) {
        matches.add(pattern);
        return pattern;
    }

    if (pattern.length() == 0) {
        return null;
    }

    Pattern camelCasePattern = getPatternForName(pattern);
    Pattern normalisedCamelCasePattern = Pattern.compile(camelCasePattern.pattern(), Pattern.CASE_INSENSITIVE);
    String normalisedPattern = pattern.toUpperCase();

    Set<String> caseInsensitiveMatches = new TreeSet<String>();
    Set<String> caseSensitiveCamelCaseMatches = new TreeSet<String>();
    Set<String> caseInsensitiveCamelCaseMatches = new TreeSet<String>();

    for (String candidate : items) {
        if (candidate.equalsIgnoreCase(pattern)) {
            caseInsensitiveMatches.add(candidate);
        }
        if (camelCasePattern.matcher(candidate).matches()) {
            caseSensitiveCamelCaseMatches.add(candidate);
            continue;
        }
        if (normalisedCamelCasePattern.matcher(candidate).lookingAt()) {
            caseInsensitiveCamelCaseMatches.add(candidate);
            continue;
        }
        if (StringUtils.getLevenshteinDistance(normalisedPattern, candidate.toUpperCase()) <= Math.min(3, pattern.length() / 2)) {
            candidates.add(candidate);
        }
    }

    if (!caseInsensitiveMatches.isEmpty()) {
        matches.addAll(caseInsensitiveMatches);
    } else if (!caseSensitiveCamelCaseMatches.isEmpty()) {
        matches.addAll(caseSensitiveCamelCaseMatches);
    } else {
        matches.addAll(caseInsensitiveCamelCaseMatches);
    }

    if (matches.size() == 1) {
        return matches.first();
    }

    return null;
}

Example 11

Source File: JavaTypeQuickfixes.java From xtext-eclipse with Eclipse Public License 2.0

4 votes

protected boolean isSimilarTypeName(String s0, String s1) {
	double levenshteinDistance = StringUtils.getLevenshteinDistance(s0, s1);
	return levenshteinDistance <= 3;
}

Example 12

Source File: ShapeDistanceCollectiveAnswerScorer.java From bioasq with Apache License 2.0

4 votes

private double getDistance(String text1, String text2) {
  int distance = StringUtils.getLevenshteinDistance(text1, text2);
  return (double) distance / Math.max(text1.length(), text2.length());
}

Example 13

Source File: EditDistanceCollectiveAnswerScorer.java From bioasq with Apache License 2.0

4 votes

private double getDistance(String text1, String text2) {
  int distance = StringUtils.getLevenshteinDistance(text1, text2);
  return (double) distance / Math.max(text1.length(), text2.length());
}

Example 14

Source File: NameMatcher.java From pushfish-android with BSD 2-Clause "Simplified" License

4 votes

/**
 * Locates the best match for the given pattern in the given set of candidate items.
 *
 * @return The match if exactly 1 match found, null if no matches or multiple matches.
 */
public String find(String pattern, Collection<String> items) {
    this.pattern = pattern;
    matches.clear();
    candidates.clear();

    if (items.contains(pattern)) {
        matches.add(pattern);
        return pattern;
    }

    if (pattern.length() == 0) {
        return null;
    }

    Pattern camelCasePattern = getPatternForName(pattern);
    Pattern normalisedCamelCasePattern = Pattern.compile(camelCasePattern.pattern(), Pattern.CASE_INSENSITIVE);
    String normalisedPattern = pattern.toUpperCase();

    Set<String> caseInsensitiveMatches = new TreeSet<String>();
    Set<String> caseSensitiveCamelCaseMatches = new TreeSet<String>();
    Set<String> caseInsensitiveCamelCaseMatches = new TreeSet<String>();

    for (String candidate : items) {
        if (candidate.equalsIgnoreCase(pattern)) {
            caseInsensitiveMatches.add(candidate);
        }
        if (camelCasePattern.matcher(candidate).matches()) {
            caseSensitiveCamelCaseMatches.add(candidate);
            continue;
        }
        if (normalisedCamelCasePattern.matcher(candidate).lookingAt()) {
            caseInsensitiveCamelCaseMatches.add(candidate);
            continue;
        }
        if (StringUtils.getLevenshteinDistance(normalisedPattern, candidate.toUpperCase()) <= Math.min(3, pattern.length() / 2)) {
            candidates.add(candidate);
        }
    }

    if (!caseInsensitiveMatches.isEmpty()) {
        matches.addAll(caseInsensitiveMatches);
    } else if (!caseSensitiveCamelCaseMatches.isEmpty()) {
        matches.addAll(caseSensitiveCamelCaseMatches);
    } else {
        matches.addAll(caseInsensitiveCamelCaseMatches);
    }

    if (matches.size() == 1) {
        return matches.first();
    }

    return null;
}

Example 15

Source File: NameMatcher.java From pushfish-android with BSD 2-Clause "Simplified" License

4 votes

/**
 * Locates the best match for the given pattern in the given set of candidate items.
 *
 * @return The match if exactly 1 match found, null if no matches or multiple matches.
 */
public String find(String pattern, Collection<String> items) {
    this.pattern = pattern;
    matches.clear();
    candidates.clear();

    if (items.contains(pattern)) {
        matches.add(pattern);
        return pattern;
    }

    if (pattern.length() == 0) {
        return null;
    }

    Pattern camelCasePattern = getPatternForName(pattern);
    Pattern normalisedCamelCasePattern = Pattern.compile(camelCasePattern.pattern(), Pattern.CASE_INSENSITIVE);
    String normalisedPattern = pattern.toUpperCase();

    Set<String> caseInsensitiveMatches = new TreeSet<String>();
    Set<String> caseSensitiveCamelCaseMatches = new TreeSet<String>();
    Set<String> caseInsensitiveCamelCaseMatches = new TreeSet<String>();

    for (String candidate : items) {
        if (candidate.equalsIgnoreCase(pattern)) {
            caseInsensitiveMatches.add(candidate);
        }
        if (camelCasePattern.matcher(candidate).matches()) {
            caseSensitiveCamelCaseMatches.add(candidate);
            continue;
        }
        if (normalisedCamelCasePattern.matcher(candidate).lookingAt()) {
            caseInsensitiveCamelCaseMatches.add(candidate);
            continue;
        }
        if (StringUtils.getLevenshteinDistance(normalisedPattern, candidate.toUpperCase()) <= Math.min(3, pattern.length() / 2)) {
            candidates.add(candidate);
        }
    }

    if (!caseInsensitiveMatches.isEmpty()) {
        matches.addAll(caseInsensitiveMatches);
    } else if (!caseSensitiveCamelCaseMatches.isEmpty()) {
        matches.addAll(caseSensitiveCamelCaseMatches);
    } else {
        matches.addAll(caseInsensitiveCamelCaseMatches);
    }

    if (matches.size() == 1) {
        return matches.first();
    }

    return null;
}

Example 16

Source File: FuzzyMatch.java From hop with Apache License 2.0

4 votes

private Object[] doDistance( Object[] row ) throws HopValueException {
  // Reserve room
  Object[] rowData = buildEmptyRow();

  Iterator<Object[]> it = data.look.iterator();

  long distance = -1;

  // Object o=row[data.indexOfMainField];
  String lookupvalue = getInputRowMeta().getString( row, data.indexOfMainField );

  while ( it.hasNext() ) {
    // Get cached row data
    Object[] cachedData = it.next();
    // Key value is the first value
    String cacheValue = (String) cachedData[ 0 ];

    int cdistance = -1;
    String usecacheValue = cacheValue;
    String uselookupvalue = lookupvalue;
    if ( !meta.isCaseSensitive() ) {
      usecacheValue = cacheValue.toLowerCase();
      uselookupvalue = lookupvalue.toLowerCase();
    }

    switch ( meta.getAlgorithmType() ) {
      case FuzzyMatchMeta.OPERATION_TYPE_DAMERAU_LEVENSHTEIN:
        cdistance = Utils.getDamerauLevenshteinDistance( usecacheValue, uselookupvalue );
        break;
      case FuzzyMatchMeta.OPERATION_TYPE_NEEDLEMAN_WUNSH:
        cdistance = Math.abs( (int) new NeedlemanWunsch().score( usecacheValue, uselookupvalue ) );
        break;
      default:
        cdistance = StringUtils.getLevenshteinDistance( usecacheValue, uselookupvalue );
        break;
    }

    if ( data.minimalDistance <= cdistance && cdistance <= data.maximalDistance ) {
      if ( meta.isGetCloserValue() ) {
        if ( cdistance < distance || distance == -1 ) {
          // Get closer value
          // minimal distance
          distance = cdistance;
          int index = 0;
          rowData[ index++ ] = cacheValue;
          // Add metric value?
          if ( data.addValueFieldName ) {
            rowData[ index++ ] = distance;
          }
          // Add additional return values?
          if ( data.addAdditionalFields ) {
            for ( int i = 0; i < meta.getValue().length; i++ ) {
              int nr = i + 1;
              int nf = i + index;
              rowData[ nf ] = cachedData[ nr ];
            }
          }
        }
      } else {
        // get all values separated by values separator
        if ( rowData[ 0 ] == null ) {
          rowData[ 0 ] = cacheValue;
        } else {
          rowData[ 0 ] = (String) rowData[ 0 ] + data.valueSeparator + cacheValue;
        }
      }
    }
  }

  return rowData;
}

Example 17

Source File: SpellCheckedMetadata.java From anthelion with Apache License 2.0

3 votes

/**
 * Get the normalized name of metadata attribute name. This method tries to
 * find a well-known metadata name (one of the metadata names defined in this
 * class) that matches the specified name. The matching is error tolerent. For
 * instance,
 * <ul>
 * <li>content-type gives Content-Type</li>
 * <li>CoNtEntType gives Content-Type</li>
 * <li>ConTnTtYpe gives Content-Type</li>
 * </ul>
 * If no matching with a well-known metadata name is found, then the original
 * name is returned.
 *
 * @param name
 *          Name to normalize
 * @return normalized name
 */
public static String getNormalizedName(final String name) {
  String searched = normalize(name);
  String value = NAMES_IDX.get(searched);

  if ((value == null) && (normalized != null)) {
    int threshold = searched.length() / TRESHOLD_DIVIDER;
    for (int i = 0; i < normalized.length && value == null; i++) {
      if (StringUtils.getLevenshteinDistance(searched, normalized[i]) < threshold) {
        value = NAMES_IDX.get(normalized[i]);
      }
    }
  }
  return (value != null) ? value : name;
}

Example 18

Source File: SpellCheckedMetadata.java From nutch-htmlunit with Apache License 2.0

3 votes

/**
 * Get the normalized name of metadata attribute name. This method tries to
 * find a well-known metadata name (one of the metadata names defined in this
 * class) that matches the specified name. The matching is error tolerent. For
 * instance,
 * <ul>
 * <li>content-type gives Content-Type</li>
 * <li>CoNtEntType gives Content-Type</li>
 * <li>ConTnTtYpe gives Content-Type</li>
 * </ul>
 * If no matching with a well-known metadata name is found, then the original
 * name is returned.
 *
 * @param name
 *          Name to normalize
 * @return normalized name
 */
public static String getNormalizedName(final String name) {
  String searched = normalize(name);
  String value = NAMES_IDX.get(searched);

  if ((value == null) && (normalized != null)) {
    int threshold = searched.length() / TRESHOLD_DIVIDER;
    for (int i = 0; i < normalized.length && value == null; i++) {
      if (StringUtils.getLevenshteinDistance(searched, normalized[i]) < threshold) {
        value = NAMES_IDX.get(normalized[i]);
      }
    }
  }
  return (value != null) ? value : name;
}