Maven Central

For Scala 2.10

<dependency>
  <groupId>com.github.itspawanbhardwaj</groupId>
  <artifactId>spark-fuzzy-matching_2.10</artifactId>
  <version>1.0.0</version>
</dependency>

For Scala 2.11

<dependency>
  <groupId>com.github.itspawanbhardwaj</groupId>
  <artifactId>spark-fuzzy-matching_2.11</artifactId>
  <version>1.0.1</version>
</dependency>
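
For sbt builds, the same coordinates can be written as below. This is a sketch that only translates the Maven snippets above into sbt syntax; it assumes the artifacts resolve from Maven Central under exactly the group, artifact, and version values listed, and it spells out the Scala suffix explicitly because the two listed versions differ.

```scala
// build.sbt (pick the line that matches your Scala version)

// Scala 2.10
libraryDependencies += "com.github.itspawanbhardwaj" % "spark-fuzzy-matching_2.10" % "1.0.0"

// Scala 2.11
libraryDependencies += "com.github.itspawanbhardwaj" % "spark-fuzzy-matching_2.11" % "1.0.1"
```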

Metrics and algorithms

Functions

Example

The project contains a FuzzyMatchingJoinExample, which joins a dataset of correctly spelled titles with a dataset of misspelled titles and works as follows:

Dataset with proper names
+--------------------+--------------------+-------+
|               title|               gener|ratings|
+--------------------+--------------------+-------+
|The Shawshank Red...|        Crime. Drama|    9.3|
|       The Godfather|        Crime. Drama|    9.2|
|     The Dark Knight|Action. Crime. Drama|    9.0|
|The Godfather: Pa...|        Crime. Drama|    9.0|
|        Pulp Fiction|        Crime. Drama|    8.9|
+--------------------+--------------------+-------+
only showing top 5 rows

Dataset with misspelled names
+--------------------+----+--------+
|               title|year|duration|
+--------------------+----+--------+
|dhe Shwshnk Redem...|1994|     142|
|        dhe Godfdher|1972|     175|
|      dhe Drk Knighd|2008|     152|
|dhe Godfdher: Prd II|1974|     202|
|        Pulp Ficdion|1994|     154|
+--------------------+----+--------+
only showing top 5 rows

Dataset after fuzzy join
+--------------------+--------------------+-------+--------------------+----+--------+
|               title|               gener|ratings|               title|year|duration|
+--------------------+--------------------+-------+--------------------+----+--------+
|The Shawshank Red...|        Crime. Drama|    9.3|dhe Shwshnk Redem...|1994|     142|
|       The Godfather|        Crime. Drama|    9.2|        dhe Godfdher|1972|     175|
|     The Dark Knight|Action. Crime. Drama|    9.0|      dhe Drk Knighd|2008|     152|
|        Pulp Fiction|        Crime. Drama|    8.9|        Pulp Ficdion|1994|     154|
|    Schindler's List|Biography. Drama....|    8.9|    Schindler's Lisd|1993|     195|
+--------------------+--------------------+-------+--------------------+----+--------+
only showing top 5 rows
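
The idea behind the join above can be illustrated without Spark. The following is a minimal sketch in plain Scala, not the library's implementation: it assumes Levenshtein distance as the metric and a similarity threshold of 0.7, both chosen only for illustration. The names `FuzzyJoinSketch`, `levenshtein`, `similarity`, and `fuzzyJoin` are hypothetical and are not part of spark-fuzzy-matching.

```scala
object FuzzyJoinSketch {

  // Levenshtein edit distance (insert, delete, substitute),
  // computed row by row with dynamic programming.
  def levenshtein(a: String, b: String): Int = {
    // prev(j) = distance between the first i-1 chars of a and the first j chars of b
    var prev = Array.tabulate(b.length + 1)(identity)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(
          math.min(curr(j - 1) + 1, prev(j) + 1), // insertion, deletion
          prev(j - 1) + cost                      // substitution or match
        )
      }
      prev = curr
    }
    prev(b.length)
  }

  // Normalised similarity in [0, 1]; 1.0 means the strings are identical.
  def similarity(a: String, b: String): Double = {
    val maxLen = math.max(a.length, b.length)
    if (maxLen == 0) 1.0 else 1.0 - levenshtein(a, b).toDouble / maxLen
  }

  // Naive fuzzy join: compare every left key with every right key
  // (case-insensitively) and keep the pairs at or above the threshold.
  def fuzzyJoin(left: Seq[String], right: Seq[String], threshold: Double): Seq[(String, String)] =
    for {
      l <- left
      r <- right
      if similarity(l.toLowerCase, r.toLowerCase) >= threshold
    } yield (l, r)

  def main(args: Array[String]): Unit = {
    // Titles taken from the example output above (only the untruncated ones).
    val proper     = Seq("The Godfather", "The Dark Knight", "Pulp Fiction")
    val misspelled = Seq("dhe Godfdher", "dhe Drk Knighd", "Pulp Ficdion")

    fuzzyJoin(proper, misspelled, threshold = 0.7).foreach {
      case (l, r) => println(s"$l <-> $r")
    }
  }
}
```

This sketch compares every pair of keys, which is quadratic in the number of rows; it is meant to show the matching criterion only, not how a distributed join would be organised.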

Library used

stringmetric: string metrics and phonetic algorithms for Scala (e.g. Dice/Sorensen, Hamming, Jaccard, Jaro, Jaro-Winkler, Levenshtein, Metaphone, N-Gram, NYSIIS, Overlap, Ratcliff/Obershelp, Refined NYSIIS, Refined Soundex, Soundex, Weighted Levenshtein).