
Making similarity functions and phonetic algorithms readily available for fuzzy matching analyses in Spark.

Project Setup

Update your build.sbt file to add the library dependencies.

libraryDependencies += "org.apache.commons" % "commons-text" % "1.1"
libraryDependencies += "mrpowers" % "spark-stringmetric" % "2.2.0_0.1.0"


How to import the similarity functions.

import com.github.mrpowers.spark.stringmetric.SimilarityFunctions._

Here's an example of how to use the jaccard_similarity function.

Suppose we have the following sourceDF:

|  word1|  word2|
|  night|  nacht|
|context|contact|
|   null|  nacht|
|   null|   null|

Let's run the jaccard_similarity function.

val actualDF = sourceDF.withColumn(
  "w1_w2_jaccard",
  jaccard_similarity(col("word1"), col("word2"))
)

We can run actualDF.show() to view the w1_w2_jaccard column that's been appended to the DataFrame.

|  word1|  word2|w1_w2_jaccard|
|  night|  nacht|         0.43|
|context|contact|         0.57|
|   null|  nacht|         null|
|   null|   null|         null|
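
The scores in the table can be reproduced by hand. The sketch below is plain Scala with no Spark dependency, and it assumes the score is the Jaccard index over the sets of distinct characters in each word, rounded to two decimals. The JaccardSketch object and its jaccard method are hypothetical names used only for illustration; they are not part of the library, and the handling of null and empty input is an assumption chosen to mirror the null rows above.

```scala
object JaccardSketch {
  // Character-set Jaccard index: |A intersect B| / |A union B|, rounded to two decimals.
  // Returning None for a null input is an assumption that mirrors the null rows in the table.
  def jaccard(left: String, right: String): Option[Double] =
    for {
      l <- Option(left)
      r <- Option(right)
    } yield {
      val a = l.toSet // distinct characters of the left word
      val b = r.toSet // distinct characters of the right word
      val union = a union b
      if (union.isEmpty) 0.0 // assumption: two empty strings score 0
      else math.round(100.0 * (a intersect b).size / union.size) / 100.0
    }
}
```

For night and nacht the shared characters are n, h and t, and there are 7 distinct characters overall, so the score is 3 / 7, which rounds to 0.43. For context and contact it is 4 / 7, which rounds to 0.57.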


How to import the phonetic algorithm functions.

import com.github.mrpowers.spark.stringmetric.PhoneticAlgorithms._

Here's an example of how to use the refined_soundex function.

Suppose we have the following sourceDF:

|word1|
|night|
|  cat|
| null|

Let's run the refined_soundex function.

val actualDF = sourceDF.withColumn(
  "word1_refined_soundex", refined_soundex(col("word1")))

We can run actualDF.show() to view the word1_refined_soundex column that's been appended to the DataFrame.

|word1|word1_refined_soundex|
|night|               N80406|
|  cat|                 C306|
| null|                 null|
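
To see where a code such as N80406 comes from, here is a rough plain-Scala sketch of a refined Soundex encoding. It assumes the usual US English letter-to-digit table (believed to match the one in Apache Commons Codec, but check against the library before relying on it) and keeps the first letter followed by the code of every letter, skipping a code that repeats the one immediately before it. RefinedSoundexSketch and encode are hypothetical names for illustration only, not the library's API.

```scala
object RefinedSoundexSketch {
  // Digit assigned to each letter A..Z (assumed US English refined Soundex table).
  private val mapping = "01360240043788015936020505"

  // Returns None for null input, which is an assumption mirroring the null row above.
  def encode(word: String): Option[String] =
    Option(word).map { w =>
      // Keep ASCII letters only, upper-cased.
      val letters = w.toUpperCase.filter(c => c >= 'A' && c <= 'Z')
      if (letters.isEmpty) ""
      else {
        // Code for every letter, including the first one.
        val codes = letters.map(c => mapping(c - 'A'))
        // Drop a code when it repeats the code immediately before it.
        val collapsed = codes.zipWithIndex.collect {
          case (code, i) if i == 0 || codes(i - 1) != code => code
        }
        letters.head.toString + collapsed.mkString
      }
    }
}
```

With this table, night maps to N followed by 8, 0, 4, 0, 6, and cat maps to C followed by 3, 0, 6, which agrees with the values shown in the table.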


To make a SNAPSHOT release, update publishVersion in the Mill build file to something like this:

def publishVersion = s"0.3.0_spark${binaryVersion(crossSparkVersion)}-SNAPSHOT"

Then publish with Mill, substituting your own Sonatype credentials:

mill mill.scalalib.PublishModule/publishAll --sonatypeCreds "username:password" --publishArtifacts __.publishArtifacts --release false