Fuzzy-Matcher

Introduction

A java-based library to match and group "similar" elements in a collection of documents.

Imagine working in a system with a collection of contacts and wanting to match and categorize contacts with similar names, addresses or other attributes. The Fuzzy Match matching algorithm can help you do this. The Fuzzy Match algorithm can even help you find duplicate contacts, or prevent your system from adding duplicates.

This library can act on any domain object, like contact, and find similarity for various use cases. It dives deep into each character and finds out the probability that 2 or more objects are similar.

What's Fuzzy

The contacts "Steven Wilson" living at "45th Avenue 5th st." and "Stephen Wilkson" living at "45th Ave 5th Street" might look like belonging to the same person. It's easy for humans to ignore the small variance in spelling in names, or ignore abbreviation used in address. But for a computer program they are not the same. The string Steven does not equals Stephen and neither does Street equals st. If our trusted computers can start looking at each character and the sequence in which they appear, it might look similar. Fuzzy matching algorithms is all about providing this level of magnification to our myopic machines.

How does this work

Breaking down your data

This algorithm accepts data in a list of entities called Document (like a contact entity in your system), which can contain 1 or more Element (like names, address, emails, etc). Internally each element is further broken down into 1 or more Token which are then matched using configurable MatchType

This combination to tokenize the data and then to match them can extract similarity in a wide variety of data types

Exact word match

Consider these Elements defined in two different Documents

With a simple tokenization process each word here can be considered a token, and if another element has the same word they are scored on the number of matching tokens. In this example the words Wayne and Grace match 2 words out of 3 total in each elements. A scoring mechanism will match them with a result of 0.67

Soundex word match

Consider these Elements in two different Documents

Here we do not just look at each word, but encode it using Soundex which gives a unique code for the phonetic spelling of the name. So in this example words Steven & Stephen will encode to S315 whereas the words Wilson & Wilkson encode to W425.

This allows both the elements to match exactly, and score at 1.0

NGram token match

In cases where breaking down the Elements in words is not feasible, we split it using NGrams. Take for examples emails

Here if we ignore the domain name and take 3 character sequence (tri-gram) of the data, tokens will look like this

Comparing these NGrams we have 7 out of the total 10 tokens match exactly which gives a score of 0.7

Nearest Neighbors match

In certain cases breaking down elements into tokens and comparing tokens is not an option. For example numeric values, like dollar amounts in a list of transactions

Here the first and third could belong to the same transaction, where the third is only missing some precession. The match is done not on tokens being equal but on the closeness (the neighborhood range) in which the values appear. This closeness is again configurable where a 99% closeness, will match them with a score of 1.0

A similar example can be thought of with Dates, where dates that are near to each other might point to the same event.

Four Stages of Fuzzy Match

Fuzzy Match

We spoke in detail on Token and MatchType which is the core of fuzzy matching, and touched upon Scoring which gives the measure of matching similar data. PreProcessing your data is a simple yet powerful mechanism that can help in starting with clean data before running a match. These 4 stages which are highly customizable can be used to tune and match a wide variety of data types

End User Configuration

All the configurable options defined above can be applied at various points in the library.

Predefined Element Types

Below is the list of predefined Element Types available with sensible defaults. These can be overridden by setters while creating an Element.

Element Type PreProcessing Function Tokenizer Function Match Type
NAME namePreprocessing() wordSoundexEncodeTokenizer() EQUALITY
TEXT removeSpecialChars() wordTokenizer() EQUALITY
ADDRESS addressPreprocessing() wordSoundexEncodeTokenizer() EQUALITY
EMAIL removeDomain() triGramTokenizer() EQUALITY
PHONE numericValue() decaGramTokenizer() EQUALITY
NUMBER numberPreprocessing() valueTokenizer() NEAREST_NEIGHBORS
DATE none() valueTokenizer() NEAREST_NEIGHBORS

Note: Since each element is unique in the way it should match, if you need to match a different element type than what is supported, please open a new GitHub Issue and the community will provide support and enhancement to this library

Document Configuration

Element Configuration

Match Service

It supports 3 ways to match the documents

matchService.applyMatchByDocId(List<Document> documents)
matchService.applyMatchByDocId(List<Document> documents, List<Document> matchWith)
matchService.applyMatchByDocId(Document document, List<Document> matchWith)

Match Results

The response of the library is essentially a Match<Document> object. It has 3 attributes

The response is grouped by the Data.key, so from any of the MatchService methods the response is map

Map<String, List<Match<Document>>>

Quick Start

Maven Import

The library is published to maven central

<dependency>
    <groupId>com.intuit.fuzzymatcher</groupId>
    <artifactId>fuzzy-matcher</artifactId>
    <version>1.0.3</version>
</dependency>

Input

This library takes a collection of Document objects with various Elements as input.

For example, if you have a multiple contacts as a simple String Arrays

String[][] input = {
        {"1", "Steven Wilson", "45th Avenue 5th st."},
        {"2", "John Doe", "546 freeman ave"},
        {"3", "Stephen Wilkson", "45th Ave 5th Street"}
};

Convert them into List of Document

List<Document> documentList = Arrays.asList(input).stream().map(contact -> {
    return new Document.Builder(contact[0])
            .addElement(new Element.Builder<String>().setValue(contact[1]).setType(NAME).createElement())
            .addElement(new Element.Builder<String>().setValue(contact[2]).setType(ADDRESS).createElement())
            .createDocument();
}).collect(Collectors.toList());

Applying the Match

The entry point for running this program is through MatchService class. Create a new instance of Match service, and use applyMatch methods to find matches

MatchService matchService = new MatchService();
Map<String, List<Match<Document>>> result = matchService.applyMatchByDocId(documentList);

Output

This prints the result to console. This should show a match between the 1st and 3rd document, but not the 2nd.

result.entrySet().forEach(entry -> {
    entry.getValue().forEach(match -> {
        System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
    });
});

Performance

For most real life data-sets, the size of the data I am sure is not as simple as shown in the examples.
Since this library can be used to match elements against a large set of records, knowing how it performs is essential.

The performance characteristics varies primarily on MatchType being used

The following chart shows the performance characteristics of this library as the number of elements increase. As you can see, the library maintains a near-linear performance and can match thousands of elements within seconds on a multi-core processor.

Perf