50x Faster Fuzzy Matching

Stop using Levenshtein distance; use cosine similarity on character n-gram vectors

Luis Garcia Fuentes
6 min read · Oct 11, 2022

FuzzyWuzzy background and its issues

A common practical data science problem is the pairing of names that represent the same object but are written differently. This is often referred to as approximate string matching or fuzzy matching.

A common method for pairing these names is to calculate the Levenshtein distance between a name and every candidate it could be paired with. The “distance” grows with the number of changes a string must undergo to become the other, where a change can be the substitution, insertion, or deletion of a character.

https://devopedia.org/levenshtein-distance

Different implementations exist, but ultimately they all rely on a loop-like architecture: at each step a candidate change is applied and counted, and the running tallies of competing edit sequences are compared to find the cheapest one.
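To make that loop-heavy structure concrete, here is a minimal sketch of the classic dynamic-programming formulation of Levenshtein distance; it illustrates the idea rather than the exact code any particular library ships:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance.

    Two nested Python loops fill a table of edit costs, which is
    exactly the per-character work that becomes slow when comparing
    millions of name pairs.
    """
    previous = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            current[j] = min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            )
        previous = current
    return previous[-1]


print(levenshtein("food dog", "fod dog"))  # 1: a single missing "o"
```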

You may already know that loops are inefficient in data science code, which is why we prefer vectorized operations. For this reason, despite its popular use, FuzzyWuzzy, whose scoring is computed via loops, should be avoided when pair matching at scale.
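For reference, this is roughly what the FuzzyWuzzy workflow looks like; the company names below are made-up examples:

```python
from fuzzywuzzy import fuzz, process

choices = ["Apple Inc.", "Alphabet Inc.", "Amazon.com, Inc."]

# Score a single pair with a Levenshtein-based ratio (0-100).
print(fuzz.ratio("Apple Incorporated", "Apple Inc."))

# Loop over every candidate to find the best match for one query.
print(process.extractOne("Aplle Inc", choices))
```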

Cosine similarity over text vectors

Representing strings as numbers

Before proposing the implementation of cosine similarity, we need to remember how strings can be represented as vectors.

For clarity, names (product names, person names, company names, etc.) will be referred to as documents. This is common NLP practice.

If you have a pd.Series containing a list of names, you can build a list of all the words they use. This list is referred to as a “bag of words” and contains every word appearing across your documents.

When we use each word in the bag of words as a separate column and let each row be a document, we can create a matrix that pairs every document with every word in the vocabulary.

A diagram to support understanding of the word count matrix for those unfamiliar

If at each intersection of the document and word we mark the number of times the word in question (determined by column) appears in the document in question (determined by row) we obtain a word count matrix.

In this way, the documents “the red dog”, “cat eats dog”, “dog eats food” and “red cat eats” can be represented in a word count matrix as shown below:
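The matrix was originally shown as an image; the sketch below rebuilds it with scikit-learn's CountVectorizer, which is one of several ways to construct it:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the red dog", "cat eats dog", "dog eats food", "red cat eats"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Rows are documents, columns are the words of the vocabulary.
matrix = pd.DataFrame(
    counts.toarray(),
    index=docs,
    columns=vectorizer.get_feature_names_out(),
)
print(matrix)
#                cat  dog  eats  food  red  the
# the red dog      0    1     0     0    1    1
# cat eats dog     1    1     1     0    0    0
# dog eats food    0    1     1     1    0    0
# red cat eats     1    0     1     0    1    0
```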

At this point, we have managed to represent our documents in terms of numbers. While we are not quite done yet, you should already get a feeling that this method will be more computationally efficient than one leveraging the Levenshtein distance, as computers are extremely fast at processing numbers.

Representing strings as vectors

It turns out that if we break our table down into rows, we have a vector representation for each of our documents (transpose them if you want to follow the column-vector notation convention).

Given that our example so far has a vocabulary of size 6, the vectors created are of 6 dimensions. Notice how each vector's contents are a mirror of the contents in the word count matrix.

Before moving forward we need to remember that vectors can also be represented in space. For example, the vector [-2 3] implies that we are moving -2 on our x-axis and 3 on our y-axis.

https://www.youtube.com/watch?v=fNk_zzaMoSs&ab_channel=3Blue1Brown

This graphical representation will allow us to better understand how two documents can be closer in space than another pair of documents, as we can now “plot” documents in space in the form of vectors.

By plotting documents in space via their vectors, we can measure how distant these documents are from each other based on the angle their vectors create at the origin.

Let's assume our vocabulary only includes the words “Red”, “Dog” and “Food”. In that case, the document “Food Dog” is represented by a vector that moves 0 units in the “Red” direction, 1 unit in the “Dog” direction, and 1 unit in the “Food” direction.

The corresponding diagram has 3 dimensions, where each dimension represents a word.

Once we have plotted the documents in our vector space (“Red Dog” and “Food Dog” in our example), we can calculate how far these are from each other based on the angle created by the documents’ vectors. In particular, we measure their similarity using the cosine of their angle θ (theta).

Cos(θ) is computationally cheap because it can be calculated as the dot product of the two vectors divided by the product of their magnitudes. You don't have to understand the details of this calculation; just know that computers are really good at matrix algebra, and hence this still beats the loop implementation.
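As a quick sanity check, here is that formula applied to the two example vectors, with the dimensions ordered Red, Dog, Food as above:

```python
import numpy as np

red_dog = np.array([1, 1, 0])   # "Red Dog"
food_dog = np.array([0, 1, 1])  # "Food Dog"

# cos(theta) = (A . B) / (||A|| * ||B||)
cosine = red_dog @ food_dog / (np.linalg.norm(red_dog) * np.linalg.norm(food_dog))
print(round(cosine, 2))  # 0.5, because the two documents share only the "Dog" axis
```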

However, we have yet to build an intuition for how to determine which documents are good matches and which are not. We can formalize this intuition by assuming that a new document appears: “medicine medicine”.

The document “medicine medicine” has a vector that travels 2 units in the “Medicine” dimension and zero units in the “Red”, “Food” and “Dog” dimensions.

The angle this new vector forms with either of the existing vectors is clearly larger than the angle between “Red Dog” and “Food Dog”. For this reason, cosine similarity will suggest that “Red Dog” and “Food Dog” are similar names, while “medicine medicine” is not similar to either.
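A short sketch of that comparison, adding a fourth “Medicine” dimension to the illustrative vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Dimensions: Red, Dog, Food, Medicine
vectors = np.array([
    [1, 1, 0, 0],  # "Red Dog"
    [0, 1, 1, 0],  # "Food Dog"
    [0, 0, 0, 2],  # "medicine medicine"
])

print(cosine_similarity(vectors).round(2))
# "Red Dog" vs "Food Dog" -> 0.5, while "medicine medicine" scores 0 against both
```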

An additional benefit of the cosine similarity metric is that its values map onto our mental framework of correlation: cos(180°) is -1 while cos(0°) is 1.

Cosine similarity over character n-grams

You may have noticed that the solution proposed so far does not really account for typos, different valid ways of spelling a name, or other differences that Levenshtein distance would have captured.

However, we can account for this by breaking our documents into character n-grams as opposed to words, meaning we create a “bag of words” using equal-length substrings of the document.

For example, the character n-grams with n = 2 for “Food Dog” (sliding a two-character window over each word) are [Fo, oo, od, Do, og].

We can apply the same steps explained so far to a bag of words composed of character n-grams as opposed to whole words, and we will capture some of the differences caused by typos and spelling errors.
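As a sketch of this idea, scikit-learn's vectorizers can tokenize by character n-grams directly, which keeps a misspelled name close to the correct one; the misspelling below is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Food Dog", "Fod Dog", "Red Cat"]

# analyzer="char_wb" builds character n-grams within word boundaries.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2))
X = vectorizer.fit_transform(docs)

print(cosine_similarity(X).round(2))
# "Food Dog" and "Fod Dog" remain highly similar despite the typo,
# while "Red Cat" shares almost no character 2-grams with either.
```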

Additionally, while this method may not be as accurate as the Levenshtein distance method, it is less likely to “over-fit” to commonly occurring words across your documents.

Python Implementation
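Before the external resources, here is a minimal end-to-end sketch of the approach. It assumes TF-IDF weighting over character n-grams (one reasonable choice among several), and the name lists are made up for illustration:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

messy = pd.Series(["Aplle Inc", "Amazon com", "Alphabet"])
clean = pd.Series(["Apple Inc.", "Alphabet Inc.", "Amazon.com, Inc."])

# Fit one character n-gram vocabulary over both name lists so that
# their vectors live in the same space.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectorizer.fit(pd.concat([messy, clean]))

# One matrix multiplication gives every messy-vs-clean similarity at once.
similarities = cosine_similarity(
    vectorizer.transform(messy), vectorizer.transform(clean)
)

# For each messy name, keep the clean name with the highest cosine similarity.
best = similarities.argmax(axis=1)
matches = pd.DataFrame({
    "messy": messy,
    "best_match": clean.iloc[best].values,
    "score": similarities.max(axis=1).round(2),
})
print(matches)
```

For very large name lists, the full similarity matrix is usually replaced by a sparse top-k matrix multiplication, but the idea stays the same.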

For a detailed description please visit Mala Deep’s blog:

To read the original paper that proposed this methodology:

https://www.researchgate.net/publication/323901766_Identifiying_Similar_Sentences_by_Using_N-Grams_of_Characters
