Levenshtein Distance


Introduction, example, and algorithm draft

If a word or a company name is misspelled, how to measure the distance with its reference?

For instance, how to build one mathematic model, amongst many others, determining that Goggle is closer to Google than to Amazon, Facebook, or Apple?

The Levenshtein distance will be treated with 3 posts:

  • Introduction of the Levenshtein distance definition, an example called « GAFA », and starting to explain the classical algorithm required for calculating the distance [this post]
  • The full algorithm and the full definition of the Levenshtein distance
  • Using R programming for solving the « GAFA » example with Levenshtein distance

A short definition of the Levenshtein distance

The Levenshtein distance measure the distance, as a metric, between two sequences (words, character strings…).

The result is provided as a whole number.

Indeed, the distance is the minimum number of single-characters edits, for going from one sequence to the other sequence, through 3 tools:

  • Inserting a character
  • Deleting a character
  • Substituting a character

A very simple example: the distance between car and cat is 1, as you only need to substitute the last letter of car, from r to t, in order to retrieve cat


The GAFA example

Consider we have received a Google Form survey result, where people were asked to enter their favorite GAFA, meaning choosing between Google, Apple, Facebook, and Amazon.

In an ideal scenario, a drop-down list is created, hereunder a survey creation example with a Google Form.

With this ideal form, no mistake is possible for the user:

But if no validation has been set up (due to mistake or because this functionality is unknow) in the form configuration, and if the answer is a free short text.

Then, misspelling is possible.

The failed-soft form proposed to the user:

Some examples of misspelling :

  • Gogglle
  • Google
  • Gougeule
  • Googe
  • Amazon
  • Amzone
  • Firstbok
  • Alpes
  • Aple

Remark: for this entire post, we won’t consider case-sensitive:

‘A’ is considered the same as ‘a’, ‘B’ is considered the same as ‘b’…

The question: after the result of the survey has been received, how to guess the relevant answer in case of misspelling?

Answer in the survey…… was probably meaning
GogglleGoogle
GoogleGoogle
GougeuleGoogle
GoogeGoogle
AmazonAmazon
AmzoneAmazon
FirstbokFacebook
AlpesApple
ApleApple

An intuitive application of the Levenshtein distance

Is the sequence » Gogglle » closer than the sequences « Google », « Apple », « Facebook », or « Apple »?

For a human, the answer is reasonably « Gogglle » is closer to « Google ». But how could a computer answer to that? It’s why we introduce the Levenshtein distance, which is one method amongst many.

Remember we have these hereunder 3 tools, and apply them:

  • Inserting a single-character
  • Deleting a single-character
  • Substituting a single-character
From SequenceTo SequenceSuccession of intuitive operations requiredIntuitive operations total
goggllegoogleSubstitute the third letter of gogglle: the result is googlle Delete the fifth letter of googlle: the result is google2
gogglleappleDelete the first letter of goggle: the result is ogglle Delete the first letter of ogglle: the result is gglle
Substitute the first letter of gglle: the result is aglle Substitute the second letter of aglle: the result is aplle Substitute the first third of aplle: the result is apple
5
goggllefacebookYou can now apply an intuitive method on your own… 
gogglleamazonYou can now apply an intuitive method on your own… 

We keep in mind:

  • The intuitive method provides a metric, as a whole number
  • The closer the distance, the less the metric

In this example, we note that the distance “gogglle-google” is 2 and lower than the distance “gogglle-apple” which is 5. We can reasonnably deduce that the surveyed person who has entered “gogglle” was meaning “google” and not “apple”.

From now, “gogglle-google” will be our own official notation for the distance between « gogglle » and « google »

With complex sequences, the intuitive application isn’t enough. There might be some mistake in the previous calculation. It’s why we need a realiable method.

Draft of a reliable algorithm calculating the Levenshtein distance

In order to follow a learning curve, we will grope during the assembling of this method.


Introducing the matrix shape into the method

Let’s take an example pointing out a pitfall in the intuitive distance calculation: what is the distance between “eval” and “valley”?

A very quick intuitive operation would be to apply:

FromToSuccession of intuitive operations requiredIntuitive operations total
evalvalleySubsitution of first letter of eval: result is vval
Subsitution of second letter of vval: result is vaal
Subsitution of third letter of vaal: result is vall
Insertion letter at the end of vall: result is valle
Insertion letter at the end of valle: result is valley
5

A more rigorous intuitive operation would be:

FromToIntuitive and rigorous operationIntuitive operations total
evalvalleyRemoving letter at the beginning of eval: the result is val
Inserting letter at the end of val: the result is vall
Insertingletter at the end of vall: the result is valle
Inserting letter at the end of valle: the result is valley
4

The trick is: the 3 tools shouldn’t be applied systematically by starting at both words. We have to compare every substring from the first word (beginning, between, end) to every substring to the second word(beginning, between, end), in order to find the right approach.

There are two consequences, with 4 impacts:

ConsequenceImpact of the consequence
The optimal distance can use subtrings anywhere (beginning, starting at second index, starting at third…) in the first word, and anywhere (beginning, starting at second index, starting at third…) in the second word: all these combinations must be checkWe will use a matrix shape: the abciss will reflect the first word, and the ordinate the second word
It is possible to break down the problem by removing the last character of the first sequence and removing the last character of the last sequence, and then treating the impact of adding one characters for each sequenceWe will use recursivity: we will compare the distance between two words, with the distance of these two words without their last letter

The matrix shape

A matrix will be applied as hereunder, for an example of finding the distance between “lalab” and “labo”. The global benefits are described hereunder


Intuitive application of the matrix shape

Remark: we start with an intuitive method, in order to get a progressive learning curve. You will state afterward, this is not very effective for the distance calculation. In the next post, we will learn to fill the matrix shape with the full-logic classical algorithm

Firstly, we decide to establish the distance between “lalalab” and « l », meaning to fill the first line. How to do it intuitively?

 LALAB
L?????
A     
B     
O     

From L to L, the distance is 0, because no modification is required:

 LALAB
L0????
A     
B     
O     

From “L” to “LA”, the distance is 1, because after “L” you insert “A” and get “LA”:

 LALAB
L01???
A     
B     
O     

From “L” to “LAL”, the distance is 2, because after “L” you insert “A” and “L”, and get “LAL”:

 LALAB
L012??
A     
B     
O     

From “L” to “LALA”, the distance is 3, because after “L” you insert “A” “L” and “A” and get “LALA”:

 LALAB
L0123?
A    
B     
O     

From “L” to “LALAB”, the distance is 24, because after “L” you insert “A” “L” “A” and “B” and get “LALAB”:

 LALAB
L01234
A     
B     
O     

Now, let’s fill the second line:

 LALAB
L01234
A?????
B     
O     

From “L” to “LA”, the distance is 1, because from “L” you add “A”, and get “LA”:

 LALAB
L01234
A1????
B     
O     

From “LA” to “LA”, the distance is 0, because these are the same strings:

 LALAB
L01234
A10???
B     
O     

From “LAL” to “LA”, the distance is 1, because from “LA” you add “L”, and get “LAL”:

 LALAB
L01234
A101??
B     
O     

From “LALA” to “LA”, the distance is 2, because from “LA” you add “L” “A”, and get “LALA”:

 LALAB
L01234
A1012?
B    
O     

From “LALAB” to “LA”, the distance is 3, because from “LA” you add “L” “A” “B”, and get “LALAB”:

 LALAB
L01234
A10123
B     
O     

Summary of the Levenshtein matrix

With the matrix, we can guess that we are covering every substring from the first word and every substring from the second word. Unhappily, the intuitive method will become inaccurate for the calculation of the difference. As a sample, “LABO” vs. “LALAB”. How to improve this algorithm draft?


The recursivity properties

Recursivity with same last letters

Let’s focus at this example:

 LALAB
L01234
A1????
B     
O     

The focus plot:

The recap:

  • For example, we focus on the distance « LA-LA »
  • It is the distance « L-L » + « A-A »
  • We note that “A” and « A » are the same letters, hence the distance « A-A » is equal to 0
  • Hence the distance from the cell [line “LA”; row “LA”] comes the cell [line “L”; row “L”] which is its up-left cell

If the letters are the same, we fetch the value from the upper left cell, which is the distance of the substrings without this same letter at the end:

Sum property with different last letters

Let’s focus at this example:

 LALAB
L01234
A1012?
B     

You notice the last letters are different.

The focus plot:

A recap:

  • For the example “LALAB-LA
  • We have 3 candidates as the minimum distance:
    • the distance « LALAB-L », plus the cost of « L » receiving at its end the insertion of « A« . Concretely the value of the cell [line “LALAB”; row “L”], which is the up cell, plus 1 which is the cost of insertion, is the first candidate for the “LALAB-LA” distance
    • the distance « LALA-LA« , plus the cost of « LALA » receiving at its end the insertion of « B« . Concretely the value of the cell [line “LALA”; row “LA”], which is the left cell, plus 1 which is the cost of insertion, is the second candidate for the “LALAB-LA” distance
    • the distance « LALA-L », plus the cost of subtituing « A » and « B« . Concretely the value of the cell [line “LALA”; row “L”], which is the up-left cell, plus 1 which is the cost of subtitution, is the first candidate for the “LALAB-LA” distance

Amonst these three candidates, how to choose the good one, with the minimum distance?

The distance, is by convention, and by practicality, the shortest path, hence the candidate with the lowest distance is the winner:

Conclusion of the Introduction, example, and algorithm draft

We have seen all the tools that will define the full algorithm of the Levenshtein calculation.

In the next post, we will fully detail and apply this algorithm.

In the last post, we will solve the GAFA example with R programming.


Application of the algorithm

Our example is to measure the Levenshtein distance between the word LABO and the company word LALAB, discovered in the previous post. They have been chosen, because it is difficult to intuitively guess their distance, due to common sub-strings.

Explanation of abscissa and ordinate:

LALAB is arbitrarily chosen as the abscissa word, as it will be written horizontally in the matrix

LABO is arbitrarily chosen as the ordinate word as it will be written vertically in the matrix

Remark: the Levenshtein distance is a commutative operation. It means the distance between LALAB and LABO is the same as the distance between LABO and LALAB. Hence, the abscissa and the ordinate could have been switched, and the result would have been the same.


Start of the classical algorithm example

Design a matrix with “LABO” as ordinate and “LALALAB” as abcissa:

Add an initial row “empty string” and fill it with a numeric incremental sequence starting with 0 in front of the first offset of the abscissa word:

Add an initial column “empty string” and fill it with a numeric incremental sequence starting with 0 in front of the first offset of the ordinate word:


Let’s proceed with the first row, in front of the first offset of the ordinate word

Rule

When the characters in ordinate and in abscissa are the same:

The value is inherited from the top-left cell

For the first cell, as the letters are the same, the ?  value comes from the top-left cell:

The result is:

Rule

When the characters in ordinate and in abscissa are different:

The value is inherited from the minimum of:

– Top cell

– Top left cell

– Left cell

Then add 1 to the result

Now the second offset of the abscissa word:

The result is:

Now the third offset of the abscissa word:

The result:

Now the fourth offset of the abscissa word:

The result:

Now the fifth offset of the abscissa word:

The result:


Let’s continue with the second row of the ordinate word:

First column:

Second column:

Third Column:

Fourth column:

Fifth column:


Let’s continue with the third row of the ordinate word:


Let’s continue with the fourth and last row of the ordinate word:


Analyzing the result of the applied algorithm


First information: what is the Levenshtein distance?

It is in the bottom-right cell: the distance is 3


Second information:  how to build the path leading from LABO to LALAB?

Summary of the applied algorithm

All possible solutions are found by building the path from the top-left cell to the bottom-right cell, cell by cell, with the following rules:

  • each step must be going right, bottom, or diagonally bottom-right
  • at each step, increase the distance, or at least keep it constant
  • bottom-right diagonal has priority

Hereunder one of the possible optimum paths:


Application of the path to our example: explain how to go from LABO to LALAB

Reminder:

The three possibles tools with the Levenshtein distance are:

Insertion of character

Deletion of character

Substitution of character

Let’s start with the string LABO.

The hereunder red arrow indicates:

  • the distance has not changed
  • the ordinate & abscissa destination letters are the same
  • hence the first letter is kept
  • LABO has not changed: the result is still LABO

The hereunder red arrow indicates:

  • the distance has not changed
  • the ordinate & abscissa destination letters are the same
  • hence the second letter is kept
  • LABO has not changed: the result is still LABO

The hereunder red arrow indicates:

  • the distance has increased: +1
  • the ordinate & abscissa destination letters are different
  • this is a « go right » arrow
  • hence L has been inserted at the third offset
  • LABO has changed to LALBO

The hereunder red arrow indicates:

  • the distance has increased: +1
  • the ordinate & abscissa destination letters are different
  • this is a « go down & right » arrow
  • hence B has been replaced by A at the fourth offset
  • LALBO has changed to LALAO

The hereunder red arrow indicates:

  • the distance has increased: +1
  • the ordinate & abscissa destination letters are different
  • this is a « go down & right » arrow
  • hence O has been replaced by B at the fourth offset
  • LALAO has changed to LALAB

Rules for building the path

We just have studied the example. Now let’s describe it in generic terms.

First, we note there are three possible « steps »:

Finally, the path rules can be summarized into 4 scenarios:

Scenario
characteristic :
Scenario characteristic :  Result
Are the destination abscissa character and the destination ordinate character?Type of arrow?The operation applied to the current offset of the last version of the ordinate sequence
SameNo modification
DifferentAfter the origin ordinate offset, insertion of the offset from the abscissa destination offset
DifferentAt the origin ordinate offset, substitution with the destination from the abscissa offset
DifferentDeletion of the origin ordinate offset

The Levenshtein distance definition

Let’s define what has just been applied.

The distance between a sequence A and a sequence B is:

In this caseThen the distance isExplanation
If sequence A is emptyThen the distance is the length of sequence BBecause you remove successively all offsets of sequence B in order to transform it as empty
If sequence B is emptyThen the distance is the length of sequence ABecause you remove successively all offsets of sequence A in order to transform it as empty
If the last offset of B and the last offset of A are the sameThen the distance is the same as “A minus its last offset” with “B minus its last offset”Because no transformation is needed
In all other casesThen the distance is the minimum of these 3 distances:
– “A minus its last offset” with “B”
– “A” with “B minus its last offset”
– “A minus its last offset” with  “B minus its last offset”





And then add 1
Break down the problem and study these 3 scenarios; remove one character to the last offset of :
– sequence A but not sequence B
– sequence B but not sequence A
– both sequences

For each of the 3 scenarios, the simplification has reduced the distance by 1.

Between the 3 scenarios, we choose the one with the minimum value, as the distance is defined by the shortest patch

Indeed, a mathematical definition is described in Wikipedia:

The Levenshtein distance between two strings a,b (of length |a| and |b| respectively) is given by lev(a,b) where

where the tail of some string x is a string of all but the first character of x, and x[n] is the n nth character of the string x, starting with character 0.

https://en.wikipedia.org/wiki/Levenshtein_distance#Definition

Remark:

  • The easy definition is based on « head », meaning applying recursivity on the string minus the last character.
  • The Wikipedia definition is based on « tail », meaning applying recursivity on the string minus the first character.
  • Due to recursivity property, this is basically the same process.

Conclusion of the application of the algorithm

We were able to determine the distance between the strings LABO and LALAB. Also, we have explained how to convert step-by-step LABO to LALAB with the Levenshtein tools.

The next post will detail the calculation of Levenshtein distances with R programming between:

  • LABO and LALAB, the example of this post
  • And to determine from the examples, for each of the hereunder misspellings, between GOOGLE, APPLE, FACEBOOK, and AMAZON, which one is the most probable answer according to the Levenshtein distance:
    • Gogglle
    • Google
    • Gougeule
    • Googe
    • Amazon
    • Amzone
    • Firstbok
    • Alpes
    • Aple

R Language programming example

We will apply R programming and recalculate this distance between LALO and LALAB. Moreover, we want to answer to the introduction: according to the Levenshtein metric, were the hereunder words meant to indicate GOOGLE, AMAZON, FACEBOOK, or APPLE?

  • Gogglle
  • Google
  • Gougeule
  • Googe
  • Amazon
  • Amzone
  • Firstbok
  • Alpes
  • Aple

Choice of the R package for calculating the Levenshtein distance, and the first example

The classical algorithm which solves the Levenshtein distance is analytic: it means it will provide the exact solution. For example, the distance between LABO and LALAB is known and is exactly 3.

Indeed, according to some other data science algorithms, the Levenshtein one is not complex.

The function stringdist from the package stringdist is arbitrarily chosen, and will very soon be explained.

The hereunder example provides the Levenshtein distance between the word « LABO » and « LALAB ». Please note the function stringdist can calculate many types of distance, hence the parameter method = ‘lv’ is added

First, we need to declare our working environment

```{r}
# We call the package "stringdist" in order to calculate the Levenshtein distances
library(stringdist)
# A small test with the library
test_distance <- stringdist("LABO", "LALAB", method = "lv")
# Let's display the distance and check if it is 3 according to the previous post
test_distance
```
This image has an empty alt attribute; its file name is image.png
The result is as expected

Solving the whole list with R programming

We are ready to solve the entire list:

  • Gogglle
  • Google
  • Gougeule
  • Googe
  • Amazon
  • Amzone
  • Firstbok
  • Alpes
  • Aple
```{r}
# we declare data.table as usual
library(data.table)
#This vector is the list of words from the survey: the aim is to determine if
# they are closer to "Google", "Apple", "Facebook" or "Amazon"
# according to the Levenshtein distance
WORD <- c(
'Gogglle',
'Google',
'Gougeule',
'Googe',
'Amazon',
'Amzone',
'Firstbok',
'Alpes',
'Aple'
)
# We store the words in a data table
dt_survey <- data.table(WORD)
# Remember we have decided to be case-unsensitive
dt_survey <- dt_survey[ , WORD_LOWER_CASE := tolower(WORD) ]
head(dt_survey)
```
This image has an empty alt attribute; its file name is image-1.png
The list has been prepared

Now we calculate the distance of each word to each possible closer word « apple », « google », « amazon » and « facebook »

```{r}
# We keep dt_survey in a safe area, and create dt_survey2 as working area
dt_survey2 <- dt_survey
# Calculating the distance between the answer with Google company
dt_survey2 <- dt_survey2[, DIST_GOOGLE := stringdist(ANSWER, "google", method = "lv")]
# Calculating the distance between the answer with Apple company
dt_survey2 <- dt_survey2[, DIST_APPLE := stringdist(ANSWER, "apple", method = "lv")]
# Calculating the distance between the answer with Google company
dt_survey2 <- dt_survey2[, DIST_FACEBOOK := stringdist(ANSWER, "facebook", method = "lv")]
# Calculating the distance between the answer with Apple company
dt_survey2 <- dt_survey2[, DIST_AMAZON := stringdist(ANSWER, "amazon", method = "lv")]
head(dt_survey2)
```
This image has an empty alt attribute; its file name is image-2.png
For example, with the first row:
The Levenshtein distance between gogglle and google is 2
The Levenshtein distance between gogglle and apple is 5
The Levenshtein distance between gogglle and facebook is 8
The Levenshtein distance between gogglle and amazon is 7
The result in an intermediary data.table of the first rows

It is time for comparing the shortest distance for each word:

```{r}
# We keep dt_survey2 in a safe area, and create dt_survey3 as working area
dt_survey3 <- copy(dt_survey2)
# Find the minimum distance between the answer and the four company names
dt_survey3 <- dt_survey3[, MIN_DIST := pmin( DIST_GOOGLE, DIST_APPLE, DIST_FACEBOOK, DIST_AMAZON ) ]
# Populate the winner(s) for each possible scenario
dt_survey3 <- dt_survey3[, WINNER_GOOGLE := ifelse (DIST_GOOGLE == MIN_DIST, "google" ,"") ]
dt_survey3 <- dt_survey3[, WINNER_APPLE := ifelse (DIST_APPLE == MIN_DIST, "apple" ,"") ]
dt_survey3 <- dt_survey3[, WINNER_FACEBOOK := ifelse (DIST_FACEBOOK == MIN_DIST, "facebook" ,"") ]
dt_survey3 <- dt_survey3[, WINNER_AMAZON := ifelse (DIST_AMAZON == MIN_DIST, "amazon" ,"") ]
## Summarize the winner(s) in one single column
dt_survey3 <- dt_survey3[, CLOSEST_GAFA := paste(WINNER_GOOGLE, WINNER_APPLE, WINNER_FACEBOOK, WINNER_AMAZON, sep = " " )]
head(dt_survey3)
```
This image has an empty alt attribute; its file name is image-4-1024x141.png

Explanation with the first row, which initial WORD is « gogglle »:
The distances, respectively with google, apple, facebook, and amazon are: 2, 5, 8, 7
Hence the minimum distance MIN_DIST is 2
This MIN_DIST is compared to the 4 distances: 2, 5, 8,7
If equal, it means it is the closest distance
For instance:
MIN_DIST is equal to DIST_GOOGLE: hence WINNER_GOOGLE is populated with « google »
MIN_DIST is not equal to DIST_APPLE: hence WINNER_APPLE is empty
MIN_DIST is not equal to DIST_FACEBOOK: hence WINNER_FACEBOOK is empty
MIN_DIST is not equal to DIST_AMAZON: hence WINNER_AMAZON is empty
CLOSEST_GAFA is filled with the concatenation of WINNER_GOOGLE, WINNER_FACEBOOK, WINNER_AMAZON, WINNER_AMAZON, with « space » as separator: the result is « google »
Note, a tie would be taken into account: CLOSEST_GAFA would contain the two or more winner thanks to the concatenation

As a summary, the picture represents the head of the data.table, and the last column CLOSEST_GAFA provides the winner or winners in case of a tie

As the data.table is crowded, we clean the result and display the entire data.table:

```{r}
# Let's display a clean result
dt_survey_final <- copy(dt_survey3)
dt_survey_final <- dt_survey_final[, c("WORD", "CLOSEST_GAFA")]
dt_survey_final
```
This image has an empty alt attribute; its file name is image-5.png

Conclusion of the R programming example

With R language, we have determined that « Gogglle » is closer to « Google » rather than « Amazon », « Facebook » or « Apple ».

The Levenshtein is a very simple distance for estimating the distance between two strings.

And as a corollary, it is useful for determining in a predefined list, which string is the closest to a reference string.

As the algorithm is easy to develop, you can find some Excel examples in order to calculate the Levenshtein distance:

https://www.developpez.net/forums/d1350898/logiciels/microsoft-office/excel/macros-vba-excel/fonction-distance-levenshtein-sous-excel/

This function example is interesting, as it provides the distance not absolute, but relative, as a percentage.