
How to Do Fuzzy String Matching in Pandas Dataframes



Photo by Lucas Santos on Unsplash

The real world is not perfect.

People use different forms of the same piece of information. Even well-established systems use different standards. You’ve probably seen city names misspelled, like “Santana” instead of “Santa Ana” or “St. Louie” instead of “St. Louis.”

When working with real-world data, this is inevitable. Thus we must ensure that the data we pass to the next steps in our pipeline is standardized.

We can tackle this problem with fuzzy string matching. It isn’t perfect either, but it’s very helpful.

Fuzzy String Matching in Python

You’d probably use Pandas dataframes for wrangling if you’re a Python programmer. Along with pandas, you could use “thefuzz” to do fuzzy string matching.

pip install thefuzz
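
Depending on your version of the package, matching may fall back to a slower pure-Python implementation. The project has historically documented an optional speedup extra that pulls in a faster Levenshtein backend (on recent versions this may already be the default):

pip install "thefuzz[speedup]"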

TheFuzz is an open-source Python package formerly known as “FuzzyWuzzy.” It uses the Levenshtein edit distance to calculate string similarity.

Here’s its basic usage:

from thefuzz import fuzz, process

process.extractBests(
    "my precious",
    [
        "My brushes",
        "my purses",
        "my prices",
        "me priceless",
        "my prcios",
        "My bruises",
        "My praises",
        "My precursors",
        "My process",
        "My princess",
        "My progresses",
        "My prospects",
        "My producers",
        "My precisions",
        "My presuppositions",
    ],
)

# Output
>> [('my prcios', 90),
    ('My presuppositions', 86),
    ('My precisions', 83),
    ('My precursors', 75),
    ('my purses', 70)]

process.extractOne(
    "my precious",
    [
        ...
    ],
)

# Output
>> ('my prcios', 90)
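
Both helpers accept optional arguments to tune the matching. Here’s a sketch, assuming the signatures thefuzz inherits from FuzzyWuzzy (scorer swaps the similarity function; score_cutoff and limit trim the results):

from thefuzz import fuzz, process

choices = ["my prcios", "My precisions", "my purses"]  # any list of candidates

process.extractBests(
    "my precious",
    choices,
    scorer=fuzz.token_sort_ratio,  # use a different similarity measure
    score_cutoff=75,               # drop matches scoring below 75
    limit=3,                       # return at most three matches
)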

How is text similarity measured?

The Levenshtein distance, also known as edit distance, is a metric used to measure the difference or similarity between two strings. It calculates the minimum number of operations (insertions, deletions, or substitutions) required to transform one string into another.

The smaller the Levenshtein distance between two strings, the more similar they are.

For example, consider the two strings “chat” and “chart”. The Levenshtein distance between them is 1 because the only operation needed to transform “chat” into “chart” is to insert the letter “r”.

Now consider the strings “intention” and “execution”. The Levenshtein distance between them is 5. The following is one of the possible ways to transform “intention” into “execution” with a minimum number of operations:

  1. Replace ‘i’ with ‘e’: entention
  2. Replace ‘n’ with ‘x’: extention
  3. Delete ‘t’: exention
  4. Replace ‘n’ with ‘c’: exection
  5. Insert ‘u’: execution

The total cost of these operations is 5, which is the Levenshtein distance between the two strings.

Generally, the smaller the Levenshtein distance between two strings, the more similar they are. For instance, if two strings have a Levenshtein distance of 0, then they are identical. Conversely, if the distance is large, it indicates that the strings are significantly different.
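
If you’re curious about the mechanics, here’s a minimal sketch of the classic dynamic-programming algorithm. This is for illustration only; thefuzz relies on an optimized backend rather than pure Python:

def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution
            ))
        prev = curr
    return prev[-1]

print(levenshtein("chat", "chart"))           # 1
print(levenshtein("intention", "execution"))  # 5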

Please refer to the Educative article on Levenshtein distance for more detail, as a full treatment is beyond the scope of this post.

But here’s a short example of how we can get a similarity score for two strings in Python.

from thefuzz import fuzz

# Example 1: Simple ratio (Levenshtein-based similarity)
ratio = fuzz.ratio("apple", "banana")
print(ratio)  # Output: 18

# Example 2: Token sort ratio (sorts the words before comparing)
token_sort = fuzz.token_sort_ratio("apple", "banana")
print(token_sort)  # Output: 18

# Example 3: Partial ratio (best matching substring)
partial_ratio = fuzz.partial_ratio("apple", "banana")
print(partial_ratio)  # Output: 20

# Example 4: Token set ratio (ignores word order and duplicate words)
token_set_ratio = fuzz.token_set_ratio("apple is a fruit", "a fruit is an apple")
print(token_set_ratio)  # Output: 100
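
The variants matter once the strings differ in length or word order. For example, partial_ratio scores a clean substring match perfectly, and token_sort_ratio is insensitive to word order:

# partial_ratio slides the shorter string along the longer one
print(fuzz.partial_ratio("apple", "apple pie and cream"))  # Output: 100

# token_sort_ratio sorts the words before comparing
print(fuzz.token_sort_ratio("fruit apple", "apple fruit"))  # Output: 100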

Using fuzzy string matching to remove duplicates in Pandas dataframes

One of the common tasks when working with user-created data is removing duplicates. But the task becomes challenging when the duplicates aren’t exact matches.

We can write a little function to check string similarity and remove duplicates. Here’s an example:

import pandas as pd
from thefuzz import fuzz, process

data = {
    "Name": ["John Smith", "Jon Smtih", "Jane Doe", "James Johnsan", "Janes Johnson"],
    "Age": [25, 25, 30, 40, 40],
    "Gender": ["M", "M", "F", "M", "M"],
}

df = pd.DataFrame(data)

display(df)

# Output
| | Name | Age | Gender |
|---:|:--------------|------:|:---------|
| 0 | John Smith | 25 | M |
| 1 | Jon Smtih | 25 | M |
| 2 | Jane Doe | 30 | F |
| 3 | James Johnsan | 40 | M |
| 4 | Janes Johnson | 40 | M |

def compare_strings(a, b):
    # Token sort ratio handles word-order differences
    return fuzz.token_sort_ratio(a, b)

def remove_duplicates(df, threshold=90):
    duplicates = set()
    processed = []

    for i, row in df.iterrows():
        if i not in duplicates:
            # Keep the first occurrence of each name...
            processed.append(row)

            # ...and mark every later row with a similar enough name
            for j, other_row in df.iterrows():
                if i != j and j not in duplicates:
                    score = compare_strings(row["Name"], other_row["Name"])

                    if score >= threshold:
                        duplicates.add(j)

    return pd.DataFrame(processed)

remove_duplicates(df, threshold=80)

# Output
| | Name | Age | Gender |
|---:|:--------------|------:|:---------|
| 0 | John Smith | 25 | M |
| 2 | Jane Doe | 30 | F |
| 3 | James Johnsan | 40 | M |

Our original dataset had similar names: John Smith and Jon Smtih, James Johnsan and Janes Johnson. But we were able to get rid of these duplicates.
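
As an aside, thefuzz also ships a process.dedupe helper that deduplicates a plain list of strings directly. If you only need a column’s unique values rather than whole rows, it may be enough. A sketch, assuming the signature carried over from FuzzyWuzzy:

from thefuzz import process

names = ["John Smith", "Jon Smtih", "Jane Doe", "James Johnsan", "Janes Johnson"]

# Collapses fuzzy duplicates, keeping one representative per group
print(process.dedupe(names, threshold=80))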

Standardizing fuzzy duplicates in a Pandas dataframe

Here’s an implementation of a function that replaces duplicates in the “Name” column with the first occurrence, based on a similarity score threshold, using the thefuzz package:

def replace_duplicates(df, column_name="Name", threshold=90):
    duplicates = set()
    first_occurrence = {}

    for i, row in df.iterrows():
        row_text = row[column_name]

        if i not in duplicates:
            # The first time we see a name, it becomes its own canonical row
            first_occurrence[row_text] = i

            # Point every similar later row at this canonical row
            for j, other_row in df.iterrows():
                if i != j and j not in duplicates:
                    other_text = other_row[column_name]
                    score = fuzz.token_set_ratio(row_text, other_text)

                    if score >= threshold:
                        duplicates.add(j)
                        first_occurrence[other_text] = i

    # first_occurrence maps each distinct name to the index of its canonical
    # row, so fuzzy duplicates come back as repeated canonical rows
    return df.iloc[list(first_occurrence.values())]

We can replace the Jon Smtihs with John Smiths in our dataset using this function.

replace_duplicates(df, threshold=80)

# Output
| | Name | Age | Gender |
|---:|:--------------|------:|:---------|
| 0 | John Smith | 25 | M |
| 0 | John Smith | 25 | M |
| 2 | Jane Doe | 30 | F |
| 3 | James Johnsan | 40 | M |
| 3 | James Johnsan | 40 | M |
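
A variant worth considering: if you already know the canonical spellings, you can map every value to its closest match in a reference list instead of keeping whichever form happened to appear first. A sketch, where canonical_names is a hypothetical list you’d maintain yourself:

from thefuzz import process

canonical_names = ["John Smith", "Jane Doe", "James Johnson"]

def standardize(name, choices, threshold=80):
    # extractOne returns the best (match, score) pair
    match = process.extractOne(name, choices)
    # Keep the original value when nothing clears the threshold
    return match[0] if match and match[1] >= threshold else name

df["Name"] = df["Name"].apply(lambda n: standardize(n, canonical_names))

With “James Johnson” in the reference list, both misspelled variants map to the preferred spelling.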

Conclusion

Fuzzy string matching is a must for real-life data. Unless your data is autogenerated, you can almost always expect non-standard values in your dataset. Even among autogenerated systems, conventions may vary.

When text data is involved, we often get stuck at a point where fuzzy matching is required.

In such situations, we can use the techniques highlighted in this post. We used the Python package “thefuzz” to match strings by Levenshtein distance and removed duplicates from Pandas dataframes. If needed, we can replace duplicates with one standard value instead of removing them.

But fuzzy string matching is not perfect. For instance, in the last example of replacing duplicates, our script replaced every “Janes Johnson” with “James Johnsan.” If “Johnson” is the spelling we prefer, our script hasn’t done a good job.

Thus, we should treat fuzzy matching as a last resort or a helpful guide. Relying on it too heavily is not advisable.

I hope this helps.

