
How to Identify Fuzzy Duplicates in Your Tabular Dataset | by Avi Chawla | Mar, 2023

Photo by Sangga Rima Roman Selia on Unsplash

In today’s data-driven world, the importance of high-quality data for building reliable systems cannot be overstated.

Reliable data is critical for teams to make informed decisions, develop effective strategies, and gain valuable insights.

However, at times, the quality of this data gets compromised by various factors, one of which is the presence of fuzzy duplicates.

A set of records is considered fuzzy duplicates when the records look similar but are not 100% identical.

For instance, consider the two records:

Fuzzy duplicates example (Image by Author)
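
For illustration, such a pair might look like this (the values are made up):

name: Robert Smith      address: 123 Main Street
name: Bob Smith         address: 123 Main St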

In this example, the two records have similar but not identical values for both the name and address fields.

How do we get duplicates?

Duplicates can arise due to various reasons, such as misspellings, abbreviations, variations in formatting, or data entry errors.

These can be challenging to identify and address because they are not always immediately apparent, so detecting them may require sophisticated algorithms and techniques.

Implications of duplicates

Fuzzy duplicates can have significant implications for data quality, because they lead to inaccurate or incomplete analysis and decision-making.

For instance, if you analyze a dataset that contains fuzzy duplicates, you may end up overestimating or underestimating certain variables, leading to flawed conclusions.

Having established the importance of the problem, let’s look at how you can perform data deduplication.

Let’s begin 🚀!

Imagine you have a dataset with over a million records that may contain some fuzzy duplicates.

The simplest and most intuitive approach, and the one many people come up with first, is to compare every pair of records.

However, this quickly gets infeasible as the size of your dataset grows.

For instance, if you have a million records (10⁶), the naive approach would require roughly 10¹² comparisons (10⁶ × 10⁶), as shown below:

def is_duplicate(record1, record2):
    ## function to determine whether record1 and record2
    ## are similar or not.
    ...

## naive approach: compare every record against every other record
for record1 in all_records:
    for record2 in all_records:
        result = is_duplicate(record1, record2)

Even if we assume a decent speed of 10,000 comparisons per second, that works out to 10¹² / 10⁴ = 10⁸ seconds, or roughly three years, to complete.
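
As a side note, here is one way such an is_duplicate function might be sketched, using Python’s built-in difflib and an arbitrary 0.85 similarity threshold (records are assumed to be dictionaries; this is only an illustration, not how CSVDedupe scores records):

from difflib import SequenceMatcher

def is_duplicate(record1, record2, threshold=0.85):
    ## average character-level similarity across the fields
    ## that the two records have in common
    common_fields = record1.keys() & record2.keys()
    scores = [
        SequenceMatcher(None, str(record1[field]), str(record2[field])).ratio()
        for field in common_fields
    ]
    if not scores:
        return False
    return sum(scores) / len(scores) >= threshold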

CSVDedupe is an ML-based open-source command-line tool that identifies and removes duplicate records in a CSV file.

One of its key features is blocking, which drastically improves the run-time of deduplication.

For instance, if you are finding duplicates in names, the approach suggests that comparing the name “Daniel” to “Philip” or “Shannon” to “Julia” makes no sense. They are guaranteed to be distinct records.

In other words, fuzzy duplicates will almost always share some lexical overlap, yet the naive approach still compares pairs that clearly do not.

Using blocking, CSVDedupe groups records into smaller buckets and only compares records that fall into the same bucket.

This is an efficient way to reduce the number of redundant comparisons, as it is unlikely that records in different groups will be duplicates.

For example, one grouping rule could be to check if the first three letters of the name field are the same.

In that case, records with different first three letters in their name field would be in different groups and would not be compared.

Blocking using CSVDedupe (Image by Author)

However, records with the same first three letters in their name field would be in the same block, and only those records would be compared to each other.

This saves us from many redundant comparisons, which are guaranteed to be non-duplicates, like “John” and “Peter.”
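
To make the idea concrete, here is a minimal sketch of blocking on the first three letters of the name field (records are assumed to be dictionaries with a name key; this illustrates the general technique, not CSVDedupe’s internal implementation):

from collections import defaultdict
from itertools import combinations

def block_key(record):
    ## blocking predicate: first three letters of the name, lowercased
    return record["name"][:3].lower()

def candidate_pairs(all_records):
    ## group records into blocks keyed by the blocking predicate,
    ## then generate pairs only within each block
    blocks = defaultdict(list)
    for record in all_records:
        blocks[block_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)

## only pairs that share a block ever reach the expensive comparison:
## for record1, record2 in candidate_pairs(all_records):
##     result = is_duplicate(record1, record2)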

CSVDedupe uses active learning to identify these blocking rules.

Let’s now look at a demo of CSVDedupe.

Install CSVDedupe

To install CSVDedupe, run the following command:
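
pip install csvdedupe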

And done! We can now move to experimentation.

Dummy data

For this experiment, I have created dummy data of potential duplicates. This is shown below:
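
The exact values are not important; for illustration, assume the data looks something like this (made-up names and addresses):

name,address
John Smith,132 Maple Street
Jon Smith,132 Maple St.
Emily Clarke,45 Oak Avenue
Emily Clark,45 Oak Ave
Michael Brown,9 Elm Road
Sarah Johnson,77 Pine Lane
Daniel Lee,210 Cedar Blvd
Daniel Leigh,210 Cedar Boulevard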

As you can predict, the fuzzy duplicates are (0,1), (2,3), and (6,7).

CSVDedupe is used as a command-line tool, so we first dump this data into a CSV file.

Marking duplicates

On the command line, CSVDedupe takes an input CSV file along with a few additional arguments.

The command is written below:
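
With the illustrative file and field names used in this post, the invocation would look like this (file names are placeholders):

csvdedupe input.csv --field_names name address --output_file output.csv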

First, we provide the input CSV file. Next, we specify the fields we wish to consider for deduplication with the --field_names argument. In this case, we consider all fields, but if you want to mark duplicates based only on a subset of columns, you can do so with this argument.

Lastly, we have the --output_file argument, which, as the name suggests, is used to specify the name of the output file.

When we run this in the command line, CSVDedupe will perform its active learning step.

In a nutshell, it will pick some pairs of records from the given data and ask you whether they are duplicates, as shown below:

Active learning step of CSVDedupe (Image by Author)

Label as many pairs as you wish. Once you are done, press f to finish.

Next, it will automatically start identifying duplicates based on the blocking predicates it learned during active learning.

Once done, the output will be stored in the file specified in the --output_file argument.

Post deduplication, we get the following output:
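
With the illustrative data above, the output file would look roughly like this (cluster numbering and row order are illustrative):

Cluster ID,name,address
0,John Smith,132 Maple Street
0,Jon Smith,132 Maple St.
1,Emily Clarke,45 Oak Avenue
1,Emily Clark,45 Oak Ave
2,Daniel Lee,210 Cedar Blvd
2,Daniel Leigh,210 Cedar Boulevard
3,Michael Brown,9 Elm Road
4,Sarah Johnson,77 Pine Lane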

CSVDedupe inserts a new column, namely Cluster ID. Records that share the same Cluster ID are potential duplicates, as identified by CSVDedupe’s model.

For instance, in this case, the model suggests that both records under Cluster ID = 0 are duplicates, which is also correct.

