
How to Identify Fuzzy Duplicates in Your Tabular Dataset | by Avi Chawla | Mar, 2023

Photo by Sangga Rima Roman Selia on Unsplash

In today’s data-driven world, the importance of high-quality data for building reliable systems cannot be overstated.

Reliable data is critical for teams to make informed decisions, develop effective strategies, and gain valuable insights.

However, at times, the quality of this data gets compromised by various factors, one of which is the presence of fuzzy duplicates.

A set of records is considered fuzzy duplicates when the records look similar but are not 100% identical.

For instance, consider the two records:

Fuzzy duplicates example (Image by Author)
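
For illustration, such a pair might look like this (the values are made up):

name: Robert Smith      address: 123 Main Street
name: Bob Smith         address: 123 Main St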

In this example, the two records have similar but not identical values for both the name and address fields.

How do we get duplicates?

Duplicates can arise due to various reasons, such as misspellings, abbreviations, variations in formatting, or data entry errors.

These can be challenging to identify and address because they are not always immediately apparent, so detecting them may require sophisticated algorithms and techniques.

Implications of duplicates

Fuzzy duplicates can have significant implications for data quality, because they lead to inaccurate or incomplete analysis and decision-making.

For instance, if you analyze a dataset that contains fuzzy duplicates, you may end up overestimating or underestimating certain variables, leading to flawed conclusions.

Having established the importance of the problem, let’s look at how you can perform data deduplication.

Let’s begin 🚀!

Imagine you have a dataset with over a million records that may contain some fuzzy duplicates.

The simplest and most intuitive approach, and the one many people come up with first, is to compare every pair of records.

However, this quickly gets infeasible as the size of your dataset grows.

For instance, if you have a million records (10⁶), the naive approach would require roughly 10¹² comparisons (10⁶ × 10⁶), as shown below:

def is_duplicate(record1, record2):
    ## function to determine whether record1 and record2
    ## are similar or not.
    ...

## naive approach: compare every record against every other record
for record1 in all_records:
    for record2 in all_records:
        result = is_duplicate(record1, record2)

Even if we assume a decent speed of 10,000 comparisons per second, that works out to 10¹² / 10⁴ = 10⁸ seconds, or roughly three years, to complete.
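
As a side note, here is one way such an is_duplicate function might be sketched, using Python’s built-in difflib and an arbitrary 0.85 similarity threshold (records are assumed to be dictionaries; this is only an illustration, not how CSVDedupe scores records):

from difflib import SequenceMatcher

def is_duplicate(record1, record2, threshold=0.85):
    ## average character-level similarity across the fields
    ## that the two records have in common
    common_fields = record1.keys() & record2.keys()
    scores = [
        SequenceMatcher(None, str(record1[field]), str(record2[field])).ratio()
        for field in common_fields
    ]
    if not scores:
        return False
    return sum(scores) / len(scores) >= threshold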

CSVDedupe is an ML-based open-source command-line tool that identifies and removes duplicate records in a CSV file.

One of its key features is blocking, which drastically improves the run-time of deduplication.

For instance, if you are finding duplicates in names, the approach suggests that comparing the name “Daniel” to “Philip” or “Shannon” to “Julia” makes no sense. They are guaranteed to be distinct records.

In other words, fuzzy duplicates will almost always share some lexical overlap, yet the naive approach still compares pairs that clearly do not.

Using blocking, CSVDedupe groups records into smaller buckets and only compares records that fall into the same bucket.

This is an efficient way to reduce the number of redundant comparisons, as it is unlikely that records in different groups will be duplicates.

For example, one grouping rule could be to check if the first three letters of the name field are the same.

In that case, records with different first three letters in their name field would be in different groups and would not be compared.

Blocking using CSVDedupe (Image by Author)

However, records with the same first three letters in their name field would be in the same block, and only those records would be compared to each other.

This saves us from many redundant comparisons, which are guaranteed to be non-duplicates, like “John” and “Peter.”
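
To make the idea concrete, here is a minimal sketch of blocking on the first three letters of the name field (records are assumed to be dictionaries with a name key; this illustrates the general technique, not CSVDedupe’s internal implementation):

from collections import defaultdict
from itertools import combinations

def block_key(record):
    ## blocking predicate: first three letters of the name, lowercased
    return record["name"][:3].lower()

def candidate_pairs(all_records):
    ## group records into blocks keyed by the blocking predicate,
    ## then generate pairs only within each block
    blocks = defaultdict(list)
    for record in all_records:
        blocks[block_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)

## only pairs that share a block ever reach the expensive comparison:
## for record1, record2 in candidate_pairs(all_records):
##     result = is_duplicate(record1, record2)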

CSVDedupe uses active learning to identify these blocking rules.

Let’s now look at a demo of CSVDedupe.

Install CSVDedupe

To install CSVDedupe, run the following command:
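
pip install csvdedupe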

And done! We can now move to experimentation.

Dummy data

For this experiment, I have created dummy data of potential duplicates. This is shown below:
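
The exact values are not important; for illustration, assume the data looks something like this (made-up names and addresses):

name,address
John Smith,132 Maple Street
Jon Smith,132 Maple St.
Emily Clarke,45 Oak Avenue
Emily Clark,45 Oak Ave
Michael Brown,9 Elm Road
Sarah Johnson,77 Pine Lane
Daniel Lee,210 Cedar Blvd
Daniel Leigh,210 Cedar Boulevard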

As you can predict, the fuzzy duplicates are (0,1), (2,3), and (6,7).

CSVDedupe is used as a command-line tool, so we first dump this data into a CSV file.

Marking duplicates

On the command line, CSVDedupe takes an input CSV file along with a few additional arguments.

The command is written below:
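
With the illustrative file and field names used in this post, the invocation would look like this (file names are placeholders):

csvdedupe input.csv --field_names name address --output_file output.csv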

First, we provide the input CSV file. Next, we specify the fields we wish to consider for deduplication with the --field_names argument. In this case, we consider all fields, but if you want to mark duplicates based only on a subset of columns, you can do so with this argument.

Lastly, we have the --output_file argument, which, as the name suggests, is used to specify the name of the output file.

When we run this in the command line, CSVDedupe will perform its active learning step.

In a nutshell, it will pick some pairs of records from the given data and ask you whether they are duplicates, as shown below:

Active learning step of CSVDedupe (Image by Author)

Label as many pairs as you wish. Once you are done, press f to finish.

Next, it will automatically start identifying duplicates based on the blocking predicates it learned during active learning.

Once done, the output will be stored in the file specified in the --output_file argument.

Post deduplication, we get the following output:
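
With the illustrative data above, the output file would look roughly like this (cluster numbering and row order are illustrative):

Cluster ID,name,address
0,John Smith,132 Maple Street
0,Jon Smith,132 Maple St.
1,Emily Clarke,45 Oak Avenue
1,Emily Clark,45 Oak Ave
2,Daniel Lee,210 Cedar Blvd
2,Daniel Leigh,210 Cedar Boulevard
3,Michael Brown,9 Elm Road
4,Sarah Johnson,77 Pine Lane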

CSVDedupe inserts a new column, namely Cluster ID. Records that share the same Cluster ID are potential duplicates, as identified by CSVDedupe’s model.

For instance, in this case, the model suggests that both records under Cluster ID = 0 are duplicates, which is also correct.

