
Deduplicate and clean up millions of location records | by Dr. Paul Kinsvater | Sep, 2022

How record linkage and geocoding combined improve data quality

Cable spaghetti as a synonym for poor data quality from multiple sources. Photo by Ralph (Ravi) Kayden on Unsplash

Big companies store data in several systems for different purposes (ERPs, CRMs, local files). Each potentially holds customer data, and these systems are rarely, if ever, in sync. In addition, links across sources either do not exist or are not properly maintained. The consequence is duplicate records, inconsistencies, and poor data quality in general. That’s a perfect opportunity for us to shine with an algorithmic solution.

This article is about records having address attributes. And my proposal works comfortably for millions of records in a reasonable time. The predominant use case, likely applicable to most larger companies, is customer records having billing or work site addresses. So we are going to tackle the following pain points of a business:

  • How do we eliminate all the duplicate records within each of our customer data sources? And how do we link records across all our data sources to summarize a 360 view of any single customer?
  • How confident are we about the quality of each address record? How can we identify and fix invalid or incomplete records quickly?

My proposal consists of two parts: record linkage and geocoding. The output of both steps helps accelerate the inevitable manual review process: we start with, say, a million records. Then the algorithms produce a practicable shortlist of likely quality issues, and skilled reviewers spend some hours (or days) evaluating the results.

What I learned about algorithmic record linkage for locations

This article is about records with an address. If yours consist of just the addresses and nothing else, jump over to the next section. My example below is about customer location records — addresses with names. The same ideas apply to more complex situations with amounts, dates and times, etc., such as contract records. So imagine we deal with a large table of customer locations from Benelux, with 7 of those given below.

A few examples of location records with duplicates. This is artificially generated data inspired by records the author has seen in a real-world use case (image by author).

In the simplest case, two records represent the same entity if all relevant attributes are identical. But that does not account for typos, language, or other variations of names and addresses. So we need a notion of similarity (or distance) that works for words and other character strings. That’s where record linkage helps, with at least a dozen open-source frameworks; see this overview. I use Python and the RecordLinkage package to illustrate the process and key learnings. We start with text preprocessing, which can make a big difference in matching quality.

First, we normalize the countries. It’s a simple and vital step for index blocking, which will follow in a moment. Second, we apply RecordLinkage’s default clean-up method (all lowercase, no punctuation, consistent encoding, etc.). Potentially, we can do much more by borrowing ideas (and code) from the NLP community. If you want to learn more, start with TextHero:

E.g., so-called “stop word removal,” which, in our example, may translate to removing legal forms such as the Dutch “N.V.” or other frequent words such as “Hotel” (say we have many Hotels as customers).
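To make this concrete, here is a minimal sketch of the preprocessing, assuming the example table is a pandas DataFrame df with columns country, name, street, and city. The country mapping and the stop-word list are illustrative, not exhaustive.

```python
from recordlinkage.preprocessing import clean

# 1) Normalize the country attribute so that index blocking on it is reliable.
#    The mapping is a hypothetical example; in practice it should cover every
#    spelling that occurs in the data (unmapped values become NaN and should
#    be reviewed).
country_map = {"nl": "NL", "netherlands": "NL", "nederland": "NL",
               "be": "BE", "belgium": "BE", "belgie": "BE",
               "lu": "LU", "luxembourg": "LU"}
df["country"] = df["country"].str.strip().str.lower().map(country_map)

# 2) Default clean-up of the free-text attributes: lowercase, strip punctuation
#    and bracketed content.
for col in ["name", "street", "city"]:
    df[col] = clean(df[col])

# 3) Optional "stop word" removal, e.g. legal forms. Punctuation is already
#    gone at this point, so "N.V." has become "nv".
stop_words = ["nv", "bv", "sa"]
pattern = r"\b(" + "|".join(stop_words) + r")\b"
df["name"] = df["name"].str.replace(pattern, "", regex=True).str.strip()
```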

Record linkage can be computationally intensive: a million records potentially translate to comparing a trillion pairs. Indexing techniques reduce the number of candidate pairs, the simplest being “blocking”: comparing only those pairs that share a common attribute. My preferred choice for blocking is the country of an address. It is usually the attribute with the best quality, or at least the simplest to fix. If index blocking by country still results in too many comparisons, combine it with sorted-neighborhood indexing on a second high-quality attribute, such as the city or zip (or the customer name if you run out of options).
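As a sketch of the indexing step (column names as before; the block_on argument of the sorted-neighbourhood algorithm is, to my understanding of the package, what keeps candidates inside one country block):

```python
import recordlinkage

# Plain blocking: only records within the same (normalized) country become
# candidate pairs.
indexer = recordlinkage.Index()
indexer.block("country")
candidate_pairs = indexer.index(df)

# If a single country block is still too large, sorted-neighbourhood indexing
# on a second high-quality attribute narrows it further while tolerating typos.
indexer = recordlinkage.Index()
indexer.sortedneighbourhood("postcode", window=5, block_on="country")
candidate_pairs = indexer.index(df)
```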

Having our candidates ready, we define how to measure their similarities in the following code snippet. The package comes with several built-in choices for measuring the similarity of individual string components — see the documentation of the String class. Jaro-Winkler is a good fit for (short) names, putting more importance near the start of a string. Levenshtein makes more sense for postal codes.
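The original snippet was embedded in the source article; the following is a minimal reconstruction of such a comparison step with the RecordLinkage package, using the column names assumed above:

```python
import recordlinkage

comparer = recordlinkage.Compare()
# Jaro-Winkler favors agreement at the start of a string, a good fit for names.
comparer.string("name", "name", method="jarowinkler", label="name")
comparer.string("street", "street", method="jarowinkler", label="street")
comparer.string("city", "city", method="jarowinkler", label="city")
# Levenshtein (edit distance) is a better fit for postal codes.
comparer.string("postcode", "postcode", method="levenshtein", label="postcode")

# One row of similarity scores (between 0 and 1) per candidate pair.
features = comparer.compute(candidate_pairs, df)
```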

Table of similarity scores (image by author).

I have added a weighted score, with weights based on gut feeling. Such a scalar summary and a threshold allow us to make a final “yes” or “no” decision. Alternatively, we can fit a classification model on a relatively small, balanced set of labeled examples. But be cautious with interpreting the model performance since, in reality, we face an extremely imbalanced problem (many more non-links than links).
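A short sketch of that scoring step; the weights below are placeholders for the gut feeling mentioned above:

```python
# Weighted average of the attribute similarities; the weights are a judgment
# call, not tuned on labeled data.
weights = {"name": 0.4, "street": 0.3, "postcode": 0.2, "city": 0.1}
features["score"] = sum(w * features[col] for col, w in weights.items())

# A scalar score plus a threshold yields the final link/no-link decision.
threshold = 0.85  # see below for how to choose this programmatically
matches = features[features["score"] >= threshold]
```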

Often there is more than one duplicate; sometimes there are dozens of records for the same entity. And the manual review process benefits from having all likely copies of a single address side by side. I use Python’s NetworkX package: records are nodes, and similarities above a threshold are edges. Every connected component is then such a collection of likely copies or links.
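A minimal sketch with NetworkX, reusing the matches table from the scoring step (its MultiIndex holds the record pairs):

```python
import networkx as nx

# Records are nodes, accepted matches are edges.
graph = nx.Graph()
graph.add_nodes_from(df.index)
graph.add_edges_from(matches.index)  # MultiIndex of (record_a, record_b) pairs

# Every connected component with more than one member is a cluster of likely
# duplicates; singletons keep no cluster label.
cluster_of = {
    record: cluster_id
    for cluster_id, component in enumerate(nx.connected_components(graph))
    if len(component) > 1
    for record in component
}
df["cluster"] = df.index.map(cluster_of)
```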

Original data is extended by clusters of similar records (image by author).

We missed putting records 1 and 2 into their cluster. A lower threshold would have caught them, at the risk of adding false positives to our output. So how do we select a threshold programmatically? A simple solution is illustrated in the figure below.

A histogram has been created using 40k pairs of real-world location record comparisons. The choice of the threshold (dashed line) tries to “best separate” the two unknown distributions of correct and incorrect matches, assuming they are roughly unimodal and symmetric (image by author).
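One simple way to implement this programmatically, assuming the scores really do come from two roughly unimodal groups, is to estimate the score density and place the threshold at the valley between its two modes. The sketch below is one such heuristic, not necessarily the author’s exact procedure.

```python
import numpy as np
from scipy.stats import gaussian_kde

def pick_threshold(scores: np.ndarray, grid_size: int = 500) -> float:
    """Place the threshold at the density minimum between the two modes."""
    kde = gaussian_kde(scores)
    grid = np.linspace(scores.min(), scores.max(), grid_size)
    density = kde(grid)
    # Approximate the two modes by the highest peak in each half of the range,
    # then take the lowest point of the density between them.
    mid = grid_size // 2
    left_mode = int(np.argmax(density[:mid]))
    right_mode = mid + int(np.argmax(density[mid:]))
    valley = left_mode + int(np.argmin(density[left_mode:right_mode + 1]))
    return float(grid[valley])

# threshold = pick_threshold(features["score"].to_numpy())
```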

Alternatively, you can borrow a solution from classification literature: know your costs and benefits for all four cases in the confusion matrix, and estimate their frequencies as a function of the threshold. But that will require a relatively large set of labeled examples due to the high imbalance.

Finally, we add summary statistics for every record within a cluster to indicate our confidence in the matching quality.

For each record assigned to a cluster, we compute the minimum, average, and maximum across similarity scores with all other records in the same cluster (image by author).
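One way to compute such statistics is to re-score every pair inside a cluster (not only the pairs that crossed the threshold) and aggregate per record; the sketch below reuses comparer, weights, and the cluster column from the earlier snippets.

```python
import itertools
import pandas as pd

# All intra-cluster pairs, including those that never crossed the threshold.
intra_pairs = pd.MultiIndex.from_tuples([
    pair
    for _, members in df.dropna(subset=["cluster"]).groupby("cluster")
    for pair in itertools.combinations(members.index, 2)
])
intra = comparer.compute(intra_pairs, df)
intra["score"] = sum(w * intra[col] for col, w in weights.items())

# Each pair contributes to the statistics of both of its records.
per_record = pd.concat([
    intra["score"].rename_axis(["record", "other"]).reset_index(),
    intra["score"].rename_axis(["other", "record"]).reset_index(),
])
cluster_stats = per_record.groupby("record")["score"].agg(["min", "mean", "max"])
df = df.join(cluster_stats)
```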

Skilled reviewers can use these statistics to sort and quickly work off the almost perfect matches and spend time where human review matters most.

How geoapify.com helps improve quality and enrich location records

Geocoding is the process of translating an address into latitude and longitude. There are plenty of free services if you deal with just a small set of addresses — check out GeoPy. But none of them are practicable (and likely not even permitted) once the size of your data exceeds, say, a thousand records. And a thousand is still tiny in the real world. Even if you start looking into commercial providers like Google Maps, you will find that they either do not offer a “batch” geocoding service or are expensive. Fortunately, geoapify.com fills this niche. And that’s not the only good news: their web service builds on the openstreetmap.org ecosystem. Establishing a connection between internal and open data opens up opportunities beyond location data quality.

OK, but why discuss geocoding when the topic is data quality? First, it is a special kind of record linkage solution for addresses. Indeed, we could even use it as a preprocessing step for the record linkage in the previous section. But the main reason is that the service does not expect perfect search input from users. Nominatim (OpenStreetMap’s geocoding engine) extracts features from the search text and applies scoring logic to determine the best match with a known location record. That best match is delivered in a structured form, including geo-coordinates and several confidence scores: for the street, for the city, and an overall confidence. A low value in any of the three indicates poor match quality, which helps to quickly identify issues in the original input.

We continue with our example from the previous section. If you want to follow along, sign up at geoapify.com and generate your API key. An extensive free tier allows you to geocode up to 6,000 addresses per day.

The batch geocoding service accepts lists of strings as input, one string per address. We concatenate our structured address data to make this work, request batch geocoding, and present selected output attributes parsed into a DataFrame.
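A minimal sketch of this step. For readability it calls the single-address forward-geocoding endpoint in a loop; the batch endpoint works asynchronously on a whole list of addresses and returns the same kind of attributes per address. The exact names of the confidence fields under rank follow the article’s description and should be checked against the Geoapify API reference.

```python
import requests
import pandas as pd

API_KEY = "YOUR_GEOAPIFY_KEY"  # sign up at geoapify.com to create one
URL = "https://api.geoapify.com/v1/geocode/search"

def geocode(address: str) -> dict:
    """Forward-geocode one free-text address and return selected attributes."""
    resp = requests.get(URL, params={"text": address, "limit": 1, "apiKey": API_KEY})
    resp.raise_for_status()
    props = resp.json()["features"][0]["properties"]
    rank = props.get("rank", {})
    return {
        "query": address,
        "formatted": props.get("formatted"),
        "lat": props.get("lat"),
        "lon": props.get("lon"),
        "result_type": props.get("result_type"),
        "place_id": props.get("place_id"),
        "confidence": rank.get("confidence"),
        "confidence_city": rank.get("confidence_city_level"),
        "confidence_street": rank.get("confidence_street_level"),
    }

# One free-text search string per record, concatenated from the structured columns.
addresses = df["street"] + ", " + df["postcode"] + " " + df["city"] + ", " + df["country"]
geocoded = pd.DataFrame([geocode(a) for a in addresses])
```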

The output of the geoapify.com batch geocoding service is parsed to a DataFrame. The three last columns indicate data quality issues in the original input data (image by author).

The service returns much more than an address. It also indicates the type of the location. We do not expect entire districts as results, yet that is what we get for address 5: the original input turns out to be a PO box.

Conclusion and outlook

Companies grow organically or through mergers and acquisitions, and so does their data. Usually, the quality does not keep up with that growth. This article proposes a method to accelerate the clean-up of messy location records (customers with billing addresses, work sites, etc.): a two-step procedure based on algorithmic record linkage and geocoding. The algorithms scale well to millions of records, and, in my experience, skilled reviewers can complete the manual check in a surprisingly short time when using the outputs.

We use the batch geocoding service of geoapify.com to validate the address data quality. And that is just one of many opportunities enabled by their web service.

  • Enriching our data with geo-coordinates allows us to add location intelligence to many problems we can tackle with data science; see this open-source book for an introduction to spatial data science. Do you deal with customer churn? Have you checked whether customers located near others who churned are also at risk of leaving?
  • Geoapify.com builds on the OpenStreetMap ecosystem, which links to Wikidata. So we can connect our internal location records with a massive open data set. The place_id attribute is part of every geocoding output, and it can tell us a lot more about a location. Another geoapify.com endpoint, the Place Details API, helps with this job; see the sketch below. For example, using the place_id of address 7, Hotel Astoria, we get many more details, such as a link to their website and the Wikidata ID Q649690. Conversely, the Places API can tell us which hotels are missing from our customer database in any given region.
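As a rough sketch of the Place Details lookup (the endpoint URL and the response property names are assumptions based on my reading of the Geoapify docs; verify them against the current API reference):

```python
import requests

API_KEY = "YOUR_GEOAPIFY_KEY"
URL = "https://api.geoapify.com/v2/place-details"

# place_id as returned by the geocoding step, e.g. for address 7, Hotel Astoria.
place_id = geocoded.loc[6, "place_id"]

resp = requests.get(URL, params={"id": place_id, "apiKey": API_KEY})
resp.raise_for_status()
details = resp.json()["features"][0]["properties"]

# Property names such as 'website' and 'wiki_and_media' are assumptions;
# inspect 'details' to see what is actually available for a given location.
print(details.get("website"), details.get("wiki_and_media", {}).get("wikidata"))
```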

This was my first article about location intelligence and related topics. More will follow soon.

