
Deduplicate and clean up millions of location records | by Dr. Paul Kinsvater | Sep, 2022

How record linkage and geocoding combined improve data quality

Cable spaghetti as a synonym for poor data quality from multiple sources. Photo by Ralph (Ravi) Kayden on Unsplash

Big companies store data in several systems for different purposes (ERPs, CRMs, local files). Each potentially holds customer data, and these systems are rarely, if ever, in sync. In addition, links across sources either do not exist or are not properly maintained. The consequence is duplicate records, inconsistencies, and poor data quality in general. That’s a perfect opportunity for us to shine with an algorithmic solution.

This article is about records having address attributes. And my proposal works comfortably for millions of records in a reasonable time. The predominant use case, likely applicable to most larger companies, is customer records having billing or work site addresses. So we are going to tackle the following pain points of a business:

  • How do we eliminate all the duplicate records within each of our customer data sources? And how do we link records across all our data sources to summarize a 360 view of any single customer?
  • How confident are we about the quality of each address record? How can we identify and fix invalid or incomplete records quickly?

My proposal consists of two parts: record linkage and geocoding. The output of both steps helps accelerate the inevitable manual review process: we start with, say, a million records. Then the algorithms produce a practicable shortlist of likely quality issues, and skilled reviewers spend some hours (or days) evaluating the results.

What I learned about algorithmic record linkage for locations

This article is about records with an address. If yours consist of just the addresses and nothing else, jump over to the next section. My example below is about customer location records — addresses with names. The same ideas apply to more complex situations with amounts, dates and times, etc., such as contract records. So imagine we deal with a large table of customer locations from Benelux, with 7 of those given below.

A few examples of location records with duplicates. This is artificially generated data inspired by records the author has seen in a real-world use case (image by author).

In the simplest case, two records represent the same entity if all relevant attributes are identical. But that does not account for typos, language, or other variations of names and addresses. So we need a notion of similarity (or distance) that works for words and other character strings. That’s where record linkage helps, with at least a dozen open-source frameworks; see this overview. I use Python and the RecordLinkage package to illustrate the process and key learnings. We start with text preprocessing, which can make a big difference in matching quality.

First, we normalize the countries. It’s a simple and vital step for index blocking, which will follow in a moment. Second, we apply RecordLinkage’s default clean-up method (all lowercase, no punctuation, consistent encoding, etc.). Potentially, we can do much more by borrowing ideas (and code) from the NLP community. If you want to learn more, start with TextHero:

E.g., so-called “stop word removal,” which, in our example, may translate to removing legal forms such as the Dutch “N.V.” or other frequent words such as “Hotel” (say we have many Hotels as customers).
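To make this concrete, here is a minimal sketch of the preprocessing, assuming the example table is a pandas DataFrame df with columns country, name, street, and city. The country mapping and the stop-word list are illustrative, not exhaustive.

```python
from recordlinkage.preprocessing import clean

# 1) Normalize the country attribute so that index blocking on it is reliable.
#    The mapping is a hypothetical example; in practice it should cover every
#    spelling that occurs in the data (unmapped values become NaN and should
#    be reviewed).
country_map = {"nl": "NL", "netherlands": "NL", "nederland": "NL",
               "be": "BE", "belgium": "BE", "belgie": "BE",
               "lu": "LU", "luxembourg": "LU"}
df["country"] = df["country"].str.strip().str.lower().map(country_map)

# 2) Default clean-up of the free-text attributes: lowercase, strip punctuation
#    and bracketed content.
for col in ["name", "street", "city"]:
    df[col] = clean(df[col])

# 3) Optional "stop word" removal, e.g. legal forms. Punctuation is already
#    gone at this point, so "N.V." has become "nv".
stop_words = ["nv", "bv", "sa"]
pattern = r"\b(" + "|".join(stop_words) + r")\b"
df["name"] = df["name"].str.replace(pattern, "", regex=True).str.strip()
```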

Record linkage can be computationally intensive: a million records potentially translate to comparing a trillion pairs. Indexing techniques reduce the number of candidate pairs, the simplest being “blocking”: comparing only those pairs that share a common attribute. My preferred choice for blocking is the country of an address. It is usually the attribute with the best quality, or at least the simplest to fix. If index blocking by country still results in too many comparisons, combine it with sorted-neighborhood indexing on a second high-quality attribute, such as the city or zip (or the customer name if you run out of options).
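As a sketch of the indexing step (column names as before; the block_on argument of the sorted-neighbourhood algorithm is, to my understanding of the package, what keeps candidates inside one country block):

```python
import recordlinkage

# Plain blocking: only records within the same (normalized) country become
# candidate pairs.
indexer = recordlinkage.Index()
indexer.block("country")
candidate_pairs = indexer.index(df)

# If a single country block is still too large, sorted-neighbourhood indexing
# on a second high-quality attribute narrows it further while tolerating typos.
indexer = recordlinkage.Index()
indexer.sortedneighbourhood("postcode", window=5, block_on="country")
candidate_pairs = indexer.index(df)
```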

Having our candidates ready, we define how to measure their similarities in the following code snippet. The package comes with several built-in choices for measuring the similarity of individual string components — see the documentation of the String class. Jaro-Winkler is a good fit for (short) names, putting more importance near the start of a string. Levenshtein makes more sense for postal codes.
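The original snippet was embedded in the source article; the following is a minimal reconstruction of such a comparison step with the RecordLinkage package, using the column names assumed above:

```python
import recordlinkage

comparer = recordlinkage.Compare()
# Jaro-Winkler favors agreement at the start of a string, a good fit for names.
comparer.string("name", "name", method="jarowinkler", label="name")
comparer.string("street", "street", method="jarowinkler", label="street")
comparer.string("city", "city", method="jarowinkler", label="city")
# Levenshtein (edit distance) is a better fit for postal codes.
comparer.string("postcode", "postcode", method="levenshtein", label="postcode")

# One row of similarity scores (between 0 and 1) per candidate pair.
features = comparer.compute(candidate_pairs, df)
```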

Table of similarity scores (image by author).

I have added a weighted score, with weights based on gut feeling. Such a scalar summary and a threshold allow us to make a final “yes” or “no” decision. Alternatively, we can fit a classification model on a relatively small, balanced set of labeled examples. But be cautious with interpreting the model performance since, in reality, we face an extremely imbalanced problem (many more non-links than links).
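A short sketch of that scoring step; the weights below are placeholders for the gut feeling mentioned above:

```python
# Weighted average of the attribute similarities; the weights are a judgment
# call, not tuned on labeled data.
weights = {"name": 0.4, "street": 0.3, "postcode": 0.2, "city": 0.1}
features["score"] = sum(w * features[col] for col, w in weights.items())

# A scalar score plus a threshold yields the final link/no-link decision.
threshold = 0.85  # see below for how to choose this programmatically
matches = features[features["score"] >= threshold]
```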

Often there is more than one duplicate; sometimes there are dozens of records for the same entity. And the manual review process benefits from having all likely copies of a single address side by side. I use Python’s NetworkX package: records are nodes, and similarities above a threshold are edges. Every connected component is then such a collection of likely copies or links.
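A minimal sketch with NetworkX, reusing the matches table from the scoring step (its MultiIndex holds the record pairs):

```python
import networkx as nx

# Records are nodes, accepted matches are edges.
graph = nx.Graph()
graph.add_nodes_from(df.index)
graph.add_edges_from(matches.index)  # MultiIndex of (record_a, record_b) pairs

# Every connected component with more than one member is a cluster of likely
# duplicates; singletons keep no cluster label.
cluster_of = {
    record: cluster_id
    for cluster_id, component in enumerate(nx.connected_components(graph))
    if len(component) > 1
    for record in component
}
df["cluster"] = df.index.map(cluster_of)
```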

Original data is extended by clusters of similar records (image by author).

We missed putting records 1 and 2 into their cluster. A lower threshold would have caught them, at the risk of adding false positives to our output. So how do we select a threshold programmatically? A simple solution is illustrated in the figure below.

A histogram has been created using 40k pairs of real-world location record comparisons. The choice of the threshold (dashed line) tries to “best separate” the two unknown distributions of correct and incorrect matches, assuming they are roughly unimodal and symmetric (image by author).
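One simple way to implement this programmatically, assuming the scores really do come from two roughly unimodal groups, is to estimate the score density and place the threshold at the valley between its two modes. The sketch below is one such heuristic, not necessarily the author’s exact procedure.

```python
import numpy as np
from scipy.stats import gaussian_kde

def pick_threshold(scores: np.ndarray, grid_size: int = 500) -> float:
    """Place the threshold at the density minimum between the two modes."""
    kde = gaussian_kde(scores)
    grid = np.linspace(scores.min(), scores.max(), grid_size)
    density = kde(grid)
    # Approximate the two modes by the highest peak in each half of the range,
    # then take the lowest point of the density between them.
    mid = grid_size // 2
    left_mode = int(np.argmax(density[:mid]))
    right_mode = mid + int(np.argmax(density[mid:]))
    valley = left_mode + int(np.argmin(density[left_mode:right_mode + 1]))
    return float(grid[valley])

# threshold = pick_threshold(features["score"].to_numpy())
```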

Alternatively, you can borrow a solution from classification literature: know your costs and benefits for all four cases in the confusion matrix, and estimate their frequencies as a function of the threshold. But that will require a relatively large set of labeled examples due to the high imbalance.

Finally, we add summary statistics for every record within a cluster to indicate our confidence in the matching quality.

For each record assigned to a cluster, we compute the minimum, average, and maximum across similarity scores with all other records in the same cluster (image by author).
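One way to compute such statistics is to re-score every pair inside a cluster (not only the pairs that crossed the threshold) and aggregate per record; the sketch below reuses comparer, weights, and the cluster column from the earlier snippets.

```python
import itertools
import pandas as pd

# All intra-cluster pairs, including those that never crossed the threshold.
intra_pairs = pd.MultiIndex.from_tuples([
    pair
    for _, members in df.dropna(subset=["cluster"]).groupby("cluster")
    for pair in itertools.combinations(members.index, 2)
])
intra = comparer.compute(intra_pairs, df)
intra["score"] = sum(w * intra[col] for col, w in weights.items())

# Each pair contributes to the statistics of both of its records.
per_record = pd.concat([
    intra["score"].rename_axis(["record", "other"]).reset_index(),
    intra["score"].rename_axis(["other", "record"]).reset_index(),
])
cluster_stats = per_record.groupby("record")["score"].agg(["min", "mean", "max"])
df = df.join(cluster_stats)
```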

Skilled reviewers can use these statistics to sort and quickly work off the almost perfect matches and spend time where human review matters most.

How geoapify.com helps improve quality and enrich location records

Geocoding is the process of translating an address into latitude and longitude. There are plenty of free services if you deal with just a small set of addresses — check out GeoPy. But none of them are practicable (and likely not even permitted) once the size of your data exceeds, say, a thousand records. And a thousand is still tiny in the real world. Even if you start looking into commercial providers like Google Maps, you will find that they either do not offer a “batch” geocoding service or are expensive. Fortunately, geoapify.com fills this niche. And that’s not the only good news: their web service builds on the openstreetmap.org ecosystem. Establishing a connection between internal and open data opens up opportunities beyond location data quality.

OK, but why discuss geocoding when the topic is data quality? First, it is a special kind of record linkage solution for addresses. Indeed, we could even use it as a preprocessing step for the record linkage in the previous section. But the main reason is that the service does not expect perfect search input from users. Nominatim (OpenStreetMap’s geocoding engine) extracts features from the search text and applies scoring logic to determine the best match with a known location record. That best match is delivered in a structured form, including geo-coordinates and several confidence scores: for the street, for the city, and an overall confidence. A low value in any of the three indicates poor match quality, which helps to quickly identify issues in the original input.

We continue with our example from the previous section. If you want to follow along, sign up at geoapify.com and generate your API key. An extensive free tier allows you to geocode up to 6,000 addresses per day.

The batch geocoding service accepts lists of strings as input, one string per address. We concatenate our structured address data to make this work, request batch geocoding, and present selected output attributes parsed into a DataFrame.
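A minimal sketch of this step. For readability it calls the single-address forward-geocoding endpoint in a loop; the batch endpoint works asynchronously on a whole list of addresses and returns the same kind of attributes per address. The exact names of the confidence fields under rank follow the article’s description and should be checked against the Geoapify API reference.

```python
import requests
import pandas as pd

API_KEY = "YOUR_GEOAPIFY_KEY"  # sign up at geoapify.com to create one
URL = "https://api.geoapify.com/v1/geocode/search"

def geocode(address: str) -> dict:
    """Forward-geocode one free-text address and return selected attributes."""
    resp = requests.get(URL, params={"text": address, "limit": 1, "apiKey": API_KEY})
    resp.raise_for_status()
    props = resp.json()["features"][0]["properties"]
    rank = props.get("rank", {})
    return {
        "query": address,
        "formatted": props.get("formatted"),
        "lat": props.get("lat"),
        "lon": props.get("lon"),
        "result_type": props.get("result_type"),
        "place_id": props.get("place_id"),
        "confidence": rank.get("confidence"),
        "confidence_city": rank.get("confidence_city_level"),
        "confidence_street": rank.get("confidence_street_level"),
    }

# One free-text search string per record, concatenated from the structured columns.
addresses = df["street"] + ", " + df["postcode"] + " " + df["city"] + ", " + df["country"]
geocoded = pd.DataFrame([geocode(a) for a in addresses])
```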

The output of the geoapify.com batch geocoding service is parsed to a DataFrame. The three last columns indicate data quality issues in the original input data (image by author).

The service returns much more than an address. It also indicates the type of the location. We do not expect entire districts as results, yet that is what we get for address 5: the original input turns out to be a PO box.

Conclusion and outlook

Companies grow organically or through mergers and acquisitions, and so does their data. Usually, the quality does not keep up with that growth. This article proposes a method to accelerate the clean-up of messy location records (customers with billing addresses, work sites, etc.): a two-step procedure based on algorithmic record linkage and geocoding. The algorithms scale well to millions of records, and, in my experience, skilled reviewers can complete the manual check in a surprisingly short time when using the outputs.

We use the batch geocoding service of geoapify.com to validate the address data quality. And that is just one of many opportunities enabled by their web service.

  • Enriching our data with geo-coordinates allows us to add location intelligence to many problems we can tackle with data science; see this open-source book for an introduction to spatial data science. Do you deal with customer churn? Have you checked whether customers located near others who churned are also at risk of leaving?
  • Geoapify.com builds on the OpenStreetMap ecosystem, which links to Wikidata. So we can connect our internal location records with a massive open data set. The place_id attribute is part of every geocoding output, and it can tell us a lot more about a location. Another geoapify.com endpoint, the Place Details API, helps with this job; see the sketch below. For example, using the place_id of address 7, Hotel Astoria, we get many more details, such as a link to their website and the Wikidata ID Q649690. Conversely, the Places API can tell us which hotels are missing from our customer database in any given region.
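As a rough sketch of the Place Details lookup (the endpoint URL and the response property names are assumptions based on my reading of the Geoapify docs; verify them against the current API reference):

```python
import requests

API_KEY = "YOUR_GEOAPIFY_KEY"
URL = "https://api.geoapify.com/v2/place-details"

# place_id as returned by the geocoding step, e.g. for address 7, Hotel Astoria.
place_id = geocoded.loc[6, "place_id"]

resp = requests.get(URL, params={"id": place_id, "apiKey": API_KEY})
resp.raise_for_status()
details = resp.json()["features"][0]["properties"]

# Property names such as 'website' and 'wiki_and_media' are assumptions;
# inspect 'details' to see what is actually available for a given location.
print(details.get("website"), details.get("wiki_and_media", {}).get("wikidata"))
```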

This was my first article about location intelligence and related topics. More will follow soon.

