
Basic Data Quality Scoring. Weighting Features by User Rankings | by Andre Violante | May, 2022



Weighting Features by User Rankings

Image from Unsplash.com by @javaistan

I recently worked on a project with a very fast turnaround time for determining data quality. The available data was mostly 3rd-party or vendor-supplied B2B data, and it was very sparse (lots of null values). If you’ve worked with sales or marketing departments at a B2B company, this isn’t new or unusual. If you haven’t, imagine you have data about potential customers that you want your sales or marketing teams to target for your specific product. For example, if you sell software to other businesses, which prospects or leads should the sales and marketing teams target first with messaging? Who should the sales team call first or prioritize from the massive vendor list? This was basically the request, and the first project that came to my team within a few weeks of my joining the company.

Coming up with a way to create a Data Quality Score (DQS) can take you down many paths with many interpretations. If you’re a data scientist, the path may seem clear from a machine learning perspective. But what if you don’t have the time or resources to go through the research and data science due diligence? What if you only have 1–2 weeks to deliver something?

In this post, I’ll walk through a very basic 3-step approach to derive a DQS for your data using weights generated from user rankings. It’s a fast and basic approach that can be done without any machine learning experience, yet still delivers actionable results. Keep in mind, the name DQS may sound misleading because the score does not reflect the “quality” of a record with respect to an outcome; that would require a probabilistic machine learning model. Instead, the DQS reflects the information available for a given data record, weighting each available field by how important stakeholders deemed it in a survey. Sometimes keeping it simple with data is the best approach and can get you to a v1 very quickly!

Step 1: Which Data Matters Most

The first step is getting stakeholder feedback on which features or variables matter most to solving the problem. One way to do this is by simply asking the customer (business stakeholders) what matters most with respect to the problem. Your stakeholders typically have a lot of domain expertise, so they will know which fields in the data are most helpful to their work. Let’s use lead generation as an example. The stakeholders may already know that they’ve seen the best success with leads when they know the job of the individual or whether they’ve shopped with us in the past. So when working leads, they may immediately filter out any record where job = NaN to avoid wasting time. This is where the DQS can help: we can weight and derive a score based off many fields instead of just one, while giving them a single value to sort or filter on! So we just need to do a quick exercise with the business to get a list of fields they believe are important.

Table 1: Generated Data Example

Once you have a good list of these “important” features per the stakeholders, you can move on to creating a survey asking stakeholders to rank those features in terms of importance (based on their experience with lead success). Let’s say that from our quick exploratory conversations we have 9 possible features that our stakeholders say are important (Table 1). We’ll ask them to force rank each of the features in our survey, and we should get something back like Table 2, where each row is an individual’s response across the features. Notice that at the individual (row) level there are no repeating ranking values, since each respondent must use each rank exactly once.

Table 2: Force Rank Importance by Column or Feature. Each Row represents a single response across features.
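As a quick sketch, here is what the collected survey responses might look like in code. The feature names and rank values below are made up for illustration (only lastname, city, salary, and job appear in this post’s examples); the check at the end verifies the forced-ranking property described above:

```python
# Hypothetical forced-ranking survey responses: each inner list is one
# stakeholder's ranking of the 9 features (1 = most important).
FEATURES = ["firstname", "lastname", "email", "phone", "city",
            "state", "job", "salary", "prior_customer"]

responses = [
    [2, 1, 3, 4, 6, 7, 5, 9, 8],
    [1, 2, 4, 3, 5, 8, 6, 9, 7],
    [3, 1, 2, 5, 4, 6, 7, 8, 9],
]

# Forced ranking means every respondent uses each rank exactly once,
# so no rank value repeats within a row.
for row in responses:
    assert sorted(row) == list(range(1, len(FEATURES) + 1))
```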

Step 2: Get Rank Values and Weights

Now that we have our survey responses, there are many things you can do. To keep it easy, I took the average ranking per feature and then ranked that output 1 through 9, as you can see in the avg_rating and feature_rank columns of Table 3 below. Once you have that average rating, you can apply the simple rank sum equation: n – r + 1, where n is the total number of features (9) and r is the specific feature’s rank (1–9).

Example: the average rank for lastname is 4.067, making it the lowest-ranked (i.e., most important) of the features, so its feature_rank is 1. Therefore, the rank sum equation gives 9 – 1 + 1, and rank_sum equals 9.

Table 3: Average rank aggregation, new feature ranking, rank sum calculation, and new feature weights
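Step 2 can be sketched in a few lines of code. The average ranks below are invented (only lastname’s 4.067 comes from the example above), but the ranking and rank sum logic follows the n – r + 1 equation:

```python
# Hypothetical average survey rank per feature (lower = more important).
avg_rank = {
    "lastname": 4.067, "job": 4.3, "email": 4.6, "salary": 4.9,
    "city": 5.1, "prior_customer": 5.4, "phone": 5.7,
    "firstname": 6.0, "state": 6.3,
}

n = len(avg_rank)  # 9 features

# feature_rank: 1..n by ascending average rank.
ordered = sorted(avg_rank, key=avg_rank.get)
feature_rank = {f: i + 1 for i, f in enumerate(ordered)}

# rank_sum = n - r + 1, so the top-ranked feature gets the largest value.
rank_sum = {f: n - r + 1 for f, r in feature_rank.items()}

print(feature_rank["lastname"], rank_sum["lastname"])  # 1 9
```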

Once we’ve applied the rank sum equation to all our features, we want to assign weights. When using rank sum, the common way is to divide each rank_sum value by the sum of the rank_sum column, which gives you a value between 0 and 1 that shows, as a percent, how much of the total this ranked feature contributes.

Example: lastname has a rank_sum value of 9. The sum or total of our rank_sum column is 45. Therefore we assign the weight 0.20 (9/45) for the feature lastname.
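Continuing the sketch (the rank sums here are illustrative, with lastname at 9 as in the example above), the weights fall out of a single division:

```python
# Hypothetical rank sums for the 9 features (lastname ranked first).
rank_sum = {"lastname": 9, "job": 8, "email": 7, "salary": 6, "city": 5,
            "prior_customer": 4, "phone": 3, "firstname": 2, "state": 1}

total = sum(rank_sum.values())  # 1 + 2 + ... + 9 = 45

# Each feature's weight is its share of the total rank sum.
weights = {f: rs / total for f, rs in rank_sum.items()}

print(weights["lastname"])  # 0.2 (9/45, matching the example above)
```

By construction the weights sum to 1, which is what makes the scoring step in the next section so convenient.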

Step 3: Apply your Data Quality Score

Now that we have weights for all our features, the next step is also simple. Looking back at our original data in Table 1, we’re going to replace all NaN values with a 0 and all actual values with the corresponding feature weight. For our binary features (True/False), it’s the same thing: NaN values get a 0, and all other values receive the weight. A record’s DQS is then the sum of these values. Remember, the actual data I worked with included A LOT of NaN values compared to the generated data in Table 1, so this approach made sense and yielded good results. Let’s look at a quick scoring example to make sure it’s clear.

Example: Looking at our first “lead” in Table 1, we see Gina Beasley. Her record is completely available except for the city feature, where we have a NaN value. Since we know the sum of our weights = 1, the score is simply the city weight subtracted from 1, or 1 – 0.10, so the data quality score for Gina would be 0.90. To compare, Donald Solomon is also missing only 1 field, salary, but he would receive a slightly lower DQS of 0.87 (1 – 0.13) because, per the stakeholders, salary is more important to them than city.
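Here is one way the scoring step might be sketched. The weights and the sample record are illustrative (the unrounded city weight of 0.111 gives 0.89 rather than the rounded 0.90 above), and the NaN handling mirrors the replace-with-zero rule described earlier:

```python
import math

# Hypothetical feature weights (sum to 1), as derived in Step 2.
weights = {"lastname": 0.200, "job": 0.178, "email": 0.156, "salary": 0.133,
           "city": 0.111, "prior_customer": 0.089, "phone": 0.067,
           "firstname": 0.044, "state": 0.022}

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def dqs(record, weights):
    """DQS = sum of the weights of every non-missing field, rounded to 2dp."""
    return round(sum(w for f, w in weights.items()
                     if not is_missing(record.get(f))), 2)

# A made-up record that is complete except for city (NaN).
gina = {"firstname": "Gina", "lastname": "Beasley", "email": "g@example.com",
        "phone": "555-0100", "city": float("nan"), "state": "NC",
        "job": "Analyst", "salary": 85000.0, "prior_customer": True}

print(dqs(gina, weights))  # 0.89 = 1 - 0.111 (city's weight), rounded
```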

Conclusion

Well, that’s your first pass at deriving a Data Quality Score. It’s simple to explain to the business, deploys easily with some simple code, and gets you to a v1 within a couple of days! I acknowledge that there are lots of assumptions in this approach and that it’s very simple, but it provided a quick win for the team and departments we worked with. That said, take this approach, modify it, and build your next great DQS!

References

  1. Weighting by Ranking resource: http://www.gitta.info/Suitability/en/html/Normalisatio_learningObject1.html
  2. GitHub Repo with Starter Code

Acknowledgements

I worked on this great project with my good colleague Michael Jimney! This was a fun early win Mike. Thank you!


