Practical Data Quality Auditing: A Comprehensive Guide | by Mohamed A. Warsame | May, 2023


Image by author.

You can’t manage what you can’t measure — Peter Drucker

Data quality auditing is an indispensable skill in our rapidly evolving, AI-empowered world. Just like crude oil needs refining, data also requires cleaning and processing to be useful. The old adage “garbage in, garbage out” remains as relevant today as it was in the early days of computing.

In this article, we’ll explore how Python can help us ensure our datasets meet quality standards for successful projects. We’ll delve into Python libraries, code snippets, and examples that you can use in your own workflows.

Table of Contents:

  1. Understanding Data Quality and Its Dimensions
  2. Validating Data Using Pydantic and pandas_dq
  3. Comparing Pydantic and pandas_dq
  4. Exploring Accuracy and Consistency
  5. Data Quality Auditing with pandas_dq
  6. Conclusion

Before diving into tools and techniques, let’s first review the concept of data quality. According to a widely accepted industry definition, data quality refers to the degree to which a dataset is accurate, complete, timely, valid, unique in identifier attributes, and consistent.

Data Quality Dimensions. Image by author.

Completeness

Completeness in data quality encompasses the availability of all vital data elements required to fulfill a specific objective. Take, for example, a customer database tailored for marketing purposes; it would be deemed incomplete if essential contact information such as phone numbers or email addresses were missing for certain customers.

To ensure data completeness, organizations can employ data profiling techniques.

Data profiling is the systematic examination and assessment of datasets to uncover patterns, inconsistencies, and anomalies.

By scrutinizing the data meticulously, one can identify gaps, idiosyncrasies, or missing values, enabling corrective measures such as sourcing the missing information or implementing robust data validation processes. The result is a more reliable, complete, and actionable dataset that supports better decision-making, more effective marketing, and, ultimately, business success.

But before thorough data profiling, the first step in any data quality audit is a review of the data dictionary: a concise, descriptive reference that defines the structure, attributes, and relationships of data elements within a dataset, serving as a guide for understanding and interpreting the data’s meaning and purpose.

Data Dictionary Example. Image by author.
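For illustration, a data dictionary can be as simple as a small table kept alongside the code. The sketch below mirrors a few columns of the loan dataset used later in this article; the descriptions are hypothetical placeholders, not part of the original data.

import pandas as pd

# minimal, hypothetical data dictionary for a few columns of the loan dataset
data_dictionary = pd.DataFrame([
    {"column": "Loan_ID", "dtype": "int64", "description": "Unique identifier for each application"},
    {"column": "Gender", "dtype": "int64", "description": "Encoded applicant gender (1 or 2)"},
    {"column": "ApplicantIncome", "dtype": "float64", "description": "Income declared by the applicant"},
    {"column": "Loan_Status", "dtype": "object", "description": "Approval outcome, 'Y' or 'N'"},
])

print(data_dictionary.to_string(index=False))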

With a thorough review (or creation) of your data dictionary in hand, assessing completeness becomes straightforward when you leverage low-code libraries such as Sweetviz, missingno, or pandas_dq.

import missingno as msno
import sweetviz as sv
from pandas_dq import dq_report

# completeness check
msno.matrix(df)

# data profiling
report = sv.analyze(df)
report.show_notebook()

Personally, I gravitate towards the Pandas-Matplotlib-Seaborn combo, since it gives me full control over the output and lets me craft an engaging, visually appealing analysis.

# check for missing values
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_missing_values(df: pd.DataFrame, title="Missing Values Plot"):
    # displot is a figure-level function, so it creates its own figure
    sns.displot(
        data=df.isna().melt(value_name="missing"),
        y="variable",
        hue="missing",
        multiple="fill",
        height=6,
        aspect=1.25,
    )
    plt.title(title)
    plt.show()

plot_missing_values(df)

Missing Values Plot. Image by author.

Uniqueness

Uniqueness is a data quality dimension that emphasizes the absence of duplicate data in columns with a uniqueness constraint. Each record should represent a unique entity without redundancy. For example, a user list should have a unique ID for each registered user; multiple records with the same ID indicate a lack of uniqueness.

In the example below, I mimic a data integration step by concatenating two identically structured datasets. The Pandas concat function's verify_integrity argument raises an error if the resulting index contains duplicates:

# verify integrity check
df_loans = pd.concat([df, df_pdf], verify_integrity=True)

# check duplicated ids
df_loans[df_loans.duplicated(keep=False)].sort_index()

Violation of Uniqueness. Image by author.
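Note that verify_integrity raises a ValueError rather than returning a report, so during an audit you may prefer to catch it and continue. A minimal sketch of that pattern, assuming the same two frames as above:

# catch the uniqueness violation instead of letting it stop the audit
try:
    df_loans = pd.concat([df, df_pdf], verify_integrity=True)
except ValueError as e:
    print(f"Uniqueness violated during concat: {e}")
    # concatenate anyway so the duplicated IDs can be inspected afterwards
    df_loans = pd.concat([df, df_pdf])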

Ideally, you would check for duplicates as part of your data quality audit.

def check_duplicates(df, col):
    '''
    Check how many duplicates are in col.
    '''
    # first step: set the column as the index
    df_check = df.set_index(col)
    count = df_check.index.duplicated().sum()
    del df_check
    print(f"There are {count} duplicates in {col}")

Timeliness

Timeliness is an aspect of data quality that focuses on the availability and cadence of the data. Up-to-date and readily available data is essential for accurate analysis and decision-making. For example, a timely sales report should include the most recent data possible, not only data from several months prior. The dataset we have been using thus far for the examples doesn’t have a time dimension for us to explore cadence more deeply.

Timeliness example. Image by author.
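To make the idea concrete anyway, here is a minimal, hypothetical freshness check on a timestamped table; the df_sales frame, its created_at column, and the one-day threshold are illustrative assumptions, not part of the loan dataset.

import pandas as pd

# hypothetical sales table with a created_at timestamp column
df_sales = pd.DataFrame({
    "order_id": [1, 2, 3],
    "created_at": pd.to_datetime(["2023-05-01", "2023-05-10", "2023-05-14"]),
})

# data freshness: how old is the most recent record?
lag = pd.Timestamp.now() - df_sales["created_at"].max()
print(f"Most recent record is {lag} old")

# flag the dataset as stale if the newest record exceeds the agreed threshold
threshold = pd.Timedelta(days=1)
print("Stale data!" if lag > threshold else "Data is fresh")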

Validity

As we transition to the concept of validity, one should recognize its role in ensuring that data adheres to the established rules, formats, and standards. Validity guarantees compliance with the schema, constraints, and data types designated for the dataset. We can use the powerful Python library Pydantic for this:

# data validation on the data dictionary
from pydantic import BaseModel, conint, condecimal, constr

class LoanApplication(BaseModel):
    Loan_ID: int
    Gender: conint(ge=1, le=2)
    Married: conint(ge=0, le=1)
    Dependents: conint(ge=0, le=3)
    Graduate: conint(ge=0, le=1)
    Self_Employed: conint(ge=0, le=1)
    ApplicantIncome: condecimal(ge=0)
    CoapplicantIncome: condecimal(ge=0)
    LoanAmount: condecimal(ge=0)
    Loan_Amount_Term: condecimal(ge=0)
    Credit_History: conint(ge=0, le=1)
    Property_Area: conint(ge=1, le=3)
    Loan_Status: constr(regex="^[YN]$")

# Sample loan application data
loan_application_data = {
    "Loan_ID": 123456,
    "Gender": 1,
    "Married": 1,
    "Dependents": 2,
    "Graduate": 1,
    "Self_Employed": 0,
    "ApplicantIncome": 5000,
    "CoapplicantIncome": 2000,
    "LoanAmount": 100000,
    "Loan_Amount_Term": 360,
    "Credit_History": 1,
    "Property_Area": 2,
    "Loan_Status": "Y"
}

# Validate the data using the LoanApplication Pydantic model
loan_application = LoanApplication(**loan_application_data)
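It can also be instructive to see the failure mode. The sketch below feeds one deliberately invalid value through the model (the "MAYBE" status is made up and violates the ^[YN]$ constraint):

from pydantic import ValidationError

# deliberately invalid record: Loan_Status must match ^[YN]$
bad_application = {**loan_application_data, "Loan_Status": "MAYBE"}

try:
    LoanApplication(**bad_application)
except ValidationError as e:
    print(e)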

Once tested on a single example, we can run the entire dataset through a validation check, which should print "No data validation issues" if successful:

# data validation on the data dictionary
from pydantic import ValidationError
from typing import List

# Function to validate the DataFrame and return the rows that failed validation
def validate_loan_applications(df: pd.DataFrame) -> List[dict]:
    failed_applications = []

    for index, row in df.iterrows():
        row_dict = row.to_dict()
        try:
            # instantiating the model triggers validation
            LoanApplication(**row_dict)
        except ValidationError as e:
            print(f"Validation failed for row {index}: {e}")
            failed_applications.append(row_dict)

    return failed_applications

# Validate the entire DataFrame
failed_applications = validate_loan_applications(df_loans.reset_index())

# Print the failed loan applications or "No data validation issues"
if not failed_applications:
    print("No data validation issues")
else:
    for application in failed_applications:
        print(f"Failed application: {application}")

We can do the same with pandas_dq, using far less code:

from pandas_dq import DataSchemaChecker

schema = {
    'Loan_ID': 'int64',
    'Gender': 'int64',
    'Married': 'int64',
    'Dependents': 'int64',
    'Graduate': 'int64',
    'Self_Employed': 'int64',
    'ApplicantIncome': 'float64',
    'CoapplicantIncome': 'float64',
    'LoanAmount': 'float64',
    'Loan_Amount_Term': 'float64',
    'Credit_History': 'int64',
    'Property_Area': 'int64',
    'Loan_Status': 'object'
}

checker = DataSchemaChecker(schema)
checker.fit(df_loans.reset_index())

This returns an easy-to-read, DataFrame-styled report that details any validation issues encountered. I've deliberately supplied an incorrect schema in which some int64 variables are declared as float64, and the library has correctly identified these mismatches:

DataSchemaChecker output. Image by author.

The data type mismatch is rectified with a single line of code using the checker object created from the DataSchemaChecker class:

# fix issues
df_fixed = checker.transform(df_loans.reset_index())

DataSchemaChecker transform output. Image by author.

Pydantic or pandas_dq?

There are some differences between Pydantic and pandas_dq:

  1. Declarative syntax: Pydantic arguably lets you define the data schema and validation rules in a more concise, readable way, which makes the code easier to understand and maintain. I find it particularly helpful to be able to constrain the range of allowed values rather than just the data type.
  2. Built-in validation functions: Pydantic provides various powerful built-in validation functions like conint, condecimal, and constr, which allow you to enforce constraints on your data without having to write custom validation functions.
  3. Comprehensive error handling: When using Pydantic, if the input data does not conform to the defined schema, it raises a ValidationError with detailed information about the errors. This can help you easily identify issues with your data and take necessary action.
  4. Serialization and deserialization: Pydantic automatically handles serialization and deserialization of data, making it convenient to work with formats such as JSON and convert between them (a short sketch follows below).
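Point 4 is easy to demonstrate with the LoanApplication model defined earlier. The sketch below assumes Pydantic v1, whose .json() and parse_raw() methods handle serialization and deserialization:

# serialize the validated model to a JSON string (Pydantic v1 API)
payload = loan_application.json()
print(payload)

# deserialize the JSON string back into a validated object
restored = LoanApplication.parse_raw(payload)
print(restored.Loan_Status)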

In conclusion, Pydantic offers a more expressive, feature-rich, and user-friendly approach to data validation than the DataSchemaChecker class from pandas_dq.

Pydantic is likely a better choice for validating your data schema in a productionized environment. But if you just want to get up and running quickly with a prototype, you might prefer the low-code nature of the DataSchemaChecker.

Accuracy & Consistency

There are two further data quality dimensions that we haven't explored so far:

  • Accuracy is a data quality dimension that addresses the correctness of data, ensuring it represents real-world situations without errors. For instance, an accurate customer database should contain correct and up-to-date addresses for all customers.
  • Consistency deals with the uniformity of data across different sources or datasets within an organization. Data should be consistent in terms of format, units, and values. For example, a multinational company should report revenue in a single currency to maintain consistency across its offices in various countries (see the sketch below).
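To illustrate the consistency dimension, here is a minimal, hypothetical sketch that harmonizes revenue figures reported in different currencies; the column names and exchange rates are assumptions for illustration only.

import pandas as pd

# hypothetical revenue figures reported by two offices in different currencies
df_revenue = pd.DataFrame({
    "office": ["London", "New York"],
    "revenue": [1_000_000, 1_500_000],
    "currency": ["GBP", "USD"],
})

# illustrative exchange rates to a single reporting currency (USD)
fx_to_usd = {"GBP": 1.25, "USD": 1.0}

# enforce consistency: express every figure in the same currency
df_revenue["revenue_usd"] = df_revenue["revenue"] * df_revenue["currency"].map(fx_to_usd)
print(df_revenue)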

You can check all data quality issues present in a dataset using the dq_report function:

from pandas_dq import dq_report

dq_report(df_loans.reset_index(), target=None, verbose=1)

It detects the following data quality issues:

  • Strongly associated variables (multicollinearity)
  • Columns with no variance (redundant features)
  • Asymmetrical data distributions (anomalies, outliers, etc.)
  • Infrequent category occurrences

DQ Report from the pandas_dq library. Image by author.

Performing data quality audits is crucial for maintaining high-quality datasets, which in turn drive better decision-making and business success. Python offers a wealth of libraries and tools that make the auditing process more accessible and efficient.

By understanding and applying the concepts and techniques discussed in this article, you’ll be well-equipped to ensure your datasets meet the necessary quality standards for your projects.


Image by author.

You can’t manage what you can’t measure — Peter Drucker

Data quality auditing is an indispensable skill in our rapidly evolving, AI-empowered world. Just like crude oil needs refining, data also requires cleaning and processing to be useful. The old adage “garbage in, garbage out” remains as relevant today as it was in the early days of computing.

In this article, we’ll explore how Python can help us ensure our datasets meet quality standards for successful projects. We’ll delve into Python libraries, code snippets, and examples that you can use in your own workflows.

Table of Contents:

  1. Understanding Data Quality and Its Dimensions
  2. Validating Data Using Pydantic and pandas_dq
  3. Comparing Pydantic and pandas_dq
  4. Exploring Accuracy and Consistency
  5. Data Quality Auditing with pandas_dq
  6. Conclusion

Before diving into tools and techniques, let’s first review the concept of data quality. According to a widely accepted industry definition, data quality refers to the degree to which a dataset is accurate, complete, timely, valid, unique in identifier attributes, and consistent.

Data Quality Dimensions. Image by author.

Completeness

Completeness in data quality encompasses the availability of all vital data elements required to fulfill a specific objective. Take, for example, a customer database tailored for marketing purposes; it would be deemed incomplete if essential contact information such as phone numbers or email addresses were missing for certain customers.

To ensure data completeness, organizations can employ data profiling techniques.

Data profiling is the systematic examination and assessment of datasets to uncover patterns, inconsistencies, and anomalies.

By scrutinizing the data meticulously, one can identify gaps, idiosyncrasies or missing values, enabling corrective measures such as sourcing the missing information or implementing robust data validation processes. The result is a more reliable, complete, and actionable dataset that empowers better decision-making, optimized marketing efforts, and ultimately, drive business success.

But before thorough data profiling, the first step in any data quality audit is a review of the data dictionary: a concise, descriptive reference that defines the structure, attributes, and relationships of data elements within a dataset, serving as a guide for understanding and interpreting the data’s meaning and purpose.

Data Dictionary Example. Image by author.

With a thorough review or creation of your data dictionary in hand, assessing completeness becomes a breeze when you leverage the power of low-code libraries such as Sweetviz, Missingno, or Pandas_DQ.

import missingno as msno
import sweetviz as sv
from pandas_dq import dq_report

# completeness check
msno.matrix(df)

# data profiling
Report = sv.analyze(df)

Report.show_notebook()

Personally, I gravitate towards the Pandas-Matplotlib-Seaborn combo, as it provides me with the flexibility to have full control over my output. This way, I can craft an engaging and visually appealing analysis.

# check for missing values
import seaborn as sns
import matplotlib.pyplot as plt

def plot_missing_values(df: pd.DataFrame,
title="Missing Values Plot"):
plt.figure(figsize=(10, 6))

sns.displot(
data=df.isna().melt(value_name="missing"),
y="variable",
hue="missing",
multiple="fill",
aspect=1.25
)
plt.title(title)
plt.show()

plot_missing_values(df)

Missing Values Plot. Image by author.

Uniqueness

Uniqueness is a data quality dimension that emphasizes the absence of duplicate data in columns with uniqueness constraint. Each record should represent a unique entity without redundancy. For example, a user list should have unique IDs for each registered user; multiple records with the same ID indicate a lack of uniqueness.

In the below example I’m mimicking a data integration step of merging two identically structured datasets. The Pandas concat function’s argument verify_integrity throws an error if uniqueness is violated:

# verify integrity check
df_loans = pd.concat([df, df_pdf], verify_integrity=True)

# check duplicated ids
df_loans[df_loans.duplicated(keep=False)].sort_index()

Violation of Uniqueness. Image by author.

Ideally, you would check the presence of duplication as part of your data quality audit.

def check_duplicates(df, col):
'''
Check how many duplicates are in col.
'''
# first step set index
df_check = df.set_index(col)
count = df_check.index.duplicated().sum()
del df_check
print("There are {} duplicates in {}".format(count, col))

Timeliness

Timeliness is an aspect of data quality that focuses on the availability and cadence of the data. Up-to-date and readily available data is essential for accurate analysis and decision-making. For example, a timely sales report should include the most recent data possible, not only data from several months prior. The dataset we have been using thus far for the examples doesn’t have a time dimension for us to explore cadence more deeply.

Timeliness example. Image by author.

Validity

As we transition to the concept of validity, one should recognize its role in ensuring that data adheres to the established rules, formats, and standards. Validity guarantees compliance with the schema, constraints, and data types designated for the dataset. We can use the powerful Python library Pydantic for this:

# data validation on the data dictionary
from pydantic import BaseModel, Field, conint, condecimal, constr

class LoanApplication(BaseModel):
Loan_ID: int
Gender: conint(ge=1, le=2)
Married: conint(ge=0, le=1)
Dependents: conint(ge=0, le=3)
Graduate: conint(ge=0, le=1)
Self_Employed: conint(ge=0, le=1)
ApplicantIncome: condecimal(ge=0)
CoapplicantIncome: condecimal(ge=0)
LoanAmount: condecimal(ge=0)
Loan_Amount_Term: condecimal(ge=0)
Credit_History: conint(ge=0, le=1)
Property_Area: conint(ge=1, le=3)
Loan_Status: constr(regex="^[YN]$")

# Sample loan application data
loan_application_data = {
"Loan_ID": 123456,
"Gender": 1,
"Married": 1,
"Dependents": 2,
"Graduate": 1,
"Self_Employed": 0,
"ApplicantIncome": 5000,
"CoapplicantIncome": 2000,
"LoanAmount": 100000,
"Loan_Amount_Term": 360,
"Credit_History": 1,
"Property_Area": 2,
"Loan_Status": "Y"
}

# Validate the data using the LoanApplication Pydantic model
loan_application = LoanApplication(**loan_application_data)

Once tested with an example, we can run the entire dataset through a validation check which should print “no data validation issues” if successful:

# data validation on the data dictionary
from pydantic import ValidationError
from typing import List

# Function to validate DataFrame and return a list of failed LoanApplication objects
def validate_loan_applications(df: pd.DataFrame) -> List[LoanApplication]:
failed_applications = []

for index, row in df.iterrows():
row_dict = row.to_dict()

try:
loan_application = LoanApplication(**row_dict)
except ValidationError as e:
print(f"Validation failed for row {index}: {e}")
failed_applications.append(row_dict)

return failed_applications

# Validate the entire DataFrame
failed_applications = validate_loan_applications(df_loans.reset_index())

# Print the failed loan applications or "No data quality issues"
if not failed_applications:
print("No data validation issues")
else:
for application in failed_applications:
print(f"Failed application: {application}")

We can do the same with pandas_dq, using far less code:

from pandas_dq import DataSchemaChecker

schema = {
'Loan_ID': 'int64',
'Gender': 'int64',
'Married': 'int64',
'Dependents': 'int64',
'Graduate': 'int64',
'Self_Employed': 'int64',
'ApplicantIncome': 'float64',
'CoapplicantIncome': 'float64',
'LoanAmount': 'float64',
'Loan_Amount_Term': 'float64',
'Credit_History': 'int64',
'Property_Area': 'int64',
'Loan_Status': 'object'
}

checker = DataSchemaChecker(schema)

checker.fit(df_loans.reset_index())

This returns an easy to read Pandas dataframe style report that details any validation issues encounter. I’ve provided an incorrect schema where int64 variables have been reported to be float64 variables. The library has correctly identified these:

DataSchemaChecker output. Image by author.

The data type mismatch is rectified with a single line of code using the checker object created from the DataSchemaChecker class:

# fix issues
df_fixed = checker.transform(df_loans.reset_index())
DataSchemaChecker transform output. Image by author.

Pydantic or pandas_dq?

There are some differences between Pydantic and pandas_dq:

  1. Declarative syntax: arguably, Pydantic allows you to define the data schema and validation rules using a more concise and readable syntax. This can make it easier to understand and maintain your code. I find it super helpful to be able to define the ranges of possible values instead of merely the data type.
  2. Built-in validation functions: Pydantic provides various powerful built-in validation functions like conint, condecimal, and constr, which allow you to enforce constraints on your data without having to write custom validation functions.
  3. Comprehensive error handling: When using Pydantic, if the input data does not conform to the defined schema, it raises a ValidationError with detailed information about the errors. This can help you easily identify issues with your data and take necessary action.
  4. Serialization and deserialization: Pydantic automatically handles serialization and deserialization of data, making it convenient to work with different data formats (like JSON) and convert between them.

In conclusion, Pydantic offers a more concise, feature-rich, and user-friendly approach to data validation compared to the DataSchemaChecker class from pandas_dq.

Pydantic is likely a better choice for validating your data schema in a productionized environment. But if you just want to get up and running quickly with a prototype, you might prefer the low-code nature of the DataSchemaChecker.

Accuracy & Consistency

There are 2 further data quality dimensions which we haven’t explored up until now:

  • Accuracy is a data quality dimension that addresses the correctness of data, ensuring it represents real-world situations without errors. For instance, an accurate customer database should contain correct and up-to-date addresses for all customers.
  • Consistency deals with the uniformity of data across different sources or datasets within an organization. Data should be consistent in terms of format, units, and values. For example, a multinational company should report revenue data in a single currency to maintain consistency across its offices in various countries.

You can check all data quality issues present in a dataset using the dq_report function:

from pandas_dq import dq_report

dq_report(df_loans.reset_index(), target=None, verbose=1)

It detects the below data quality issues:

  • Strongly associated variables (multicollinearity)
  • Columns with no variance (redundant features)
  • Asymmetrical data distributions (anomalies, outliers, ect.)
  • Infrequent category occurrences
DQ Report from pandas_dq libray. Image by author.

Performing data quality audits is crucial for maintaining high-quality datasets, which in turn drive better decision-making and business success. Python offers a wealth of libraries and tools that make the auditing process more accessible and efficient.

By understanding and applying the concepts and techniques discussed in this article, you’ll be well-equipped to ensure your datasets meet the necessary quality standards for your projects.

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – admin@technoblender.com. The content will be deleted within 24 hours.
auditingComprehensiveDataGuidemachine learningMohamedPracticalQualityTech NewsTechnoblenderWarsame
Comments (0)
Add Comment