Supercharge your Data Cleaning Game with this New Tool | by David Farrugia | Apr, 2023


PYTHON | DATA | ANALYTICS

A guide to leveraging pandas_dq to effortlessly perform data cleaning

Photo by JESHOOTS.COM on Unsplash

If the title of this article piqued your interest, then you are definitely aware of how critical the data cleaning and pre-processing step is to the overall analytics project.

Whether you are preparing to train a machine learning model or simply want to perform some exploratory data analysis, dirty data is a sure obstacle in your way. You’ve probably heard the saying that preparing your data is 80% of the job.

The data cleaning process is perhaps one of the most time-consuming, exhausting, and often frustrating parts of the data analytics process. Hunting for duplicates, multicollinearity, and missing or infinite values, to name a few, eats into precious time that could be spent understanding the data and drawing actionable insights.

In this article, we will discuss the awesome Python package that is pandas_dq and how it can improve the speed and quality of your next data cleaning task.

First, we need to install the package. pandas_dq is available through pip:

pip install pandas_dq

Alternatively, you can install from source:

# clone the repository (or download and extract
# https://github.com/AutoViML/pandas_dq/archive/master.zip)
git clone git@github.com:AutoViML/pandas_dq.git
cd pandas_dq
# then install from the local checkout
pip install .

Given a dataset, getting started with the tool is super simple. It currently has three main components:

  • dq_report
  • Fix_DQ
  • DataSchemaChecker

dq_report

The purpose of this function is to generate a report with all the data quality issues that are present in our dataset.

Suppose we have the iris dataset:

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

we can run dq_report as follows:

from pandas_dq import dq_report

dq_report(df, target=None, verbose=1)

And this gives us the following result back:

Image by Author

The package quickly tells us that the feature sepal_width has 4 outliers and suggests that we either cap them (clip any outlier to a boundary value) or remove them. It can also identify whether we have multicollinear features (features with a high correlation), as is the case with petal_length and petal_width, for instance.
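If we opt to cap, the fix is short in plain pandas. Here is a minimal sketch using the classic 1.5 × IQR fences (a common convention and an assumption here; pandas_dq may apply a different rule internally):

# clip sepal_width to the Tukey fences (1.5 * IQR beyond the quartiles)
q1, q3 = df['sepal_width'].quantile([0.25, 0.75])
iqr = q3 - q1
df['sepal_width'] = df['sepal_width'].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)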

We also have the option to run target-focused data quality checks. In supervised learning tasks (whenever we have a target/label to predict), we need to also check the relationship between the features and our target objective. dq_report makes this super easy for us by allowing us to specify the target column.

Note: the target column must be numerical

# map string target to numeric
df['species'] = pd.factorize(df['species'])[0]

dq_report(df, target='species', verbose=1)

Image by Author

This function performs all of the following checks (a plain-pandas sketch of a few of them follows the list):

  1. It detects ID columns
  2. It detects zero-variance columns
  3. It identifies rare categories (categories that make up less than 5% of a column’s values)
  4. It finds infinite values in a column
  5. It detects mixed data types (i.e. a column that has more than a single data type)
  6. It detects outliers (i.e. a float column with values outside the interquartile range (IQR) fences)
  7. It detects high cardinality features (i.e. a feature that has more than 100 categories)
  8. It detects highly correlated features (i.e. two features that have an absolute correlation higher than 0.8)
  9. It detects duplicate rows (i.e. the same row occurs more than once in the dataset)
  10. It detects duplicate columns (i.e. the same column occurs twice or more in the dataset)
  11. It detects skewed distributions (i.e. a feature that has a skew greater than 1.0)
  12. It detects imbalanced classes (i.e. the target variable has one class that significantly outnumbers the others)
  13. It detects feature leakage (i.e. a feature that is highly correlated to target with correlation > 0.8)
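To make a few of these concrete, here is a rough plain-pandas equivalent of checks 8, 9, and 11. The 0.8 and 1.0 thresholds mirror the list above, but this is an illustrative sketch, not the package’s actual implementation:

features = df.drop('species', axis=1)

# check 8: pairs of features with absolute correlation above 0.8
corr = features.corr().abs()
pairs = corr.where(lambda c: c > 0.8).stack()  # NaNs (low correlations) are dropped
pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]

# check 9: duplicate rows
n_duplicates = df.duplicated().sum()

# check 11: features with an absolute skew greater than 1.0
skewed = features.skew().loc[lambda s: s.abs() > 1.0]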

Fix_DQ

This performs all of the same checks as dq_report but also actions them, all in a single line of code. It is usually run on the feature set (excluding the target column) as preparation for modelling.

from pandas_dq import Fix_DQ

fdq = Fix_DQ()
result = fdq.fit_transform(df.drop('species', axis=1))

We get the following output:

Alert: Detecting 1 duplicate rows...
Dropping petal_length which has a high correlation with ['sepal_length']
Dropping petal_width which has a high correlation with ['sepal_length', 'petal_length']
Alert: Dropping 1 duplicate rows can sometimes cause column data types to change to object. Double-check!

and the resultant dataframe:

Image by Author
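One practical note: because Fix_DQ drops duplicate rows, the features and the target can fall out of alignment. A minimal sketch to realign them before modelling, assuming the cleaned frame keeps the original index:

# realign the target with the cleaned features (a duplicate row was dropped)
X_clean = result
y = df.loc[X_clean.index, 'species']
assert len(X_clean) == len(y)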

DataSchemaChecker

This function takes in a data schema and ensures that our dataframe adheres to it. This is useful whenever we need to guarantee column data types, whether to feed consistent types to a pre-trained model, perform type validation or serialisation, or ingest the data into a database.
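A convenient way to build such a schema is straight from a dataframe whose types we trust, rather than typing it out by hand; a minimal sketch:

# derive a schema mapping column names to dtype strings from a reference dataframe
schema = df.dtypes.astype(str).to_dict()
# e.g. {'sepal_length': 'float64', ..., 'species': 'int64'}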

The main quirk (perhaps a bug) of this function is that it raises an AttributeError when there are no data type issues.

from pandas_dq import DataSchemaChecker

# deliberately declare every column (including the integer species) as float64
wrong_schema = dict(zip(df.columns, ['float64'] * 5))
ds = DataSchemaChecker(schema=wrong_schema)
ds.fit_transform(df)

Image by Author
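Given that quirk, a defensive wrapper keeps a pipeline from crashing when the dataframe already matches the schema; a minimal sketch:

try:
    fixed = ds.fit_transform(df)
except AttributeError:
    # quirk noted above: no data type issues were found, keep the frame as-is
    fixed = df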

