Data Cleaning With Pandas and NumPy


Data cleaning is one of the boring yet crucial steps in data analysis

Photo by Pixabay

Data cleaning is one of the most time-consuming tasks!

I must admit, real-world data is almost always messy and rarely comes in a clean form. It contains incorrect or abbreviated column names, missing data, incorrect data types, too much information in a single column, and so on.

It is important to fix these issues before processing the data. Ultimately, clean data boosts productivity and enables you to create accurate insights.

Therefore, I have listed three types of data cleaning you must know when processing data with Python.

For the sake of examples, I’m using an extended version of the Titanic dataset created by Pavlo Fesenko, which is freely available under a CC license.

Titanic Dataset | Image by Author

It is a simple dataset with 1309 rows and 21 columns. Below, I have shown plenty of examples of how you can get the best out of this data.

Let’s get started! 🚀

First things first, import pandas and read the CSV file into a pandas DataFrame. It is good practice to get a complete overview of the dataset’s size, columns, and their respective data types using the .info() method.

import pandas as pd

df = pd.read_csv("Complete_Titanic_Extended_Dataset.csv")
df.info()
df.info() to get an overview of dataset | Image by Author

Let’s start with the easiest cleaning step, which might also save you some memory and time as you go ahead with the processing.

You can notice that this dataset contains 21 columns, and you will rarely use all of them for your data analytics task. Therefore, select only the required columns.

For instance, suppose that for your task you don’t need the columns PassengerId, SibSp, Parch, WikiId, Name_wiki, and Age_wiki.

All you need to do is create a list of these column names and pass it to the df.drop() method as shown below.

columns_to_drop = ['PassengerId', 'SibSp',
                   'Parch', 'WikiId',
                   'Name_wiki', 'Age_wiki']
df.drop(columns_to_drop, inplace=True, axis=1)
df.head()
Keep only relevant columns | Image by Author

When you check the memory consumption using the argument memory_usage="deep" in the .info() method, you’ll notice the trimmed DataFrame consumes only 834 KB, as opposed to 1000 KB for the original DataFrame.
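
In case you want to reproduce this check, the call looks like the line below; passing memory_usage="deep" makes pandas inspect object columns for their true memory footprint instead of estimating it.

df.info(memory_usage="deep")  # reports accurate memory usage, including object columns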

These numbers might look small here, but the difference becomes significant when you deal with big datasets.

So dropping irrelevant columns saved 17% of the memory!

A minor drawback of dropping columns with the .drop() method is that it alters the original DataFrame when you use inplace=True. If you are still interested in the original DataFrame, you can assign the df.drop() output (without inplace) to another variable, like below.

df1 = df.drop(columns_to_drop, axis=1)

Alternatively, a scenario may arise where you want to drop a huge number of columns and keep only 4–5 of them. In that case, instead of using df.drop(), you should subset the DataFrame with the selected columns and use .copy().

For example, if you want to use only the Name, Sex, Age, and Survived columns from the dataset, you can subset the original dataset using .copy() as shown below.

df1 = df[["Name","Age","Sex","Survived"]].copy()
Subset DataSet using list of columns | Image by Author

Depending on the actual requirements of your task, you can use either of the above methods to choose only the relevant columns.

In the above picture, you might notice that some values in the Age and Survived columns are missing. This needs to be addressed before going ahead.

In almost all datasets, you need to deal with missing values, and it is one of the trickiest parts of data cleaning. If you want to use this data for machine learning, you should know that most models don’t accept missing data.

But how do you find the missing data?

There are a variety of ways to find out which sections or columns of the dataset have missing values. Below are the techniques commonly used to detect missing data.

The method .info()

This is one of the simplest ways to check whether there are missing values in any column. When you use df.info(), you see a quick overview of the DataFrame df, as below.

df.info() to get an overview of DataFrame | Image by Author

The column names shown in the red boxes above are the ones with missing values. Ideally, each column in this dataset should contain 1309 values, but this output shows that most columns contain fewer than 1309 non-null values.

You can also visualize these missing values.

Heatmap of missing data

This is one of the most common ways to visualize missing data. You can create a heatmap by encoding the data as Boolean values, i.e. 1 or 0, using the pandas method .isna().

What is .isna() in pandas?

The method .isna() returns a DataFrame of the same shape where each value is replaced with the Boolean True if it is NaN, and False otherwise.

All you need to do is type just one line of code, as below.

import seaborn as sns
sns.heatmap(df.isna())
Heatmap of missing values | Image by Author

The X-axis in the above graph shows all the column names, whereas the Y-axis represents the index or row numbers. The color bar on the right tells you which Boolean values denote the missing data.

Altogether, this helps you understand in which part of a specific column, i.e. between which index numbers, the data is missing.

If the column names are not easily readable, you can always create a transposed version, as below.

sns.heatmap(df.isna().transpose())
Heatmap of missing values | Image by Author

Such heatmaps are useful when there is a smaller number of features or columns. If there is a huge number of features, you can always subset them.

But keep in mind that the visualization takes time to create if the dataset is large.

Although a heatmap gives you an idea about the location of the missing data, it does not tell you the amount of missing data. You can get that using the next method.

Missing data as a percentage of total data

There is no single built-in method to get it, but you can combine the .isna() method with the small piece of code below.

import numpy as np

print("Amount of missing values in - ")
for column in df.columns:
    # Fraction of NaN values in the current column
    percentage_missing = np.mean(df[column].isna())
    print(f'{column} : {round(percentage_missing * 100)}%')
Missing values percentage in Pandas DataFrame | Image by Author

In this way, you can see what percentage of values is missing from each individual column. This can be useful while handling these missing values.
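
As a side note, pandas can compute the same percentages without an explicit loop; the vectorized one-liner below is an equivalent alternative.

print((df.isna().mean() * 100).round(1))  # per-column percentage of missing values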

I have identified the missing data, but what next?

There is no standard way of dealing with missing data. You have to look at each individual column, the amount of missing values in it, and the importance of that column for future use.

Based on these observations, you can use any of the three methods below to handle missing data.

  1. Drop the record — Drop an entire record (row) when a specific column has a missing value or NaN in it. Be aware that this technique can drastically reduce the number of records in the dataset if the mentioned column has a huge number of missing values.
  2. Drop the column or feature — This needs good research into the specific column to understand its importance in the future. You should do this only when you are confident that the feature does not provide any useful information, for example, the PassengerId feature in this dataset.
  3. Impute missing data — In this technique, you substitute the missing values or NaNs with the mean, median, or mode of the same column (see the sketch after this list).
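
Here is a minimal sketch of what each option can look like in pandas; which column you apply each one to is your call, and the columns below are just illustrative picks from this dataset.

# 1. Drop records where Age is missing
df_rows_dropped = df.dropna(subset=["Age"])

# 2. Drop a column you have judged uninformative
df_col_dropped = df.drop(columns=["Cabin"])

# 3. Impute missing Age values with the column median
df_imputed = df.copy()
df_imputed["Age"] = df_imputed["Age"].fillna(df_imputed["Age"].median())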

All these ways of handling missing data are a good discussion topic, which I’ll cover in the next article.

Apart from missing data, another common issue is incorrect data types, which need to be addressed to have good-quality data.

While working with different Python libraries, you may notice that a particular data type is needed to do a specific transformation. Therefore, the data type of each column should be correct and appropriate for its future use.

When you use read_csv or any other read_* function in pandas to get the data into a DataFrame, pandas tries to guess the data type of each column by observing the values stored in it.

This guess is almost correct for all columns except a few, and you need to correct the data types of those columns manually.

For example, in the Titanic dataset, you can see the column data types using .info(), as below.

Incorrect data types in pandas DataFrame | Image by Author

In the above output, the columns Age and Survived have the data type float64; however, Age should always be an integer, and Survived should contain only two kinds of values: Yes or No.

To understand it better, let’s look at 5 random rows from these columns.

df[["Name","Sex","Survived","Age"]].sample(5)
Sample 5 rows | Image by Author

Apart from the missing values, the Survived column has two values — 0.0 and 1.0 — which should ideally be the Booleans 0 and 1 for No and Yes, respectively. Also, the Age column contains values in decimal format.

Before proceeding, you can fix this issue by casting the columns to the correct types. Depending on your pandas version, you might need to deal with the missing values before correcting the data types.
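
As a minimal sketch, assuming a reasonably recent pandas version: the nullable extension dtypes can hold integers and Booleans alongside missing values, so the cast works even before imputation; on older versions you may have to fill the NaNs first.

# Nullable dtypes keep the NaNs as pd.NA instead of raising an error
df["Age"] = df["Age"].astype("Int64")               # nullable integer
df["Survived"] = df["Survived"].astype("boolean")   # nullable Boolean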

Along with the above data cleaning steps, you might need some of the data cleaning techniques below as well, depending on your use case.

  1. Replace values in a column — Sometimes columns in your dataset contain values such as True/False or Yes/No, which can easily be replaced with 1 and 0 to make the dataset usable for machine learning applications (see the sketch after this list).
  2. Remove outliers — Outliers are data points that differ significantly from other observations. However, it is not always a good idea to drop an outlier; these significantly different data points need careful evaluation.
  3. Remove duplicates — You can consider records duplicates when they have the same values in all columns, and the pandas DataFrame method .drop_duplicates() is quite handy for removing them.
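
Here is a minimal sketch of these steps on this dataset; the Sex mapping and the 1.5 * IQR outlier rule are illustrative choices, not the only options.

# Replace categorical values with numbers
df["Sex"] = df["Sex"].replace({"male": 0, "female": 1})

# Remove Fare outliers outside 1.5 * IQR (a common heuristic; the threshold is a choice)
q1, q3 = df["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
# Note: rows with a missing Fare are dropped too, since between() is False for NaN
df = df[df["Fare"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Remove exact duplicate records
df = df.drop_duplicates()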

That’s all!

