
4 Data Preprocessing Operations with Scikit-learn | by Soner Yıldırım | Dec, 2022



Help the algorithm by making the data proper

Photo by Harry Grout on Unsplash

Data preprocessing is a fundamental step in a machine learning pipeline. The exact requirements depend on the algorithm being used, but in general we cannot, or should not, expect algorithms to perform well on raw data.

Even well-structured models might fail to produce acceptable results if the raw data is not processed properly.

The term data preparation is sometimes used to cover both data cleaning and data preprocessing operations. The focus of this article is the data preprocessing part.

For instance, some algorithms require the numerical features to be scaled to similar levels. Otherwise, they tend to give more importance to the features that have a higher value range.

Consider a house price prediction task. The area of houses usually varies between 1000 and 2000 square feet, whereas the age is less than 50 in most cases. To prevent a machine learning model from giving more importance to the house area, we scale both features to lie between a given minimum and maximum value, such as 0 and 1. This process is called min-max scaling.

We will go over 4 commonly used data preprocessing operations, with code snippets that show how to perform them with Scikit-learn.

We will be using a bank churn dataset, which is available on Kaggle under a Creative Commons license. Feel free to download it and follow along.

import pandas as pd

# Read the dataset (only 5 columns) into a Pandas DataFrame
churn = pd.read_csv(
"BankChurners.csv",
usecols=["Attrition_Flag", "Marital_Status", "Card_Category", "Customer_Age", "Total_Trans_Amt"]
)

churn.head()

The first 5 rows of the DataFrame (image by author)

An important thing to mention here is the train-test split, which is crucial for assessing model performance. Just like we train models with data, we measure their accuracy with data. However, we cannot use the same data for both training and testing.

Before training the model, we should set aside some data for testing. This is known as the train-test split, and it must be done before any data preprocessing operation. Otherwise, we would cause data leakage, which essentially means the model learning about the properties of the test data.

Hence, all of the following operations must be done after the train-test split. Assume that the DataFrame we have (churn) contains only the training data.
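As a reminder, here is a minimal sketch of such a split with Scikit-learn, applied to the full dataset before any preprocessing; the 80/20 ratio, the random_state, and the use of Attrition_Flag as the target are illustrative assumptions rather than part of the original workflow.

from sklearn.model_selection import train_test_split

# Separate the features and the target (using Attrition_Flag as the churn label)
X = churn.drop(columns=["Attrition_Flag"])
y = churn["Attrition_Flag"]

# Hold out 20% of the rows for testing; fit all preprocessing on the training part only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)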

Real-life datasets are highly likely to include some missing values. There are two main approaches to handling them: dropping the rows or columns that contain missing values, or replacing them with suitable values.

In general, the latter is better because data is the most valuable asset in a data-based product and we do not want to waste it. The proper value to replace a missing value with depends on the characteristics and the structure of the dataset.

The dataset we are using does not have any missing values so let’s add some on purpose to demonstrate how to handle them.

import numpy as np

churn.iloc[np.random.randint(0, 1000, size=25), 1] = np.nan
churn.iloc[np.random.randint(0, 1000, size=25), 4] = np.nan

churn.isna().sum()

# output
Attrition_Flag 0
Customer_Age 24
Marital_Status 0
Card_Category 0
Total_Trans_Amt 24
dtype: int64

In the code snippet above, NumPy arrays of 25 random integers are used to select the indices of the rows whose values in the second and fifth columns (Customer_Age and Total_Trans_Amt) are replaced with a missing value (np.nan).

In the output, we see that there are only 24 missing values in these columns because the randomly generated arrays may include duplicate values, so fewer than 25 unique rows are affected.

To handle these missing values, we can use the SimpleImputer class, which is an example of univariate feature imputation. SimpleImputer provides basic strategies for imputing missing values: they can be replaced with a provided constant value or with a statistic (mean, median, or most frequent value) of the column in which they are located.

Let’s use the mean value of the column to replace the missing values.

from sklearn.impute import SimpleImputer

# Create an imputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Apply it to the numeric columns
numeric_features = ["Customer_Age", "Total_Trans_Amt"]
churn[numeric_features] = imputer.fit_transform(churn[numeric_features])

churn.isna().sum()

# output
Attrition_Flag 0
Customer_Age 0
Marital_Status 0
Card_Category 0
Total_Trans_Amt 0
dtype: int64

In the code snippet above, a SimpleImputer object is created with the mean strategy, which means it imputes the missing values using the mean of each column. Then, we use it to replace the missing values in the customer age and total transaction amount columns.

Scikit-learn also provides more sophisticated methods for imputing missing values. For instance, the IterativeImputer class is an example of multivariate feature imputation and it models each feature with missing values as a function of other features, and uses that estimate for imputation.
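For reference, here is a minimal sketch of how IterativeImputer could be applied to the same numeric columns; the parameter values are illustrative, not taken from the original example.

# IterativeImputer is still experimental, so this explicit import is required
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Model each numeric feature with missing values as a function of the other numeric features
iterative_imputer = IterativeImputer(max_iter=10, random_state=0)
churn[numeric_features] = iterative_imputer.fit_transform(churn[numeric_features])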

We mentioned that a feature that has a higher value range compared to other features might be given more importance, which might be misleading. Moreover, models tend to perform better and converge faster when the features are on a relatively similar scale.

One option for handling features with very different value ranges is standardization, which means centering the data by subtracting the mean of each feature and then scaling it by dividing non-constant features by their standard deviation. The resulting features have a standard deviation of 1 and a mean very close to zero. Note that standardization only shifts and rescales a feature (i.e. a variable or column in a dataset); it does not change the shape of its distribution.

Let’s apply the StandardScaler class of Scikit-learn to the customer age and total transaction amount columns. As we see in the output below, these two columns have highly different value ranges.

churn[["Customer_Age", "Total_Trans_Amt"]].head()
(image by author)

Let’s apply standardization on these features and check the values afterwards.

from sklearn.preprocessing import StandardScaler

# Create a scaler object
scaler = StandardScaler()

# Fit training data
scaler.fit(churn[["Customer_Age", "Total_Trans_Amt"]])

# Transform the feature values
churn[["Customer_Age", "Total_Trans_Amt"]] = scaler.transform(churn[["Customer_Age", "Total_Trans_Amt"]])

# Display the transformed features
churn[["Customer_Age", "Total_Trans_Amt"]].head()

(image by author)

Let’s also check the standard deviation and mean value of a transformed feature.

churn["Customer_Age"].apply(["mean", "std"])

# output
mean -7.942474e-16
std 1.000049e+00
Name: Customer_Age, dtype: float64

The standard deviation is 1 and the mean is very close to 0 as expected.

Another way of bringing the value ranges to a similar level is scaling them to a specific range. For instance, we can squeeze each column between 0 and 1 in such a way that the minimum and maximum values before scaling become 0 and 1 after scaling. This kind of scaling can be achieved with the MinMaxScaler of Scikit-learn.

from sklearn.preprocessing import MinMaxScaler

# Create a scaler object
mm_scaler = MinMaxScaler()

# Fit training data
mm_scaler.fit(churn[["Customer_Age", "Total_Trans_Amt"]])

# Transform the feature values
churn[["Customer_Age", "Total_Trans_Amt"]] = mm_scaler.transform(churn[["Customer_Age", "Total_Trans_Amt"]])

# check the feature value range after transformation
churn["Customer_Age"].apply(["min", "max"])

# output
min 0.0
max 1.0
Name: Customer_Age, dtype: float64

As we see in the output above, the minimum and maximum values of these features are 0 and 1, respectively. The default range for the MinMaxScaler is [0, 1], but we can change it using the feature_range parameter, as in the sketch below.
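A small illustrative sketch of the feature_range parameter; the [-1, 1] range here is an arbitrary choice, not part of the original example.

# Scale the same columns to the range [-1, 1] instead of the default [0, 1]
custom_scaler = MinMaxScaler(feature_range=(-1, 1))
churn[["Customer_Age", "Total_Trans_Amt"]] = custom_scaler.fit_transform(
    churn[["Customer_Age", "Total_Trans_Amt"]]
)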

StandardScaler and MinMaxScaler are not robust to outliers. Suppose we have a feature whose values are between 100 and 500, with an exceptional value of 25000. If we scale this feature with MinMaxScaler(feature_range=(0, 1)), 25000 is scaled to 1 and all the other values become very close to the lower bound, which is zero.

Thus, we end up having a disproportionate scale which negatively affects the performance of a model. One solution is to remove the outliers and then apply scaling. However, it may not always be a good practice to remove outliers. In such cases, we can use the RobustScaler class of Scikit-learn.

RobustScaler, as the name suggests, is robust to outliers. It removes the median and scales the data according to a quantile range (by default the IQR, the interquartile range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). RobustScaler does not limit the scaled values to a predetermined interval, so we do not need to specify a range as we do for MinMaxScaler.
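A minimal sketch of how RobustScaler could be applied to the same two columns, assuming its default settings:

from sklearn.preprocessing import RobustScaler

# Removes the median and scales by the IQR of each column by default
robust_scaler = RobustScaler()
churn[["Customer_Age", "Total_Trans_Amt"]] = robust_scaler.fit_transform(
    churn[["Customer_Age", "Total_Trans_Amt"]]
)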

We often work with datasets that have categorical features, which also require some preprocessing just like numerical features.

Some algorithms expect categorical variables in a numeric or one-hot encoded format. Label encoding simply means converting categories into numbers. For instance, a size feature with the values S, M, and L can be converted to a feature with the values 1, 2, and 3.
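As an illustration, here is a small sketch of label (ordinal) encoding with Scikit-learn's OrdinalEncoder on a made-up size column; note that OrdinalEncoder assigns integers starting from 0, so S, M, and L become 0, 1, and 2 here.

from sklearn.preprocessing import OrdinalEncoder

# A hypothetical size feature, just to illustrate label (ordinal) encoding
sizes = pd.DataFrame({"Size": ["S", "M", "L", "M", "S"]})

# Pass the desired order explicitly so that S < M < L
ordinal = OrdinalEncoder(categories=[["S", "M", "L"]])
sizes["Size_encoded"] = ordinal.fit_transform(sizes[["Size"]])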

If a categorical variable is not ordinal (i.e. there is no hierarchical order among its values), label encoding is not enough. We need to encode nominal categorical variables using one-hot encoding.

Suppose we applied label encoding to the marital status feature, so that the Unknown status is encoded as 3 while the Married status is encoded as 1. A machine learning model might interpret this as the Unknown status being superior to or higher than the Married status, which is not true. There is no hierarchical relationship between these values.

In such cases, it is better to do one-hot encoding, which creates a binary column for each category. Let’s apply it to the marital status column.

from sklearn.preprocessing import OneHotEncoder

# Create a one-hot encoder
onehot = OneHotEncoder()

# Create an encoded feature
encoded_features = onehot.fit_transform(churn[["Marital_Status"]]).toarray()

# Create a DataFrame with the encoded features (categories_[0] holds the category names)
encoded_df = pd.DataFrame(encoded_features, columns=onehot.categories_[0])

# Display the first 5 rows
encoded_df.head()

(image by author)

Since there are 4 different values in the marital status column (Divorced, Married, Single, Unknown), 4 binary columns are created. The first value of the marital status column is “Married” so the Married column takes a value of 1 in the first row. All the other values in the first row are 0.

One important thing to mention is the drop parameter. If there are n distinct values in a categorical column, we can do one-hot encoding with n-1 columns because one of the columns is redundant. For instance, in the output above, when three of the columns are 0, the row must belong to the fourth category, so we do not actually need the fourth column. We can use the drop parameter of OneHotEncoder to drop one of the columns, as shown in the sketch below.
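A minimal sketch of dropping the first category; the use of get_feature_names_out for the column names assumes Scikit-learn 1.0 or later and is not part of the original code.

# Drop the first category so n distinct values are encoded with n - 1 binary columns
onehot_drop = OneHotEncoder(drop="first")
encoded_drop = onehot_drop.fit_transform(churn[["Marital_Status"]]).toarray()

# Only 3 binary columns remain for the 4 marital status categories
encoded_drop_df = pd.DataFrame(
    encoded_drop, columns=onehot_drop.get_feature_names_out(["Marital_Status"])
)
encoded_drop_df.head()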

We have covered some of the most frequently performed data preprocessing operations in machine learning and how to carry them out using the Scikit-learn library.

You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you already are, don’t forget to subscribe if you’d like to get an email whenever I publish a new article.

Thank you for reading. Please let me know if you have any feedback.

