
An Introduction to Preprocessing Data for Machine Learning | by Rebecca Vickery | Aug, 2022



Understand how and when to apply common preprocessing techniques

Photo by Markus Krisetya on Unsplash

Machine learning algorithms learn patterns that exist in a set of features and use these patterns to predict a given target variable for new unknown data. The resulting trained model is essentially a mathematical function that successfully maps the values of X (the features) to the unknown value of y (the target).

As with all mathematical computations, machine learning algorithms can only work with data represented as numbers. Additionally, as each algorithm works under a variety of different constraints and assumptions, it is important that these numbers are represented in a way that reflects how the algorithm understands the data.

Imagine we have a feature representing the colour of a car with the values red, blue and grey. If we were to represent each colour as a number, say red = 1, blue = 2 and grey = 3, a machine learning algorithm, with no understanding of the concept of colour, may interpret grey as being more important than the other colours simply because it is represented by the largest number.

Preprocessing, in machine learning terms, refers to the transformation of raw features into data that a machine learning algorithm can understand and learn from.

As this example illustrates, preprocessing data for machine learning is something of an art form and requires careful consideration of the raw data in order to select the correct strategies and techniques.

In the following tutorial, I will give an introduction to common preprocessing steps with code examples predominantly using the Scikit-learn library. This article is not meant to be an exhaustive overview of all available preprocessing methods; rather, it is designed to give a good foundational knowledge of the most commonly used strategies. I have included links towards the end of the article to dive deeper into preprocessing should this article pique your interest.

For the purposes of this tutorial, I will be using the ‘autos’ dataset taken from openml.org. This dataset consists of a number of features relating to the characteristics of a car and a categorical target variable representing its associated insurance risk. The dataset can be downloaded and transformed into a pandas dataframe using the code below.
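One way to do this is with Scikit-learn’s fetch_openml helper. The sketch below is a minimal example; the dataset version and the exact target column returned are assumptions about how the ‘autos’ data is published on OpenML.

from sklearn.datasets import fetch_openml

# Download the 'autos' dataset from openml.org as pandas objects.
autos = fetch_openml(name="autos", version=1, as_frame=True)

df = autos.frame   # features and target combined in a single dataframe
X = autos.data     # the feature columns only
y = autos.target   # the insurance risk target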

Before embarking on preprocessing, it is important to understand the data types of each column. If we run df.dtypes we can see that the dataset has a mixture of both categorical and numerical data types.

Data types for the autos dataset. Image by Author

As mentioned at the beginning of the article, machine learning algorithms require numerical data. As a result, any categorical features must first be transformed into numerical features before being used for model training.

The most common technique used to treat categorical variables is known as one hot encoding, sometimes also referred to as dummy encoding. This technique creates a new column for each unique value contained in the feature. The new columns are binary features containing a 0 if the value is not present and a 1 if it is.

The Scikit-learn library provides a transformer, OneHotEncoder, that performs one hot encoding. The following code transforms the categorical features in the dataset into one hot encoded columns.
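A minimal sketch, assuming the feature dataframe X from the loading step above and that the categorical columns can be identified by their dtype:

from sklearn.preprocessing import OneHotEncoder

# Identify the categorical feature columns by dtype.
categorical_cols = X.select_dtypes(include=["object", "category"]).columns

# sparse_output=False returns a dense array for easier inspection
# (the argument is named sparse in Scikit-learn versions before 1.2).
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoded = encoder.fit_transform(X[categorical_cols])

# The names of the new binary columns can be recovered for inspection.
encoded_cols = encoder.get_feature_names_out(categorical_cols)

Each row of encoded now contains a 1 in the column corresponding to the value that was present and a 0 everywhere else.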

When transforming categorical columns using this method, special attention must be paid to the cardinality of each feature.

Cardinality refers to the number of unique values in a given column. If, for example, we have a feature with 50 unique values, performing one hot encoding would result in 50 new columns being created.

This could lead to two possible problems:

  1. A very large training set leading to significantly longer training times.
  2. A sparse training set that can lead to problems with overfitting.

We can get an understanding of the cardinality of the features in our dataset by running df[categorical_cols].nunique(), as shown below.
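As a small sketch, assuming categorical_cols is derived from the column dtypes of the feature dataframe:

# Count the unique values in each categorical column.
categorical_cols = X.select_dtypes(include=["object", "category"]).columns
print(df[categorical_cols].nunique().sort_values(ascending=False))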

We can see that the make column has a reasonably high cardinality.

Cardinality for categorical features. Image by Author

One way to treat high cardinality features is to aggregate the infrequently occurring values into a new category. The OneHotEncoder transformer provides two options for this.

The first option is to set the min_frequency argument to a chosen number. Any value occurring less frequently than this threshold will be grouped into an infrequent category.

Please note this option is currently only available with Scikit-learn versions 1.1.0 and above.

The second option is to set the max_categories argument to any number greater than 1. This limits the number of columns produced for that feature to that number or fewer.
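Both options might look something like the sketch below; the thresholds and the use of the make column are illustrative assumptions rather than recommended values.

from sklearn.preprocessing import OneHotEncoder

# Option 1: group any category seen fewer than 10 times into an
# 'infrequent' column (requires Scikit-learn >= 1.1).
encoder_min_freq = OneHotEncoder(min_frequency=10,
                                 handle_unknown="infrequent_if_exist")

# Option 2: cap the number of output columns per feature at 10,
# including the column for infrequent categories.
encoder_max_cats = OneHotEncoder(max_categories=10,
                                 handle_unknown="infrequent_if_exist")

encoded_make = encoder_min_freq.fit_transform(X[["make"]])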

The majority of real-world datasets will have some missing values. This could be for a number of reasons: the system generating the data could have errored, leading to missing observations, or a value may be missing because it is not relevant for a particular sample.

Whatever the reason, the majority of machine learning algorithms cannot interpret null values, and it is therefore necessary to treat these values in some way.

One option is to delete the rows that contain missing values. However, this is often not practical, as it can either reduce the size of the training dataset too much, or the application of the algorithm may require predictions to be generated for all rows.

If dropping the missing values is not an option it will be necessary to replace them with a sensible value. This is a technique known as imputation.

There are numerous strategies for imputing missing values, ranging from the very simple option of substituting missing values with the median, mean or most frequent value for the feature, to more complex approaches where machine learning algorithms are used to determine the optimal value for imputation.

Before selecting a strategy, we first need to understand if our dataset has any missing values. To do this, run the following.
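A minimal sketch, assuming missing entries are represented as NaN in the dataframe after loading:

# Percentage of missing values in each column, largest first.
missing_pct = df.isnull().mean() * 100
print(missing_pct[missing_pct > 0].sort_values(ascending=False).round(2))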

We can see from the results that 5 features have missing values and that the percentage of missing values is low (under 2%) for all except the ‘normalized-losses’ column.

Percentage of missing values. Image by Author.

Ordinarily, we would perform some exploratory analysis for each feature to inform the selection of the strategy for imputation. However, for the purposes of this tutorial, I will simply show an example of using a simple strategy and a more complex strategy.

The code shown below uses Scikit-learn’s SimpleImputer. As we have a mixture of categorical and numerical features with missing values, we will use two different simple strategies to impute them. For numerical features, we will substitute missing values with the mean for that column, and for categorical features, we will use the most frequently occurring value.
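A sketch of this, assuming the numerical and categorical columns can be separated by dtype:

import numpy as np
from sklearn.impute import SimpleImputer

numerical_cols = X.select_dtypes(include=np.number).columns
categorical_cols = X.select_dtypes(include=["object", "category"]).columns

# Replace missing numerical values with the column mean.
num_imputer = SimpleImputer(strategy="mean")
X_num = num_imputer.fit_transform(X[numerical_cols])

# Replace missing categorical values with the most frequent value.
cat_imputer = SimpleImputer(strategy="most_frequent")
X_cat = cat_imputer.fit_transform(X[categorical_cols])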

Simply filling all missing values with a summary statistic such as the mean may not result in optimal performance when the data is used for training. A more complex method for imputation is to use a machine learning algorithm to inform the value to impute.

A commonly used technique for this is the K-Nearest Neighbours algorithm. This model uses a distance metric, such as the Euclidean distance, to determine a specified set of nearest neighbours and imputes the mean value for those neighbours.
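Scikit-learn provides this as KNNImputer. The sketch below applies it to the numerical columns only, since the imputer works on numeric data; n_neighbors=5 is an illustrative choice.

import numpy as np
from sklearn.impute import KNNImputer

numerical_cols = X.select_dtypes(include=np.number).columns

# Each missing value is replaced with the mean of the corresponding
# feature from the 5 nearest rows, measured with a NaN-aware
# Euclidean distance.
knn_imputer = KNNImputer(n_neighbors=5)
X_num_knn = knn_imputer.fit_transform(X[numerical_cols])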

Numerical features in a training set can often have very different scales. For example, the ‘price’ feature has a minimum value of 5,118, whereas ‘compression-ratio’ has a minimum value of only 7 and a maximum value of 23. A machine learning model may incorrectly interpret the larger values in the ‘price’ feature as being more important than those within the ‘compression-ratio’ feature.

A further preprocessing step related to scaling is centering, where features are shifted so that they are distributed around a mean of zero. Many machine learning algorithms assume that features are centred and approximately normally distributed, and they will not behave as expected unless the features are represented in this way.

The Scikit-learn StandardScaler transformer performs both centering and scaling by removing the mean and scaling each feature to unit variance. The code below performs these steps.
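A minimal sketch, applied to the numerical feature columns:

import numpy as np
from sklearn.preprocessing import StandardScaler

numerical_cols = X.select_dtypes(include=np.number).columns

# Subtract each column's mean and divide by its standard deviation,
# giving features with zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X[numerical_cols])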

Binning or discretization is a technique used to convert continuous variables into groups or buckets of similar values. This technique is particularly useful when a variable has a large number of infrequently occurring values. When this is the case discretization can reduce the noise in a feature and reduce the risk of the model overfitting during training.

In our dataset, the price variable has a very large spread of values, and even the most frequently occurring price has only 2 occurrences. This is an example of a feature that would particularly benefit from binning.

Once discretization has been performed, the feature must be treated as categorical, so an additional preprocessing step, such as one hot encoding, must be applied.

The Scikit-learn library has a method called KBinsDiscretizer which performs both binning and categorical encoding in one step. The following code transforms the price feature into 6 bins and then performs one hot encoding on the new categorical variable. The result is a sparse matrix.
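A minimal sketch; the quantile binning strategy is an illustrative choice, and missing values in price are assumed to have been imputed beforehand.

from sklearn.preprocessing import KBinsDiscretizer

# Bin 'price' into 6 bins and one hot encode the bins in a single step.
binner = KBinsDiscretizer(n_bins=6, encode="onehot", strategy="quantile")
price_binned = binner.fit_transform(X[["price"]])  # sparse matrix output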

So far throughout this tutorial, we have performed all preprocessing steps independently. In a real machine learning application we will always need to apply preprocessing to both the training set and any test or validation datasets, and then apply it again to new data during inference. It is therefore most efficient to write code that can perform all of these transformations in one step.

Scikit-learn has a useful tool known as pipelines. Scikit-learn pipelines enable preprocessing steps to be chained together along with an estimator. The code below creates a pipeline that performs all of the preprocessing steps outlined in this tutorial and also fits a Random Forest classifier.
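A sketch of such a pipeline is shown below. The column selection, imputation strategies and Random Forest hyperparameters are illustrative assumptions rather than tuned choices.

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_cols = X.select_dtypes(include=np.number).columns
categorical_cols = X.select_dtypes(include=["object", "category"]).columns

# Impute then scale the numerical features.
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Impute then one hot encode the categorical features.
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Apply each sub-pipeline to its own set of columns.
preprocessor = ColumnTransformer([
    ("numeric", numeric_pipeline, numerical_cols),
    ("categorical", categorical_pipeline, categorical_cols),
])

# Chain the preprocessing with the estimator.
model = Pipeline([
    ("preprocessing", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=200, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)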

The pipeline can be reused to preprocess the test dataset and generate predictions, as shown below.
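Continuing from the pipeline sketch above, this might look like the following; the accuracy metric is simply an illustrative choice for a classification target.

from sklearn.metrics import accuracy_score

# The fitted pipeline applies the preprocessing learned on the training
# data before generating predictions for the held-out test set.
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))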

Machine learning algorithms do not learn the same way that humans do. An algorithm is incapable of understanding the relationship that the number of doors has to a car in the same way that you and I do. In order for the machine to learn, the data has to be transformed into a representation that fits how the algorithm learns.

In this article, we have covered the following preprocessing techniques. Here is a brief summary of the methods and the reasons why they are useful.

  1. Encoding categorical features: The majority of machine learning algorithms can only work with numerical data therefore categorical variables must be converted to a numerical representation.
  2. Imputing missing values: Most machine learning algorithms cannot interpret null values. Imputing replaces missing values with a sensible replacement.
  3. Feature scaling: Machine learning algorithms only understand numerical relationships. Features with varying scales may therefore be incorrectly interpreted. Scaling ensures that the values within continuous features are all on the same scale.
  4. Binning: Continuous variables with many infrequently occurring values can contain a lot of noise which might lead to overfitting during training. Binning aggregates these values into buckets or groups of similar values resulting in a new categorical feature.

This tutorial has given an introductory overview of the most common preprocessing techniques applied to data for machine learning. The methods described here have many different options and there are more possible preprocessing steps.

Once you have an overall understanding of the techniques described here, and if you would like to dive deeper, the book ‘Hands-On Machine Learning with Scikit-Learn and TensorFlow’ is a great resource. This book is available as a free-to-read PDF via this link.

For more articles on Scikit-learn please see my earlier posts below.

Thanks for reading!

Citation

Autos dataset: Jeffrey C. Schlimmer. UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/datasets/Automobile]. Used under the Open Science License.

