
An Introduction to Preprocessing Data for Machine Learning | by Rebecca Vickery | Aug, 2022



Understand how and when to apply common preprocessing techniques

Photo by Markus Krisetya on Unsplash

Machine learning algorithms learn patterns that exist in a set of features and use these patterns to predict a given target variable for new unknown data. The resulting trained model is essentially a mathematical function that successfully maps the values of X (the features) to the unknown value of y (the target).

As with all mathematical computations, machine learning algorithms can only work with data represented as numbers. Additionally, as each algorithm works under a variety of different constraints and assumptions, it is important that these numbers are represented in a way that reflects how the algorithm understands the data.

Imagine we have a feature representing the colour of a car with the values red, blue and grey. If we were to represent each colour as a number, say red = 1, blue = 2 and grey = 3, a machine learning algorithm, with no understanding of the concept of colour, may interpret grey as being more important than the other colours simply because it is represented by the largest number.

Preprocessing, in machine learning terms, refers to the transformation of raw features into data that a machine learning algorithm can understand and learn from.

As this example illustrates, preprocessing data for machine learning is something of an art form and requires careful consideration of the raw data in order to select the correct strategies and techniques.

In the following tutorial, I will give an introduction to common preprocessing steps with code examples predominantly using the Scikit-learn library. This article is not meant to be an exhaustive overview of all available preprocessing methods; rather, it is designed to give a good foundational knowledge of the most commonly used strategies. I have included links towards the end of the article to dive deeper into preprocessing should this article pique your interest.

For the purposes of this tutorial, I will be using the ‘autos’ dataset taken from openml.org. This dataset consists of a number of features relating to the characteristics of a car and a categorical target variable representing its associated insurance risk. The dataset can be downloaded and transformed into a pandas dataframe using the code below.
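One way to do this is with Scikit-learn’s fetch_openml helper. The sketch below is a minimal example; the dataset version and the exact target column returned are assumptions about how the ‘autos’ data is published on OpenML.

from sklearn.datasets import fetch_openml

# Download the 'autos' dataset from openml.org as pandas objects.
autos = fetch_openml(name="autos", version=1, as_frame=True)

df = autos.frame   # features and target combined in a single dataframe
X = autos.data     # the feature columns only
y = autos.target   # the insurance risk target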

Before embarking on preprocessing, it is important to understand the data types of each column. If we run df.dtypes we can see that the dataset has a mixture of both categorical and numerical data types.

Data types for the autos dataset. Image by Author

As mentioned at the beginning of the article, machine learning algorithms require numerical data. As a result, any categorical features must first be transformed into numerical features before being used for model training.

The most common technique used to treat categorical variables is known as one hot encoding, sometimes also referred to as dummy encoding. This technique creates a new column for each unique value contained in the feature. The new columns are binary features containing a 0 if the value is not present and a 1 if it is.

The Scikit-learn library provides a transformer, OneHotEncoder, that performs one hot encoding. The following code transforms the categorical features in the dataset into one hot encoded columns.
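A minimal sketch, assuming the feature dataframe X from the loading step above and that the categorical columns can be identified by their dtype:

from sklearn.preprocessing import OneHotEncoder

# Identify the categorical feature columns by dtype.
categorical_cols = X.select_dtypes(include=["object", "category"]).columns

# sparse_output=False returns a dense array for easier inspection
# (the argument is named sparse in Scikit-learn versions before 1.2).
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoded = encoder.fit_transform(X[categorical_cols])

# The names of the new binary columns can be recovered for inspection.
encoded_cols = encoder.get_feature_names_out(categorical_cols)

Each row of encoded now contains a 1 in the column corresponding to the value that was present and a 0 everywhere else.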

When transforming categorical columns using this method, special attention must be paid to the cardinality of each feature.

Cardinality refers to the number of unique values in a given column. If, for example, we have a feature with 50 unique values, performing one hot encoding would result in 50 new columns being created.

This could lead to two possible problems:

  1. A very large training set leading to significantly longer training times.
  2. A sparse training set that can lead to problems with overfitting.

We can get an understanding of the cardinality of the features in our dataset by running df[categorical_cols].nunique(), as shown below.
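As a small sketch, assuming categorical_cols is derived from the column dtypes of the feature dataframe:

# Count the unique values in each categorical column.
categorical_cols = X.select_dtypes(include=["object", "category"]).columns
print(df[categorical_cols].nunique().sort_values(ascending=False))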

We can see that the make column has a reasonably high cardinality.

Cardinality for categorical features. Image by Author

One way to treat high cardinality features is to aggregate the infrequently occurring values into a new category. The OneHotEncoder transformer provides two options for this.

The first option is to set the min_frequency argument to a chosen number. Any value occurring less frequently than this threshold will be grouped into an infrequent category.

Please note this option is currently only available with Scikit-learn versions 1.1.0 and above.

The second option is to set the max_categories argument to any number greater than 1. This limits the number of columns produced for that feature to that number or fewer.
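Both options might look something like the sketch below; the thresholds and the use of the make column are illustrative assumptions rather than recommended values.

from sklearn.preprocessing import OneHotEncoder

# Option 1: group any category seen fewer than 10 times into an
# 'infrequent' column (requires Scikit-learn >= 1.1).
encoder_min_freq = OneHotEncoder(min_frequency=10,
                                 handle_unknown="infrequent_if_exist")

# Option 2: cap the number of output columns per feature at 10,
# including the column for infrequent categories.
encoder_max_cats = OneHotEncoder(max_categories=10,
                                 handle_unknown="infrequent_if_exist")

encoded_make = encoder_min_freq.fit_transform(X[["make"]])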

The majority of real-world datasets will have some missing values. This could be for a number of reasons: the system generating the data could have errored, leading to missing observations, or a value may be missing because it is not relevant for a particular sample.

Whatever the reason, the majority of machine learning algorithms cannot interpret null values, and it is therefore necessary to treat these values in some way.

One option is to delete the rows that contain missing values. However, this is often not practical, as it can either reduce the size of the training dataset too much, or the application of the algorithm may require predictions to be generated for all rows.

If dropping the missing values is not an option it will be necessary to replace them with a sensible value. This is a technique known as imputation.

There are numerous strategies for imputing missing values, ranging from the very simple option of substituting missing values with the median, mean or most frequent value for the feature, to more complex approaches where machine learning algorithms are used to determine the optimal value for imputation.

Before selecting a strategy, we first need to understand if our dataset has any missing values. To do this, run the following.
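A minimal sketch, assuming missing entries are represented as NaN in the dataframe after loading:

# Percentage of missing values in each column, largest first.
missing_pct = df.isnull().mean() * 100
print(missing_pct[missing_pct > 0].sort_values(ascending=False).round(2))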

We can see from the results that 5 features have missing values and that the percentage of missing values is low (under 2%) for all except the ‘normalized-losses’ column.

Percentage of missing values. Image by Author.

Ordinarily, we would perform some exploratory analysis for each feature to inform the selection of the strategy for imputation. However, for the purposes of this tutorial, I will simply show an example of using a simple strategy and a more complex strategy.

The code shown below uses Scikit-learn’s SimpleImputer. As we have a mixture of categorical and numerical features with missing values, we will use two different simple strategies to impute them. For numerical features, we will substitute missing values with the mean for that column, and for categorical features, we will use the most frequently occurring value.
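A sketch of this, assuming the numerical and categorical columns can be separated by dtype:

import numpy as np
from sklearn.impute import SimpleImputer

numerical_cols = X.select_dtypes(include=np.number).columns
categorical_cols = X.select_dtypes(include=["object", "category"]).columns

# Replace missing numerical values with the column mean.
num_imputer = SimpleImputer(strategy="mean")
X_num = num_imputer.fit_transform(X[numerical_cols])

# Replace missing categorical values with the most frequent value.
cat_imputer = SimpleImputer(strategy="most_frequent")
X_cat = cat_imputer.fit_transform(X[categorical_cols])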

Simply filling all missing values with a summary statistic such as the mean may not result in optimal performance when the data is used for training. A more complex method for imputation is to use a machine learning algorithm to inform the value to impute.

A commonly used technique for this is the K-Nearest Neighbours algorithm. This model uses a distance metric, such as the Euclidean distance, to determine a specified set of nearest neighbours and imputes the mean value for those neighbours.
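Scikit-learn provides this as KNNImputer. The sketch below applies it to the numerical columns only, since the imputer works on numeric data; n_neighbors=5 is an illustrative choice.

import numpy as np
from sklearn.impute import KNNImputer

numerical_cols = X.select_dtypes(include=np.number).columns

# Each missing value is replaced with the mean of the corresponding
# feature from the 5 nearest rows, measured with a NaN-aware
# Euclidean distance.
knn_imputer = KNNImputer(n_neighbors=5)
X_num_knn = knn_imputer.fit_transform(X[numerical_cols])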

Numerical features in a training set can often have very different scales. For example, the ‘price’ feature has a minimum value of 5,118, whereas ‘compression-ratio’ has a minimum value of only 7 and a maximum value of 23. A machine learning model may incorrectly interpret the larger values in the ‘price’ feature as being more important than those within the ‘compression-ratio’ feature.

A further preprocessing step related to scaling is centering, where features are shifted so that they are distributed around a mean of zero. Many machine learning algorithms assume that features are centred and approximately normally distributed, and they will not behave as expected unless the features are represented in this way.

The Scikit-learn StandardScaler transformer performs both centering and scaling by removing the mean and scaling each feature to unit variance. The code below performs these steps.
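A minimal sketch, applied to the numerical feature columns:

import numpy as np
from sklearn.preprocessing import StandardScaler

numerical_cols = X.select_dtypes(include=np.number).columns

# Subtract each column's mean and divide by its standard deviation,
# giving features with zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X[numerical_cols])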

Binning or discretization is a technique used to convert continuous variables into groups or buckets of similar values. This technique is particularly useful when a variable has a large number of infrequently occurring values. When this is the case discretization can reduce the noise in a feature and reduce the risk of the model overfitting during training.

In our dataset, the price variable has a very large spread of values, and even the most frequently occurring price has only 2 occurrences. This is an example of a feature that would particularly benefit from binning.

Once discretization has been performed, the feature must be treated as categorical, so an additional preprocessing step, such as one hot encoding, must be applied.

The Scikit-learn library has a method called KBinsDiscretizer which performs both binning and categorical encoding in one step. The following code transforms the price feature into 6 bins and then performs one hot encoding on the new categorical variable. The result is a sparse matrix.
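A minimal sketch; the quantile binning strategy is an illustrative choice, and missing values in price are assumed to have been imputed beforehand.

from sklearn.preprocessing import KBinsDiscretizer

# Bin 'price' into 6 bins and one hot encode the bins in a single step.
binner = KBinsDiscretizer(n_bins=6, encode="onehot", strategy="quantile")
price_binned = binner.fit_transform(X[["price"]])  # sparse matrix output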

So far throughout this tutorial, we have performed all preprocessing steps independently. In a real machine learning application we will always need to apply preprocessing to both the training set and any test or validation datasets, and then apply it again to new data during inference. It is therefore most efficient to write code that can perform all of these transformations in one step.

Scikit-learn has a useful tool known as pipelines. Scikit-learn pipelines enable preprocessing steps to be chained together along with an estimator. The code below creates a pipeline that performs all of the preprocessing steps outlined in this tutorial and also fits a Random Forest classifier.
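A sketch of such a pipeline is shown below. The column selection, imputation strategies and Random Forest hyperparameters are illustrative assumptions rather than tuned choices.

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_cols = X.select_dtypes(include=np.number).columns
categorical_cols = X.select_dtypes(include=["object", "category"]).columns

# Impute then scale the numerical features.
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Impute then one hot encode the categorical features.
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Apply each sub-pipeline to its own set of columns.
preprocessor = ColumnTransformer([
    ("numeric", numeric_pipeline, numerical_cols),
    ("categorical", categorical_pipeline, categorical_cols),
])

# Chain the preprocessing with the estimator.
model = Pipeline([
    ("preprocessing", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=200, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)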

The pipeline can be reused to preprocess the test dataset and generate predictions, as shown below.
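Continuing from the pipeline sketch above, this might look like the following; the accuracy metric is simply an illustrative choice for a classification target.

from sklearn.metrics import accuracy_score

# The fitted pipeline applies the preprocessing learned on the training
# data before generating predictions for the held-out test set.
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))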

Machine learning algorithms do not learn the same way that humans do. An algorithm is incapable of understanding the relationship that the number of doors has to a car in the same way that you and I do. In order for the machine to learn, the data has to be transformed into a representation that fits how the algorithm learns.

In this article, we have covered the following preprocessing techniques. Here is a brief summary of the methods and the reasons why they are useful.

  1. Encoding categorical features: The majority of machine learning algorithms can only work with numerical data therefore categorical variables must be converted to a numerical representation.
  2. Imputing missing values: Most machine learning algorithms cannot interpret null values. Imputing replaces missing values with a sensible replacement.
  3. Feature scaling: Machine learning algorithms only understand numerical relationships. Features with varying scales may therefore be incorrectly interpreted. Scaling ensures that the values within continuous features are all on the same scale.
  4. Binning: Continuous variables with many infrequently occurring values can contain a lot of noise which might lead to overfitting during training. Binning aggregates these values into buckets or groups of similar values resulting in a new categorical feature.

This tutorial has given an introductory overview of the most common preprocessing techniques applied to data for machine learning. The methods described here have many different options and there are more possible preprocessing steps.

Once you have an overall understanding of the techniques described here, and if you would like to dive deeper, the book ‘Hands-On Machine Learning with Scikit-Learn and TensorFlow’ is a great resource. This book is available as a free-to-read PDF via this link.

For more articles on Scikit-learn please see my earlier posts below.

Thanks for reading!

Citation

Autos dataset: Jeffrey C. Schlimmer. UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/datasets/Automobile]. Used under the Open Science License.

