Improve Your Data Preprocessing with ColumnTransformer and Pipelines

By João Pedro | June 2022


Create highly customized and organized preprocessing pipelines with Sklearn’s ColumnTransformer

Photo by Simon Kadula on Unsplash

Data preprocessing is probably one of the most time-consuming steps in a machine learning/data science pipeline.

In most realistic scenarios, the available raw data is unformatted, dirty, and unsuitable for machine learning models or data analysis, requiring several steps of cleaning and feature engineering.

In the context of structured data (tables), a developer needs to deal with all sorts of problems, like missing values, denormalized data, unformatted strings, duplicated rows, etc.

They also need to improve the data representation by normalizing numerical features, encoding categorical features, creating more meaningful columns, and many other steps to increase ML model performance or improve the quality of dashboards.

This need to apply dataset-specific rules to data frames can easily lead to spaghetti code, which is hard to maintain and update, and error-prone.

Data Preprocessing. Image by Author. Icons by Freepik.

Sklearn’s Pipelines, combined with ColumnTransformer, are an easy way to apply transformation rules in a standardized manner, resulting in cleaner, more organized code.

If you’re already familiar with the ColumnTransformer module from sklearn, you can skip this section.

When dealing with tabular data, it’s common to perform several cleaning steps on different data columns.

For example, a numerical feature “price” may require an operation to replace its NULL values with the data mean. As you probably already know, Sklearn provides a transformer to do exactly this: the SimpleImputer.

A ColumnTransformer allows you to apply a Sklearn transformer to only a specific group of columns.

Knowing ColumnTransformer. Image by Author. Icons by Freepik.

Let’s see how this works in code.
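A minimal sketch of what that code might look like. The DataFrame and its column names (“Price”, “A”, “B”) are hypothetical, chosen to match the example shown in the images:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical DataFrame with missing values
df = pd.DataFrame({
    "Price": [10.0, np.nan, 30.0, 20.0],
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [5.0, np.nan, 7.0, 8.0],
})

# Replace the NULL values in "Price" with the column mean,
# leaving every other column untouched
transformer = ColumnTransformer(
    transformers=[
        ("price_imputer", SimpleImputer(strategy="mean"), ["Price"]),
    ],
    remainder="passthrough",
)

print(transformer.fit_transform(df))
```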

The ColumnTransformer object receives a list of tuples, each composed of a transformer name (of your choice), the transformer itself, and the columns on which to apply the transformation. The remainder argument specifies what should be done with all other columns.

The image below shows the code output.

DataFrame column transformation (1). Image by Author.

The replacement operation only occurred in the specified column, while the remaining columns stayed untouched (as specified by remainder="passthrough"). The pandas DataFrame was also replaced by a NumPy array, as this is the default output of Sklearn’s transformers.

Let’s see a more complex example.
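A sketch of that more complex case, reusing the hypothetical df above. Note that remainder also accepts a transformer, which is then applied to all remaining columns:

```python
transformer = ColumnTransformer(
    transformers=[
        ("mean_imputer", SimpleImputer(strategy="mean"), ["Price"]),
        ("median_imputer", SimpleImputer(strategy="median"), ["A"]),
    ],
    # remainder can be a transformer itself: fill all other
    # columns' NULL values with the constant -1
    remainder=SimpleImputer(strategy="constant", fill_value=-1),
)

print(transformer.fit_transform(df))
```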

In the case above, the “Price” column’s null values are replaced with the mean, the “A” column’s null values with the median, and all other columns’ null values with the constant -1. The image below shows the result.

DataFrame column transformation (2). Image by Author.

Hint: if you are working in a Jupyter Notebook, you can display estimators as interactive diagrams very easily by setting display="diagram" in Sklearn’s configuration.

The code below makes this configuration, and the image shows the visualization of the previous example.
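The configuration itself is a one-liner; a minimal sketch:

```python
from sklearn import set_config

# Render estimators as interactive HTML diagrams in notebooks
set_config(display="diagram")

transformer  # evaluating the estimator now shows the diagram
```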

Hopefully, you can already see the power of the ColumnTransformer class: it is a simple way to perform transformations on many columns at once, with all the logic encapsulated in a single object.

The ColumnTransformer is quite useful, but it is not enough. In many cases, a column needs to be processed in multiple steps.

For example, the numerical feature “price” may require an operation to replace the NULL values with the data mean, a log transformation to distribute the data more symmetrically, and standardization to bring its values closer to the interval [-1, 1].

Unfortunately, there is no single transformer in sklearn that does all this work, and that’s where Pipelines come in.

With pipelines, we can chain multiple transformers to create a complex process. Because a pipeline object is equivalent to a simple transformer (it exposes the same .fit() and .transform() methods), it can be inserted into a ColumnTransformer object.

You can also put a ColumnTransformer inside a Pipeline, because it too is a simple transformer object, and this nesting can go as deep as you need.

This is one of the beauties of Sklearn’s architecture: all transformer modules share the same interface, so they can easily work together.

Let’s see how this works in code.

The pipeline object has a quite intuitive interface. It accepts a list of tuples, each one representing a transformer, with a name of your choice and the transformer object itself, and it applies the transformations in the specified order.
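As a sketch, here is the “price” treatment described earlier, written as a Pipeline and plugged into a ColumnTransformer. np.log1p stands in for the log transformation (it also handles zeros gracefully):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Chain: mean imputation -> log transform -> standardization
price_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("log", FunctionTransformer(np.log1p)),
    ("scaler", StandardScaler()),
])

# A Pipeline behaves like any other transformer,
# so it can be used inside a ColumnTransformer
transformer = ColumnTransformer(
    transformers=[
        ("price_processing", price_pipeline, ["Price"]),
    ],
    remainder="passthrough",
)
```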

The image below shows the transformer created previously.

Column Transformer with Pipelines. Image by Author.

Let’s quickly explore how this technique can help in a “real” case by building a classifier pipeline for the Wine Classification Dataset available in Sklearn.

This dataset contains 13 numerical features describing the wines’ chemical properties, with samples classified into 3 categories. The code below imports the data.

Importing data
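A minimal way to load it; as_frame=True returns pandas objects, preserving the column names:

```python
from sklearn.datasets import load_wine

# Load the wine dataset as a pandas DataFrame/Series pair
X, y = load_wine(return_X_y=True, as_frame=True)
print(X.shape)  # (178, 13)
```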

Plotting the features’ distributions, we can choose which treatment each one needs.

Features’ distribution. Image by Author.

Let’s suppose that the following treatments were chosen:

  • malic_acid: min-max scaling
  • magnesium: log transformation, then discretization into 4 bins
  • ash: drop the feature
  • nonflavanoid_phenols: no treatment (passthrough)
  • all other features: standard scaling

After that, the data needs to pass through a PCA to reduce dimensionality and then reach a RandomForestClassifier.

The code below shows how this pipeline can be built.
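A sketch of the full pipeline under the treatments listed above. The strings "drop" and "passthrough" are accepted by ColumnTransformer in place of a transformer, and the PCA component count (5 here) is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    KBinsDiscretizer,
    MinMaxScaler,
    StandardScaler,
)

preprocessing = ColumnTransformer(
    transformers=[
        ("minmax", MinMaxScaler(), ["malic_acid"]),
        ("log_and_bin", Pipeline(steps=[
            ("log", FunctionTransformer(np.log1p)),
            ("bins", KBinsDiscretizer(n_bins=4, encode="ordinal")),
        ]), ["magnesium"]),
        ("drop_ash", "drop", ["ash"]),
        ("keep_as_is", "passthrough", ["nonflavanoid_phenols"]),
    ],
    remainder=StandardScaler(),  # standard-scale every other feature
)

model = Pipeline(steps=[
    ("preprocessing", preprocessing),
    ("pca", PCA(n_components=5)),
    ("classifier", RandomForestClassifier()),
])
```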

And the pipeline looks like this:

We can then use this object as a normal classifier:
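For example (the split parameters are arbitrary):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model.fit(X_train, y_train)         # fits preprocessing, PCA and classifier
print(model.score(X_test, y_test))  # accuracy on the test set
```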

The main goal of this exercise is to show that it is possible to encapsulate all this complex logic into one single object (estimator). The final object is compatible with all other sklearn modules, which can make life much easier.

For example, it is possible to perform a grid search that tunes hyperparameters from the preprocessing steps all the way up to the classifier itself. Because of sklearn’s architecture, it is also quite easy to create new modules (transformers, classifiers, regressors, etc.), so if you need a very specific transformation, you can write a new class and make it compatible with pipelines. Estimators are also easily serializable, so the whole pipeline can be stored and reloaded with tools like pickle or joblib.
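To illustrate the grid search point: hyperparameters of nested steps are addressed with the step__parameter naming convention, reaching from the preprocessing all the way down to the classifier. The grid values below are arbitrary:

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    "preprocessing__log_and_bin__bins__n_bins": [3, 4, 5],  # a preprocessing step
    "pca__n_components": [3, 5, 8],
    "classifier__n_estimators": [100, 300],
}

search = GridSearchCV(model, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```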

Finally, the code is clean and standardized, which makes it easier to update and maintain.

Preprocessing data is a crucial step in any data science process. However, as the specificities of each dataset surface, the preprocessing step can become very complex and full of domain-specific rules.

In this context, maintaining all the transformations in a single object can be very useful, as one single instance can be easily moved, stored, and updated.

This post explored how Sklearn’s ColumnTransformer class can improve code quality and organization by encapsulating all the preprocessing logic in a single place while maintaining a high degree of standardization between objects.

Even though this post focuses specifically on the ColumnTransformer class, the logic extends to other sklearn modules, as the main goal here is to show how to write more robust code. Sklearn’s architecture is very consistent, and understanding its more advanced capabilities can make your life much easier. I hope this short post helps you with that.

Thanks for reading! 🙂


