
How to Quickly Design Advanced Sklearn Pipelines



Tutorial

Photo by Clint Patterson on Unsplash

This tutorial will teach you how and when to use all the advanced tools from the Sklearn Pipelines ecosystem to build custom, scalable, and modular machine learning models that can easily be deployed in production.

There is plenty of content about the individual components of the Sklearn Pipelines toolbox. I am writing this tutorial because it is valuable to see how all those components work together in a single, more complex system.

I will use a concrete example and show you how and when to use the following components: TransformerMixin, BaseEstimator, FunctionTransformer, ColumnTransformer, FeatureUnion, and TransformedTargetRegressor.

Knowing how to use them individually is easy. That is why this tutorial emphasizes when to use them and how to combine them within a complex system.

Goal

We will build a forecasting model to predict the following year’s global mean wheat yield.

The main focus will be on the advanced concepts of the Sklearn Pipeline components. Therefore, we won’t spend much time on other data science principles.

  • Dataset
  • Summary of Pipeline Fundamentals
  • Configuration
  • Data Preparation
  • Building the Pipeline
  • Global Pipeline. Let’s Put Things Together.
  • How to Use the Global Pipeline

NOTE: If you are interested only in the advanced topics of Sklearn Pipelines, skip directly to Building the Pipeline.

We are using a publicly available dataset [1] provided by Pangaea, which tracks global historical yearly yields for various plants from 1981 to 2016.

We found the dataset using this GitHub Repository. Check it out for more awesome publicly available datasets.

The dataset provides multiple types of crops, but for this example, we will use only wheat.

Graph of the yearly global wheat yield and the number of recorded locations.
Global yearly wheat yield and the number of locations provided within the Pangaea dataset [Image by the Author].

Here is a short reminder of the main principles used by the Sklearn Pipelines ecosystem.

Everything revolves around the Pipeline object.

A Pipeline contains multiple Estimators.

An Estimator can have the following properties:

  • learns from the data → using the fit() method
  • transforms the data → using the transform() method. Also known as a Transformer (no, not the robots, it is a subclass of an Estimator).
  • predicts from new data → using the predict() method. Also known as a Predictor.

NOTE 1: We can have Transformers whose fit() method does nothing. Such classes learn no parameters and behave like pure functions. They are usually helpful for feature engineering (e.g., multiplying two columns together requires learning nothing in fit()).

NOTE 2: The Pipeline object exposes the methods of its last Estimator (e.g., predict() is available only if the final step is a Predictor).

NOTE 3: If you want to add a model to your Pipeline, it must be the last element. I will show you a trick on how to perform postprocessing operations on the model’s predictions using TransformedTargetRegressor.

Render Pipelines as Diagrams

By setting the Sklearn configuration’s display option to “diagram,” we can quickly visualize any Pipeline as a diagram.
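For reference, a minimal way to enable it (in recent scikit-learn versions, “diagram” is already the default):

# Enable the HTML diagram view for pipelines and estimators.
import sklearn

sklearn.set_config(display="diagram")

# From now on, evaluating a Pipeline object in a notebook cell renders it
# as an interactive diagram instead of a plain-text representation.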

Example of a visualization of a Sklearn Pipeline diagram.
Example of a visualization of a Sklearn Pipeline diagram [Image by the Author].

Constants

Below we will define a few constants that we will use across the code.

Pick Ground Truth

In time series forecasting, the supervision comes from the series itself: we predict the data point at Tₙ using information from the past. Therefore, in the beginning, we take both the features and the labels as the same time series. During the preprocessing steps, the features become the past data points, and the label becomes the data point we want to predict.

This is not a time series forecasting post. Therefore, don’t overthink this step. Also, you don’t have to read the code line by line.

Just focus on the big picture and on the Pipelines steps. It is enough to understand the end goal of how and when to use specific Sklearn components.

X, y = yields.copy(), yields.copy()

Split Data: Train & Test

Now we will split the data into train and test sets. You will see how easy it is to use your model on new data when Sklearn Pipelines are used correctly.
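As a minimal sketch, a year-based split could look like the following; the cut-off year is an assumption, and the actual split also keeps a few earlier feature years in the test set so the test targets still have a full look-back window:

SPLIT_YEAR = 2012  # hypothetical cut-off year

X_train, X_test = X[X.index <= SPLIT_YEAR], X[X.index > SPLIT_YEAR]
y_train, y_test = y[y.index <= SPLIT_YEAR], y[y.index > SPLIT_YEAR]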

Let’s see how the train-test splits look:

(Starting year of the split, ending year of the split, number of years within the split)

X_train.index.min(), X_train.index.max(), X_train.index.max() - X_train.index.min() + 1
(1981, 2012, 32)

y_train.index.min(), y_train.index.max(), y_train.index.max() - y_train.index.min() + 1
(1986, 2012, 27)

X_test.index.min(), X_test.index.max(), X_test.index.max() - X_test.index.min() + 1
(2007, 2016, 10)

y_test.index.min(), y_test.index.max(), y_test.index.max() - y_test.index.min() + 1
(2012, 2016, 5)

Enough talking. Let’s start implementing the actual Pipeline.

The global Pipeline is divided into the following subcomponents:

  1. stationarity pipeline (used both on the features and targets)
  2. feature engineering pipeline
  3. regressor pipeline
  4. target pipeline

1. Stationarity Pipeline

The Pipeline is used on a time series to make it stationary. More concretely, it will remove periodicity and standardize the mean and variance across time. Here you can read more about this.

In this step, we will show you how to use the following: BaseEstimator + TransformerMixin, FunctionTransformer (together with functools.partial), and make_pipeline().

We can build a pipeline estimator in two ways:

1️⃣ By inheriting from BaseEstimator + TransformerMixin. Using this approach, the pipeline unit can learn from the data, transform it, and reverse the transformation. Here is a short description of the supported interface:

  • fit(X, y) — used to learn from the data
  • transform(X) — used to transform the data
  • fit_transform(X, y) — learn from and transform the data in one call. This method is inherited from TransformerMixin
  • inverse_transform(X) — used to reverse the transformation

Note 1: The following should always hold: x == inverse_transform(transform(x)), within a small numerical tolerance.

Note 2: The targets (i.e., y) are passed only to the fit() method; transform() and inverse_transform() receive only the features (i.e., X). We can’t make the Pipeline pass any other arguments to these methods.

2️⃣ Writing a pure function that is ultimately wrapped by FunctionTransformer.

This approach is practical when the transformation does not need any state (no fit() method) and does not need to be reversed (no inverse_transform() method). In other words, it is useful when you only have to implement the transform() step.

This must be a pure function (for input A, you always get output B → it doesn’t depend on the external context). Otherwise, you will encounter strange behavior when your Pipeline gets bigger.

Note: As we implement only a transformation, the function receives only the features (i.e., X). We can’t access any other arguments.

Note that we chose to write a class for one of the transformations even though it didn’t need to implement the fit() method (e.g., LogTransformer), because it is good software practice to pack a transformation and its inverse into the same structure.
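As an illustration, here is a minimal sketch of what such a class could look like (the actual implementation may differ):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class LogTransformer(BaseEstimator, TransformerMixin):
    """Stateless transformer that still packs transform() and inverse_transform() together."""

    def fit(self, X, y=None):
        # Nothing to learn, but fit() must exist to respect the Estimator interface.
        return self

    def transform(self, X):
        return np.log1p(X)

    def inverse_transform(self, X):
        return np.expm1(X)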

We leveraged the partial function from the Python functools module to configure the transformations.

As stated earlier, the function given to FunctionTransformer should take a single input and return a single output. Since it is a plain function, there is no constructor to configure it with. Using partial, we can pre-set a subset of the function’s parameters: partial wraps the initial function and returns a new one that, when called, expects only the parameters that were not already fixed.
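Here is a minimal sketch of the idea; the difference_series helper and its periods parameter are illustrative assumptions:

from functools import partial

import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def difference_series(df: pd.DataFrame, periods: int) -> pd.DataFrame:
    # Pure function: difference the series to help remove trend/periodicity.
    return df.diff(periods)


# partial() fixes `periods`, so the resulting callable only needs the data.
difference_transformer = FunctionTransformer(partial(difference_series, periods=1))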

Finally, let’s build the Pipeline. We have used the make_pipeline() utility function that automatically names every pipeline step.
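Reusing the illustrative transformers sketched above, the assembly could look roughly like this:

from sklearn.pipeline import make_pipeline

stationarity_pipeline = make_pipeline(
    LogTransformer(),        # stabilize the variance
    difference_transformer,  # remove trend / periodicity
)
# make_pipeline() generates the step names ("logtransformer", "functiontransformer") automatically.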

Diagram of the stationary_pipeline.
Diagram of the stationary_pipeline [Image by the Author].

Here is how you can quickly check that your transformation and its inverse transformation are working correctly:

Using np.allclose(), we check equality while allowing a small numerical error.
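For example, on the LogTransformer sketched earlier (the same check works for any pipeline whose steps all implement inverse_transform()):

import numpy as np

log_transformer = LogTransformer()
transformed = log_transformer.fit_transform(X_train.copy())
recovered = log_transformer.inverse_transform(transformed)

assert np.allclose(X_train, recovered)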

A graph of the time series after it is processed by the stationarity pipeline.
Example of how the time series looks after being processed by the stationarity Pipeline [Image by the Author].

2. Feature Engineering Pipeline

Now let’s do some feature engineering.

Note that you don’t have to understand the implementation of every function. Focus on the structure.

For most of the transformations, we used pure functions + FunctionTransformer. We used this approach because we are not interested in implementing fit() or inverse_transform(). Therefore, using this method, our code is slimmer and cleaner.

Only DropRowsTransformer is implemented with a class because it needs both fit() and inverse_transform().

From my experience, when implementing data & feature engineering pipelines, I usually find FunctionTransformers more useful and cleaner. I don’t think it is good practice to inherit classes and leave most of their methods empty.

Now let’s get to the sweet part, where we will use make_column_transformer and make_union.

Using make_column_transformer, we can run different operations/pipelines on subsets of the columns. In this concrete example, we ran different transformations on the “mean_yield” and the “locations” columns. Another sweet thing about this component is that it can run the operations for every set of columns in parallel (via its n_jobs parameter), so the features for “mean_yield” and “locations” can be computed simultaneously.

Using make_union, we can compute multiple features from the same input, and these can also run in parallel. In this example, for “mean_yield,” we calculate four different features:

  • past observations
  • moving average
  • moving standard deviation
  • moving median

NOTE 1: The same principle applies to the “locations” feature.

NOTE 2: I recommend using the make_* utility functions (make_pipeline, make_union, make_column_transformer). They will make your life easier.

In the snippet below, you can see those components in action. Following the Sklearn Pipelines paradigm, look how nicely we reused most of the functionality across the Pipeline.
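What follows is only a rough sketch of that structure; the rolling-window helpers (past_observations, moving_average, moving_std, moving_median) and the window sizes are illustrative assumptions, not the exact code from the repository:

from functools import partial

from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer


# Hypothetical pure rolling-window helpers.
def past_observations(df, window):
    return df.shift(window)

def moving_average(df, window):
    return df.rolling(window).mean()

def moving_std(df, window):
    return df.rolling(window).std()

def moving_median(df, window):
    return df.rolling(window).median()


def make_window_features():
    # make_union: compute several rolling features from the same column.
    return make_union(
        FunctionTransformer(partial(past_observations, window=3)),
        FunctionTransformer(partial(moving_average, window=3)),
        FunctionTransformer(partial(moving_std, window=3)),
        FunctionTransformer(partial(moving_median, window=3)),
    )


feature_engineering_pipeline = make_pipeline(
    # make_column_transformer: run a different sub-pipeline per column subset.
    make_column_transformer(
        (make_window_features(), ["mean_yield"]),
        (make_window_features(), ["locations"]),
    ),
    # A DropRowsTransformer step is also added here in the article to remove
    # the rows left incomplete by the rolling windows.
    memory="cache",  # cache fitted steps on the local disk
)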

Another essential element is the memory="cache" argument. With it, the fitted transformers are cached on the local disk (here, in a directory called “cache”). On new runs, if a step’s inputs and parameters haven’t changed, its output is read from the cache instead of being recomputed, and the cache is invalidated automatically when something does change.

Now, with minimal effort, by running the transformations in parallel and caching intermediate outputs, your machine learning pipeline will be blazing fast.

Diagram of the feature_engineering_pipeline.
Diagram of the feature_engineering_pipeline. Note that in the notebook, the diagram is interactive. You have a dropdown showing you more details about every pipeline unit [Image by the Author].

Let’s run the feature_engineering_pipeline on our training features:

feature_engineered_yields = feature_engineering_pipeline.fit_transform(X_train.copy())
Train features computed by the feature_engineering_pipeline.
Train features computed by the feature_engineering_pipeline [Image by Author].

Now let’s run it on our testing features:

feature_engineering_pipeline.transform(X_test.copy())

3. Regressor Pipeline

Below you can see how to build the final regressor. On top of the feature_engineering_pipeline presented above, we stacked a scaler and a Random Forest model.
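A minimal sketch of that assembly (the scaler choice and the Random Forest defaults are assumptions):

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

regressor_pipeline = make_pipeline(
    feature_engineering_pipeline,
    StandardScaler(),
    RandomForestRegressor(random_state=42),  # the model is always the last step
)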

You can observe that the model is the last unit added to the pipeline.

4. Target Pipeline

The target_pipeline is used to preprocess the labels and postprocess the model’s predictions.

Everything will make sense in just a second.

Here we will show you how to use the TransformedTargetRegressor component.

The TransformedTargetRegressor class takes as arguments the following:

  • regressor: Takes as input the regressor_pipeline defined above, i.e., the kind of pipeline we are used to building.
  • transformer: Takes as input the target_pipeline, which preprocesses the labels (the ground truth) with its fit() and transform() methods before the regressor is trained. ALSO, when making predictions, it postprocesses the model’s output by calling inverse_transform(). How awesome is that? Finally, a way to pack all the steps into a single logical unit. A minimal wiring sketch follows below.
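A minimal sketch of that wiring (the check_inverse flag is an assumption):

from sklearn.compose import TransformedTargetRegressor

pipeline = TransformedTargetRegressor(
    regressor=regressor_pipeline,  # fitted on the transformed targets
    transformer=target_pipeline,   # transform(y) before fit, inverse_transform() on the predictions
    check_inverse=False,           # disable the automatic round-trip check if it is too strict
)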
Diagram of the entire pipeline.
Diagram of the entire Pipeline. As the Pipeline gets more complex, a good visualization will always be your friend [Image by the Author].

Train

Now the training step is just a one-liner.

pipeline.fit(X=X_train.copy(), y=y_train.copy())

Make Predictions

The excellent part is that making predictions is also a one-liner: call “pipeline.predict(X)” and that’s it, you have your predictions.

Using TransformedTargetRegressor, the predictions are already transformed back to their initial scale when calling predict. Therefore, the model/pipeline is highly compact and easy to deploy in various scenarios: batch, API, streaming, embedded, etc.

Another helpful feature is that you can quickly run GridSearch (or other tuning techniques) over both your features and your model. Different configurations of your data pipeline become hyper-parameters, so you can experiment with various features with only a few lines of code, as sketched below.
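For instance, a sketch of a grid search over the whole object; the parameter names depend on the generated step names of your own pipeline, so treat the ones below as placeholders:

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

param_grid = {
    # hyper-parameters of the model (nested inside the regressor pipeline)
    "regressor__randomforestregressor__n_estimators": [100, 300],
    "regressor__randomforestregressor__max_depth": [None, 5],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),  # keep the temporal order of the data
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train.copy(), y_train.copy())
print(search.best_params_)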

y_pred = pipeline.predict(X_test.copy())
y_pred
26 4.026584
0 4.122576
1 4.080378
2 4.174781
3 4.380293
Name: 0, dtype: float64

Test

For fun, let’s evaluate the model using the good old RMSE and MAPE metrics.
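A helper like evaluate() can be sketched roughly as follows (the logger name is an assumption):

import logging

from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

logger = logging.getLogger("sklearn-pipelines")


def evaluate(y_true, y_pred) -> None:
    rmse = mean_squared_error(y_true, y_pred) ** 0.5  # RMSE
    mape = mean_absolute_percentage_error(y_true, y_pred)
    logger.info("RMSE: %f", rmse)
    logger.info("MAPE: %f", mape)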

evaluate(y_test, y_pred)

INFO:sklearn-pipelines:RMSE: 0.147044
INFO:sklearn-pipelines:MAPE: 0.030220

We can observe that it is doing a decent job with a simple model and without any fine-tuning at all. An RMSE of ~0.15 on values of ~4.0 is pretty good.

But as stated a few times, this tutorial was about leveraging Sklearn Pipelines, not building an accurate model.

If you’ve gotten this far, you are fantastic. Now you know how to write professional Sklearn Pipelines. Thank you for reading my article!

Using a concrete example, we showed how powerful it is to leverage Sklearn Pipelines and their entire stack of components: TransformerMixin, BaseEstimator, FunctionTransformer, ColumnTransformer, FeatureUnion, TransformedTargetRegressor.

Using this approach, we built a flexible machine learning pipeline where we can:

  • Easily reuse the transformations and compose them in various ways (modular code).
  • Write clean and scalable classes.
  • Write blazing-fast code that computes all the features in parallel and caches intermediate checkpoints across the Pipeline.
  • Directly deploy the model as a simple class without further preprocessing/postprocessing steps.
  • Quickly perform hyper-parameter tuning on the feature/data pipeline and the model itself.

What other hidden gems do you know about Sklearn Pipelines?

You can find the full implementation of the example used within this tutorial on my GitHub. In that repository, I will keep pushing the examples I use in my articles.

[1] Iizumi, Toshichika, Global dataset of historical yields v1.2 and v1.3 aligned version, PANGAEA (2019). Supplement to: Iizumi, Toshichika; Sakai, T., The global dataset of historical yields for major crops 1981–2016, Scientific Data (2020).

If you enjoyed reading my article, you can follow me on LinkedIn for weekly insights about ML, MLOps, and freelancing.

