
Pipelines in Scikit-Learn: An Amazing Way to Bundle Transformations | by Eirik Berge, PhD | Apr, 2023



One of the most popular Python libraries for dealing with machine learning tasks is scikit-learn. It went public in 2010 and has since been essential for implementing popular supervised ML algorithms like logistic regression, random forests, and support vector machines.

When writing code in scikit-learn, you can use a feature called pipelines. This feature allows you to bundle several steps of the machine learning process into a single component. The use of pipelines is one of the biggest factors in how easy scikit-learn code is to work with. It’s frustrating how many people neglect pipelines when creating machine learning models in scikit-learn 😞

In this blog post, you will learn the advantages of scikit-learn pipelines. After reading this, you should feel confident in applying pipelines to your own machine learning projects. Let’s jump in 👍

The goal of pipelines is to encapsulate several steps in a machine learning project into a single manageable piece. To illustrate this, let us start with the following setup code:

from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Create data
X, y = make_classification(random_state=42)

# Split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

As the imports in the code block above suggest, we are going to scale the data and then use a random forest model for classification. Without pipelines, this would look something like this:

# Scale the data
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the random forest
random_forest = RandomForestClassifier(n_estimators=10)
random_forest.fit(X_train_scaled, y_train)

# Predict with the random forest
random_forest.predict(X_test_scaled)

There are both serious and minor problems with the code above! Let’s mention some of them:

  • Data Leakage: The code above handles a single train/test split correctly (the scaler is fit only on X_train), but doing the preprocessing by hand makes leakage easy to introduce. If you scale the data before splitting, or scale the whole training set once and then run cross-validation on it, the minimum and maximum values used to transform each evaluation fold are computed partly from that fold's own rows. Information about the evaluation data then sneaks into the preprocessing, which can make the measured accuracy of the random forest a bit optimistic! A sketch of how this typically happens follows right after this list.
  • Intermediate Variable Names: We created the intermediate variable names X_train_scaled and X_test_scaled. This is only necessary because we have the scaling and the training as completely separate processes.
  • Verbose Function Calls: Several of the lines above are just intermediate .fit() and .transform() calls. This clutters the code and makes it harder to read.
  • Hyperparameter Search: Say we want to search over both the feature_range parameter of the MinMaxScaler and the n_estimators parameter of the RandomForestClassifier. With the scaling and training split into separate objects, a tool like GridSearchCV only sees one of them at a time, so we would have to run the searches separately. Not only is that more cumbersome, we would also greedily optimize each component on its own rather than looking for a combined optimum, which can make us miss the best solution.
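
To make the leakage point concrete, here is a minimal sketch of the leaky pattern once cross-validation enters the picture; the scaler and forest settings are simply the ones from above:

from sklearn.model_selection import cross_val_score

# Leaky: the scaler is fit on all of X_train before cross-validation,
# so every validation fold is transformed with min/max values that were
# computed partly from its own rows
X_train_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_train)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=10), X_train_scaled, y_train, cv=5
)

With a pipeline, the scaler is instead refit inside every training fold, as we will see below.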

It should not come as a surprise that pipelines are what will save the day. Compare the above code with the following code snippet:

from sklearn.pipeline import Pipeline

# Create a pipeline that combines the scaling and training
pipe = Pipeline([
    ('scaler', MinMaxScaler(feature_range=(0, 1))),
    ('forest', RandomForestClassifier(n_estimators=10))
])

# Fit the pipeline to the training data
pipe.fit(X_train, y_train)

# Evaluate the accuracy on the test data
pipe.score(X_test, y_test)
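
If you want class predictions rather than an accuracy score, the same convenience applies: the pipeline first transforms X_test with the scaler fitted on the training data and then calls the forest, all in one call:

# Predict labels for the test data; scaling happens automatically inside the pipeline
y_pred = pipe.predict(X_test)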

The code with the pipeline is shorter, sweeter, and avoids all the problems listed above:

  • Data Leakage: All fitting now goes through the pipeline, so the scaler is only ever fit on the data passed to .fit(), and during cross-validation or a hyperparameter search only on each training fold. This removes the most common way preprocessing leaks information into the evaluation!
  • Intermediate Variable Names: You don’t need any intermediate variable names like X_train_scaled and X_test_scaled anymore!
  • Fewer Function Calls: With the pipeline, a single .fit() call executes the whole sequence. This makes the code easier to read!
  • Hyperparameter Search: Once you have a pipeline, you can do a hyperparameter search over all of its components at once. The method .get_params() lists the parameter names of every transformer/estimator in the pipeline, exposed as <step name>__<parameter name>. This is nicely explained in the blog post Integrate Pipeline into Scikit-Learn’s Hyperparameter Search!
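
To illustrate the last point, here is a minimal sketch of a joint search over both steps with GridSearchCV. The step names 'scaler' and 'forest' come from the pipeline above, and the candidate values in the grid are just example choices:

from sklearn.model_selection import GridSearchCV

# Parameters of pipeline steps are addressed as <step name>__<parameter name>
param_grid = {
    "scaler__feature_range": [(0, 1), (-1, 1)],
    "forest__n_estimators": [10, 50, 100],
}

# The scaler is refit on each training fold, so the search stays leakage-free
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)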

After fitting a pipeline in scikit-learn, there are certain attributes that will make your life a lot easier. I have been guilty of neglecting these and have paid the price 😅

The first is the attribute named_steps:

# Gives us the components of the pipeline
print(pipe.named_steps)

# Output:
{'scaler': MinMaxScaler(), 'forest': RandomForestClassifier(n_estimators=10)}

# Can now access each of them
print(pipe.named_steps["forest"])

# Output:
RandomForestClassifier(n_estimators=10)

Whenever you have a composite object (like pipelines in scikit-learn), it is useful to know how to access the individual components.
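
For example, once the pipeline has been fitted you can pull out a step and inspect what it learned. A small sketch using the standard fitted attributes of MinMaxScaler and RandomForestClassifier:

# Inspect what the individual steps learned during fitting
fitted_scaler = pipe.named_steps["scaler"]
print(fitted_scaler.data_min_)   # per-feature minimums seen in the training data
print(fitted_scaler.data_max_)   # per-feature maximums seen in the training data

fitted_forest = pipe.named_steps["forest"]
print(fitted_forest.feature_importances_)  # importance of each scaled feature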

Another useful attribute is .n_features_in_. This tells you how many features were passed into the first implicit .fit() call in the pipeline (in our case, the .fit() method of the MinMaxScaler):

# Gives us the number of features passed into the pipeline
print(pipe.n_features_in_)

# Output:
20

Finally, you can also use the utility function make_pipeline() to create pipelines in scikit-learn. The difference is that make_pipeline automatically names each step after its lowercased class name:

from sklearn.pipeline import make_pipeline

# Automatically assign names to the components
pipe = make_pipeline(MinMaxScaler(), RandomForestClassifier(n_estimators=10))
print(pipe)

# Output:
Pipeline(steps=[
    ('minmaxscaler', MinMaxScaler()),
    ('randomforestclassifier', RandomForestClassifier(n_estimators=10))
])

I personally prefer the utility function make_pipeline, as then I don’t need to come up with names myself. When many developers are working on different pipelines, make_pipeline also keeps the step names consistent.
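
One thing to keep in mind: with make_pipeline, those auto-generated step names (the lowercased class names) are what you use as prefixes in a parameter grid. A quick sketch, with example candidate values:

from sklearn.model_selection import GridSearchCV

# The auto-generated step name 'randomforestclassifier' becomes the prefix
param_grid = {"randomforestclassifier__n_estimators": [10, 50, 100]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)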

If you want to learn more about nested parameters and caching when it comes to pipelines, then check out the pipeline user guide for more information.
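
As a small taste of what the user guide covers, here is a sketch of both features; the temporary cache directory below is just a throwaway choice for illustration:

from tempfile import mkdtemp

# Cache fitted transformers on disk so they are not refit needlessly,
# for example during a search that only varies the forest's parameters
cache_dir = mkdtemp()
cached_pipe = Pipeline(
    [('scaler', MinMaxScaler()), ('forest', RandomForestClassifier(n_estimators=10))],
    memory=cache_dir,
)

# Nested parameters use the same <step name>__<parameter name> syntax
cached_pipe.set_params(forest__n_estimators=50)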


Hopefully, you now understand how and why you should use pipelines when writing machine learning code in Scikit-Learn. If you are interested in data science, programming, or anything in between, then feel free to add me on LinkedIn and say hi ✋

Like my writing? Check out some of my other posts for more Python content:

