Data Preprocessing Using Pipeline in Pandas | by Zoumana Keita | Sep, 2022

By Jessie Hobb On Sep 9, 2022

Write better code using the Pandas Pipe function

Real-world data goes through many preprocessing phases, such as quality assessment, cleaning, transformation, and reduction. In Pandas, these steps are often written as a series of separate statements with intermediate variables, which is not always efficient or easy to follow.

What if we could run all the processing functions as a single, readable chain? This is where Pandas' Pipe comes in handy.

In this short article, we will first understand what Pandas’ Pipe is. Then we will cover a few practice cases showing it in action.

By the end of this section, you will understand what Pipe is, be familiar with the dataset, and have a clear view of the business problem we are trying to solve.

What is Pipe?

Pipe is a Pandas method (DataFrame.pipe) that applies a function to a DataFrame and returns the result. By chaining several pipe calls, the result of each function is passed to the next one in the chain. It is a good way to make the code cleaner and more readable.
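
To make the idea concrete, here is a minimal sketch. The helper functions add_one and double and the toy column x are hypothetical and are not part of the dataset used later; the point is only that df.pipe(f) is another way of writing f(df), so several steps can be read from top to bottom.

```python
import pandas as pd

# Hypothetical helpers, used only to illustrate chaining.
def add_one(df):
    return df.assign(x=df["x"] + 1)

def double(df):
    return df.assign(x=df["x"] * 2)

toy = pd.DataFrame({"x": [1, 2, 3]})

nested = double(add_one(toy))               # reads inside-out
chained = toy.pipe(add_one).pipe(double)    # reads left to right

# Both forms produce the same DataFrame.
assert nested.equals(chained)
```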

Now that we know what we are dealing with, let’s have a look at the data set for the practice cases.

Let’s create the dataset

This dataset might speak more to hiring managers, but don't worry if you're not one 😀. Basically, it stores information about applicants from different big African cities. As you can see below, most of the columns are self-explanatory.

pipe_candidates.py

Original raw information about candidates (Image by Author)

Full_Name is the full name of the candidate.
Degree is the candidate’s degree at the time of the application.
From corresponds to where the candidate is from.
Application_date is when the candidate applied for the position.
From_office (min) is the commute time (in minutes) to the company’s offices in the candidate’s local city.
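
A dataset with these columns could be built as sketched below. The three rows are made up for illustration and are not the author's data; the ISO date format is also an assumption, since the original date format is not shown here.

```python
import pandas as pd

# Made-up rows for illustration only; names, cities, dates and commute
# times are assumptions, not the values from the original dataset.
candidates_df = pd.DataFrame({
    "Full_Name": ["Awa, Traore", "Kofi, Mensah", "Amina, Bello"],
    "Degree": ["Master", "Bachelor", "PhD"],
    "From": ["Bamako", "Accra", "Lagos"],
    "Application_date": ["2022-09-09", "2021-12-02", "2022-03-15"],
    "From_office (min)": [120, 95, 75],
})

print(candidates_df)
```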

Problem Statement

Based on that information, the hiring manager wants you, as a Data Scientist, to create the following table to help him get more granular information about each candidate.

Final expected table based on the business requirements (Image by Author)

Standalone preprocessing before using Pipe

One more thing before getting our hands dirty! We need to implement a function for each of the tasks shown in the image. The goal of this section is to run each task on its own and check its result before using Pipe to get the final result.

→ Task 1: First and Last names from Full_Name. The full name is in the following format: [First Name] [comma] [Last Name], so we can use the following function to get the job done.

result_task1.py

From line 11, we get the following result from the execution of task 1.

Task 1: From Full Name to First and Last names (Image by Author)
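
One way to write such a function is sketched below. It is a minimal sketch and not necessarily the author's implementation: the function name extract_names, the output column names First_Name and Last_Name, and the two sample rows are assumptions made for illustration.

```python
import pandas as pd

def extract_names(df, full_name_col="Full_Name"):
    """Return a copy of df with First_Name and Last_Name columns added.

    Assumes every value looks like "First, Last" (one comma separator).
    """
    parts = df[full_name_col].str.split(",", n=1, expand=True)
    return df.assign(
        First_Name=parts[0].str.strip(),
        Last_Name=parts[1].str.strip(),
    )

# Made-up sample rows.
sample = pd.DataFrame({"Full_Name": ["Awa, Traore", "Kofi, Mensah"]})
task1 = extract_names(sample)
print(task1)
```

The function returns a new DataFrame through assign rather than modifying its input, which keeps each step safe to reuse inside a chain.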

→ Task 2: From the application date, we want to extract the day, the month, the year, the day of the week, and the month of the year.

result_task2.py

From line 15, we get the following result from the execution of task 2.

Task 2: From Application date to more granular time information (Image by Author)
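
A possible version of this step is sketched below, using the Pandas .dt accessor. The function name extract_date_parts, the new column names, the choice of month number for Month and month name for Month_of_year, and the sample dates are all assumptions; the author's listing may differ.

```python
import pandas as pd

def extract_date_parts(df, date_col="Application_date"):
    """Return a copy of df with day, month, year and name columns added."""
    dates = pd.to_datetime(df[date_col])
    return df.assign(
        Day=dates.dt.day,
        Month=dates.dt.month,
        Year=dates.dt.year,
        Day_of_week=dates.dt.day_name(),
        Month_of_year=dates.dt.month_name(),
    )

# Made-up sample dates in ISO format (an assumption about the input).
sample = pd.DataFrame({"Application_date": ["2022-09-09", "2021-12-02"]})
task2 = extract_date_parts(sample)
print(task2)
```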

→ Task 3: Your final task to make your hiring manager happy is to create, for each candidate, a piece of textual information in the following format, where [Candidate] is the full name of the candidate:

[Candidate] holds a [Degree] and lives [From_office (min)] away from the office.

result_task3.py

Task 3: Generation of candidates’ information from all the columns (Image by Author)
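
The template above could be implemented roughly as follows. This is a sketch under assumptions: the function name generate_info and the sample rows are hypothetical, and the sentence follows the template literally, so the commute time is inserted without a unit.

```python
import pandas as pd

def generate_info(df):
    """Return a copy of df with an Info sentence for each candidate."""
    info = (
        df["Full_Name"]
        + " holds a " + df["Degree"]
        + " and lives " + df["From_office (min)"].astype(str)
        + " away from the office."
    )
    return df.assign(Info=info)

# Made-up sample rows.
sample = pd.DataFrame({
    "Full_Name": ["Awa, Traore", "Kofi, Mensah"],
    "Degree": ["Master", "Bachelor"],
    "From_office (min)": [120, 95],
})
task3 = generate_info(sample)
print(task3["Info"].tolist())
```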

Function chaining with Pipe

Now that we know how each function works and what result to expect, it is time to combine all of them into a single pipeline using Pipe. No additional libraries are needed because Pipe is built into Pandas. This is the general syntax:

final_dataset = (
    my_original_data
    .pipe(my_first_function, "single_column")
    .pipe(my_second_function, ["col1", ..., "colN"])
    .pipe(my_third_function)
    # ...
    .pipe(my_nth_function)
)

final_dataset is the final preprocessed dataset after applying all the functions.
my_original_data is the raw dataset.
pipe(my_first_function, "single_column") means that my_first_function needs single_column to complete the task.
pipe(my_second_function, ["col1", ..., "colN"]) means that my_second_function needs the columns col1 through colN to complete the task.
pipe(my_third_function) means that there is no need to specify a column name.

In every case, the DataFrame is passed as the first argument of the function, and any extra arguments given to pipe are forwarded after it.
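
The way pipe hands arguments to each function can be checked with a small sketch. The helpers uppercase_column and add_suffix_to and the toy columns are hypothetical; they only show a single column name, a list of column names, and a keyword argument being forwarded.

```python
import pandas as pd

# Hypothetical helper taking one column name.
def uppercase_column(df, column):
    return df.assign(**{column: df[column].str.upper()})

# Hypothetical helper taking a list of column names and a keyword argument.
def add_suffix_to(df, columns, suffix="_raw"):
    return df.rename(columns={c: c + suffix for c in columns})

toy = pd.DataFrame({"city": ["accra", "lagos"], "degree": ["msc", "phd"]})

result = (
    toy
    .pipe(uppercase_column, "city")                            # uppercase_column(toy, "city")
    .pipe(add_suffix_to, ["city", "degree"], suffix="_clean")  # keyword arguments are forwarded too
)
print(result)
```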

Below is the pipe logic corresponding to the general syntax above. Note that the order of the steps does not matter in our case. It would matter if we needed First Name and Last Name to compute the Info column, but that is not what we are doing here, because we use Full_Name directly.

create_pipe.py

Final result combining all the tasks from 1 to 3 using Pipe (Image by Author)
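
Putting the three tasks together, the whole pipeline might look like the sketch below. Everything here is an assumption made for illustration rather than the author's code: the function names, the new column names, and the made-up rows.

```python
import pandas as pd

def extract_names(df, full_name_col):
    """Task 1: split "First, Last" into two columns."""
    parts = df[full_name_col].str.split(",", n=1, expand=True)
    return df.assign(
        First_Name=parts[0].str.strip(),
        Last_Name=parts[1].str.strip(),
    )

def extract_date_parts(df, date_col):
    """Task 2: derive date components from the application date."""
    dates = pd.to_datetime(df[date_col])
    return df.assign(
        Day=dates.dt.day,
        Month=dates.dt.month,
        Year=dates.dt.year,
        Day_of_week=dates.dt.day_name(),
        Month_of_year=dates.dt.month_name(),
    )

def generate_info(df):
    """Task 3: build the Info sentence from Full_Name, Degree and commute."""
    info = (
        df["Full_Name"]
        + " holds a " + df["Degree"]
        + " and lives " + df["From_office (min)"].astype(str)
        + " away from the office."
    )
    return df.assign(Info=info)

# Made-up rows for illustration only.
candidates_df = pd.DataFrame({
    "Full_Name": ["Awa, Traore", "Kofi, Mensah"],
    "Degree": ["Master", "Bachelor"],
    "From": ["Bamako", "Accra"],
    "Application_date": ["2022-09-09", "2021-12-02"],
    "From_office (min)": [120, 95],
})

final_df = (
    candidates_df
    .pipe(extract_names, "Full_Name")
    .pipe(extract_date_parts, "Application_date")
    .pipe(generate_info)
)
print(final_df.columns.tolist())
```

Because each step returns a new DataFrame, candidates_df itself is left unchanged and the chain can be rerun or reordered safely.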

Congratulations! 🎉 🍾 You have just learned how to use Pandas Pipe to chain multiple functions! I hope this article was helpful and that it will help you take your preprocessing tasks to the next level.
