Data Preprocessing Using Pipeline in Pandas | by Zoumana Keita | Sep, 2022
Write better code using Pandas Pipe function
Real-life data goes through many preprocessing phases, such as quality assessment, cleaning, transformation, and reduction. Most of the time, these steps are not implemented in the most efficient way when using Pandas.
What if we could have an approach that executes all the processing functions in a chain, in a clean and efficient manner? This is where Pandas’ Pipe comes in handy.
In this short article, we will first understand what Pandas’ Pipe is. Then we will cover a few practice cases showing it in action.
By the end of this section, you will understand what Pipe is, have a good idea of the dataset, and have a better understanding of the business problem we are trying to solve.
What is Pipe?
Pipe is a Pandas method used to chain multiple functions by passing the result of each function to the next one in the chain. It is a good way to make the code cleaner and more readable.
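To make the idea concrete, here is a minimal sketch. Calling df.pipe(func, *args, **kwargs) is equivalent to calling func(df, *args, **kwargs), so a chain of pipe calls reads top to bottom instead of inside out. The two helper functions and the toy data are made up for illustration and are not part of the use case covered later.

```python
import pandas as pd

# Toy data, made up for illustration only.
df = pd.DataFrame({"value": [1, 2, 3]})


def add_constant(data, amount):
    """Return a copy of `data` with `amount` added to the `value` column."""
    return data.assign(value=data["value"] + amount)


def double(data):
    """Return a copy of `data` with the `value` column doubled."""
    return data.assign(value=data["value"] * 2)


# Nested calls read inside out ...
nested = double(add_constant(df, 10))

# ... while the equivalent pipe chain reads top to bottom.
piped = df.pipe(add_constant, 10).pipe(double)

print(piped["value"].tolist())  # [22, 24, 26]
```

Both versions give the same result. The pipe version simply keeps the steps in the order in which they are executed.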
Now that we know what we are dealing with, let’s have a look at the data set for the practice cases.
Let’s create the dataset
This dataset might speak more to hiring managers, but don’t worry if you’re not one 😀. It stores information about applicants from different big African cities. As you can see below, most of the columns are self-explanatory.
Full_Name is the full name of the candidate.
Degree is the candidate’s degree during the application.
From corresponds to where the candidate is from.
Application_date is when the candidate applied for the position.
From_office (min) is the commute time (in minutes) to the company’s offices in the candidate’s local city.
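A dataset with these five columns could be built as sketched below. The rows are hypothetical: the names, degrees, cities, dates, and commute times are invented for illustration, and the ISO date format is an assumption.

```python
import pandas as pd

# Hypothetical sample rows, invented for illustration.
# The date format (YYYY-MM-DD) is an assumption.
candidates = pd.DataFrame(
    {
        "Full_Name": ["Aminata,Diallo", "Kwame,Mensah", "Fatou,Ndiaye"],
        "Degree": ["Master", "Bachelor", "PhD"],
        "From": ["Bamako", "Accra", "Dakar"],
        "Application_date": ["2022-09-23", "2021-12-02", "2022-01-15"],
        "From_office (min)": [50, 20, 35],
    }
)

print(candidates)
```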
Problem Statement
Based on that information, the hiring manager wants you, as a Data Scientist, to build a table that gives more granular information about each candidate: the first and last names, the components of the application date, and a short textual summary.
Standalone preprocessing before using Pipe
One more thing before getting our hands dirty! We need to implement one function for each task. The goal of this section is to run each task on its own and check its result before using Pipe to get the final result.
→ Task 1: extract the first and last names from Full_Name. The full name has the format [First Name] [comma] [Last Name], so we can split it on the comma with the following function.
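As a minimal sketch, such a function could look like this. The function name, the output column names First_Name and Last_Name, and the sample rows are assumptions made for illustration.

```python
import pandas as pd


def get_first_and_last_names(data, column="Full_Name"):
    """Split `column` on the first comma into First_Name and Last_Name."""
    data = data.copy()  # leave the caller's DataFrame untouched
    parts = data[column].str.split(",", n=1, expand=True)
    # Strip stray spaces around the comma, e.g. "Kwame, Mensah".
    data["First_Name"] = parts[0].str.strip()
    data["Last_Name"] = parts[1].str.strip()
    return data


# Hypothetical sample rows, invented for illustration.
candidates = pd.DataFrame({"Full_Name": ["Aminata,Diallo", "Kwame, Mensah"]})

names = get_first_and_last_names(candidates)
print(names[["First_Name", "Last_Name"]])
```

Working on a copy keeps the original DataFrame unchanged, which makes the function safe to reuse later inside a pipe chain.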
Applying the function to the dataset adds the first and last names as two separate columns.
→ Task 2: from the application date, we want to extract the day, the month, the year, the day of the week, and the month of the year.
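One way to sketch this task is with the Pandas .dt accessor. The function name and the output column names are assumptions, the sample dates are invented, and "the month of the year" is interpreted here as the month name, which is a guess.

```python
import pandas as pd


def get_date_parts(data, column="Application_date"):
    """Add day, month, year, weekday name, and month name columns."""
    data = data.copy()  # leave the caller's DataFrame untouched
    dates = pd.to_datetime(data[column])
    data["Day"] = dates.dt.day
    data["Month"] = dates.dt.month
    data["Year"] = dates.dt.year
    data["Day_of_week"] = dates.dt.day_name()
    # Assumption: "month of the year" means the month name.
    data["Month_name"] = dates.dt.month_name()
    return data


# Hypothetical sample dates, invented for illustration.
candidates = pd.DataFrame({"Application_date": ["2022-09-23", "2021-12-02"]})

date_parts = get_date_parts(candidates)
print(date_parts)
```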
Applying the function to the dataset adds one column for each of these date components.
→ Task 3: your final task to make your hiring manager happy is to create, for each candidate, a piece of textual information in the following format, where [Candidate] is the full name of the candidate:
[Candidate] holds a [Degree] and lives [From_office (min)] away from the office.
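The template above can be sketched with plain string concatenation. The function name, the Info column name, and the sample row are assumptions. The sentence follows the template literally, so the commute time appears as a bare number.

```python
import pandas as pd


def get_info(data, columns=("Full_Name", "Degree", "From_office (min)")):
    """Add an Info column that summarises each candidate in one sentence."""
    name_col, degree_col, commute_col = columns
    data = data.copy()  # leave the caller's DataFrame untouched
    data["Info"] = (
        data[name_col]
        + " holds a "
        + data[degree_col]
        + " and lives "
        + data[commute_col].astype(str)  # numbers must be cast to text
        + " away from the office."
    )
    return data


# Hypothetical sample row, invented for illustration.
candidates = pd.DataFrame(
    {
        "Full_Name": ["Aminata,Diallo"],
        "Degree": ["Master"],
        "From_office (min)": [50],
    }
)

info = get_info(candidates)
print(info["Info"].iloc[0])
```

If a unit is wanted in the sentence, " minutes" could be appended after the commute time.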
Function chaining with Pipe
Now that we know how each function works and what result to expect, it is time to combine them into a single pipeline using Pipe. No additional library is needed because Pipe is built into Pandas. This is the general syntax:
final_dataset = (
    my_original_data
    .pipe(my_first_function, "single_column")
    .pipe(my_second_function, ["col1", ..., "colN"])
    .pipe(my_third_function)
    ...
    .pipe(my_nth_function)
)
final_dataset is the final preprocessed dataset after applying all the functions.
my_original_data is the raw dataset.
pipe(my_first_function, "single_column") means that my_first_function needs single_column to complete the task.
pipe(my_second_function, ["col1", ..., "colN"]) means that my_second_function needs the columns col1 to colN to complete the task.
pipe(my_third_function) means that there is no need to specify the column name.
In each case, the DataFrame is passed as the first argument of the function, and any extra argument given to pipe is forwarded to the function after it.
Below is the pipe logic corresponding to the previous format. Note that the order of the steps does not matter in our case. It would matter if we needed the First Name and Last Name to compute the Info column, which is not the case here because we use Full_Name directly.
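Putting the three tasks together, the pipeline could be sketched as follows. The function names, the output column names, and the sample rows are assumptions made for illustration. The last lines check that a different order of steps gives the same values.

```python
import pandas as pd


def get_first_and_last_names(data, column):
    """Task 1: split the full name on the first comma."""
    parts = data[column].str.split(",", n=1, expand=True)
    return data.assign(
        First_Name=parts[0].str.strip(),
        Last_Name=parts[1].str.strip(),
    )


def get_date_parts(data, column):
    """Task 2: extract the components of the application date."""
    dates = pd.to_datetime(data[column])
    return data.assign(
        Day=dates.dt.day,
        Month=dates.dt.month,
        Year=dates.dt.year,
        Day_of_week=dates.dt.day_name(),
        Month_name=dates.dt.month_name(),
    )


def get_info(data, columns):
    """Task 3: build the one-sentence summary of each candidate."""
    name_col, degree_col, commute_col = columns
    info = (
        data[name_col] + " holds a " + data[degree_col] + " and lives "
        + data[commute_col].astype(str) + " away from the office."
    )
    return data.assign(Info=info)


# Hypothetical sample rows, invented for illustration.
candidates = pd.DataFrame(
    {
        "Full_Name": ["Aminata,Diallo", "Kwame,Mensah"],
        "Degree": ["Master", "Bachelor"],
        "From": ["Bamako", "Accra"],
        "Application_date": ["2022-09-23", "2021-12-02"],
        "From_office (min)": [50, 20],
    }
)

info_columns = ["Full_Name", "Degree", "From_office (min)"]

final_dataset = (
    candidates
    .pipe(get_first_and_last_names, "Full_Name")
    .pipe(get_date_parts, "Application_date")
    .pipe(get_info, info_columns)
)

# Same steps in a different order: only the column order changes.
reordered = (
    candidates
    .pipe(get_info, info_columns)
    .pipe(get_date_parts, "Application_date")
    .pipe(get_first_and_last_names, "Full_Name")
)

print(final_dataset.columns.tolist())
print(reordered[final_dataset.columns].equals(final_dataset))
```

Because every function returns a new DataFrame through assign, the raw dataset is left untouched and each step can be tested on its own.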
Congratulations! 🎉 🍾 You have just learned how to use Pandas Pipe to chain multiple functions! I hope this article was helpful and that it will help you take your preprocessing tasks to the next level.
Bye for now 🏃🏾‍♂️