
Five Killer Optimization Techniques Every Pandas User Should Know | by Avi Chawla | Jul, 2022



A step towards data analysis run-time optimization

Photo by Brad Neathery on Unsplash

The motivation to design and build real-world applicable machine learning models has always driven Data Scientists to leverage optimized, efficient, and accurate methods at scale. Optimization plays a foundational role in sustainably delivering real-world and user-facing software solutions.

While I understand that not everyone is building solutions at scale, awareness about various optimization and time-saving techniques is nevertheless helpful and highly applicable to even generic Data Science/Machine Learning use-cases.

Therefore, in this post, I will introduce you to a handful of incredible techniques to reduce the run-time of your regular tabular data analysis, management, and processing tasks using Pandas. To get a brief overview, I will discuss the following topics in this post:

#1 Input/Output on CSV
#2 Filtering Based on Categorical data
#3 Merging DataFrames
#4 Value_counts() vs GroupBy()
#5 Iterating over a DataFrame

Moreover, you can get the notebook for this post here.

Let’s begin 🚀!

#1 Input/Output on CSV

CSV files are by far the most prevalent format to read DataFrames from and store DataFrames to, aren’t they? This is because CSVs provide tremendous flexibility in the context of input and output operations through the pd.read_csv() and df.to_csv() methods, such as:

  1. CSVs can be opened in Excel and manipulated in every way Excel allows you to.
  2. CSVs enable you to read only a subset of columns if needed, by specifying the desired columns as a list and passing it as the usecols argument of the pd.read_csv() method.
  3. You can read only the first n rows if needed using the nrows argument of the pd.read_csv() method, etc.

While I admit that there are numerous advantages of using CSV files, at the same time, they are far from being the go-to method if you are looking for run-time optimization. Let me explain.

Input-output operations with Pandas on a CSV file are serialized, inevitably making them highly inefficient and time-consuming. While there is ample scope to parallelize these operations, Pandas, unfortunately, does not offer this functionality (yet).

Until then, if you are stuck with reading CSV files, there are two considerably faster alternatives you can take, which I have depicted in the flow chart below:

Flow chart to determine the alternative for reading CSV (Image by author).

Path 1

If your CSV file is static and you expect to read it multiple times, possibly in the same pipeline or after reloading the kernel, immediately save it as a Pickle, Feather, or Parquet file. Why? I have already discussed this in my post below:

This conversion from CSV format to your desired alternative is demonstrated in the code block below:
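A minimal sketch of this conversion (the file names here are illustrative; Feather and Parquet additionally require pyarrow or fastparquet to be installed):

```python
import pandas as pd

# Read the original CSV once
df = pd.read_csv("data.csv")

# Save it in the faster binary formats
df.to_pickle("data.pickle")     # Pickle
df.to_feather("data.feather")   # Feather (needs pyarrow)
df.to_parquet("data.parquet")   # Parquet (needs pyarrow or fastparquet)
```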

Now, when you want to read the DataFrame back, instead of reading it from the CSV file, read it from the new file you created. The corresponding methods in Pandas to reload the dataset are shown below:
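A sketch of the corresponding read methods, assuming the file names used above:

```python
df_pickle = pd.read_pickle("data.pickle")
df_feather = pd.read_feather("data.feather")
df_parquet = pd.read_parquet("data.parquet")
```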

Moreover, each of these formats loads the records back as a Pandas DataFrame. This can be verified using the type() function in Python as follows:
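For example:

```python
print(type(df_pickle))   # <class 'pandas.core.frame.DataFrame'>
print(type(df_feather))  # <class 'pandas.core.frame.DataFrame'>
print(type(df_parquet))  # <class 'pandas.core.frame.DataFrame'>
```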

The bar plot below depicts the expected speed-up in run-time for all four file formats:

The time taken to load and save a DataFrame in respective formats. (Image by author)

To obtain the run-time of the four formats, I generated a random dataset in Python with a million rows and thirty columns — encompassing string, float, and integer data types. I measured the load and the save run-time ten times to reduce randomness and draw fair conclusions from the observed results. The results above indicate averages across the ten experiments.

Path 2

If your CSV file isn’t static or you are going to use the CSV file just once, conversion to a new format does not make sense. Instead, take Path 2, i.e., use the datatable library for input and output operations. You can install datatable using the following command in a Jupyter Notebook:
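```python
!pip install datatable
```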

When using datatable, the CSV file will be read as a datatable Frame, not a Pandas DataFrame. Therefore, after loading the CSV file, you need to convert it to a Pandas DataFrame. I have implemented this below:
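A minimal sketch, assuming the same data.csv file as before:

```python
import datatable as dt

# fread() reads the CSV in parallel and returns a datatable Frame
dt_frame = dt.fread("data.csv")

# Convert the datatable Frame to a Pandas DataFrame
df = dt_frame.to_pandas()
```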

Similarly, if you want to store a Pandas DataFrame to a CSV, prefer taking the datatable route instead of Pandas. Here, to generate a CSV file using datatable, you first need to convert the Pandas DataFrame to a datatable Frame and then store it as a CSV file. This is implemented below:
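A sketch of the reverse direction (the output file name is illustrative):

```python
# Convert the Pandas DataFrame to a datatable Frame, then write it out as CSV
dt.Frame(df).to_csv("output.csv")
```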

Left: Line chart depicting the time taken to store DataFrame to CSV using Pandas and DataTable. Right: Line chart depicting the time taken to read DataFrame from CSV using Pandas and DataTable. (Images by author)

As depicted in the line charts above, datatable provides much faster input and output operations on CSV files than Pandas.

Key Takeaways/Final Thoughts

  1. If you are bound to use a CSV file due to some restrictions, never use the Pandas read_csv() and to_csv() methods. Instead, prefer datatable’s input-output methods, as shown above.
  2. If you will repeatedly read the same CSV file, convert it to Pickle, Feather, or Parquet once, and then use the new file for input operations.

#2 Filtering Based on Categorical data

Data filtering is another common and widely used operation in Pandas. The core idea is to select the segment of a DataFrame that adheres to a specific condition.

To demonstrate, consider a dummy DataFrame of over 4 million records I created myself. The first five rows are shown in the image below:

The first five rows of the dummy dataset (Image by author)

The code block below demonstrates my implementation:
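For illustration, a dummy dataset along these lines can be generated as follows (the column names, company list, and value ranges here are only indicative; the actual notebook may differ):

```python
import numpy as np
import pandas as pd

n = 4_000_000
companies = ["Amazon", "Google", "Microsoft", "Apple", "Netflix"]  # illustrative

df = pd.DataFrame({
    "Company Name": np.random.choice(companies, size=n),
    "Salary": np.random.randint(40_000, 200_000, size=n),  # assumed salary column
})
```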

Say you want to filter all the records which belong to “Amazon”. This can be done as follows:
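A sketch, assuming the company column is named Company Name:

```python
# Approach 1: standard boolean-mask filtering
amazon_df = df[df["Company Name"] == "Amazon"]
```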

Another way of doing the same filtering is by using groupby() and obtaining the individual group using the get_group() method as shown below:
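A sketch of the groupby() route:

```python
# Approach 2: group once, then fetch individual groups as needed
groups = df.groupby("Company Name")
amazon_df = groups.get_group("Amazon")
```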

The latter method provides speed-ups of up to 14 times as compared to the usual filtering method, which is a tremendous improvement in the run-time.

Moreover, the get_group() method returns the individual group as a Pandas DataFrame, so you can proceed with your usual analysis afterwards. We can verify this by checking the type of the DataFrame obtained in Approach 1 and Approach 2 as follows:
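For example:

```python
print(type(df[df["Company Name"] == "Amazon"]))  # <class 'pandas.core.frame.DataFrame'>
print(type(groups.get_group("Amazon")))          # <class 'pandas.core.frame.DataFrame'>
```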

Key Takeaways/Final Thoughts

  1. If you will perform repeated filtering of your DataFrame on categorical data, prefer grouping the data first using the groupby() method. After that, fetch the desired groups using the get_group() method.
  2. Caveat: This approach is only applicable to filtering based on categorical data.

#3 Merging DataFrames

Merge in Pandas refers to combining two DataFrames based on a join condition, similar to joins in Structured Query Language (SQL). You can execute a merge using the pd.merge() method in Pandas as follows:
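A minimal sketch with two hypothetical DataFrames sharing a key column (the column names and sizes are illustrative):

```python
import numpy as np
import pandas as pd

n = 1_000_000
df1 = pd.DataFrame({"key": np.arange(n), "value_left": np.random.rand(n)})
df2 = pd.DataFrame({"key": np.arange(n), "value_right": np.random.rand(n)})

# Standard merge on the shared column
merged = pd.merge(df1, df2, on="key", how="inner")
```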

Although there is nothing wrong with the above method of linking DataFrames, there is a faster alternative: joining the two DataFrames with the join() method.

In the code block below, I have implemented the merge operation using the merge() method and the join() method. Here, we measure the time taken for the merge operation using the two methods.
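A sketch of the comparison, using the hypothetical df1 and df2 above (%timeit is an IPython/Jupyter magic):

```python
# merge() works directly on the column
%timeit pd.merge(df1, df2, on="key", how="inner")

# join() expects the join key to be the index of both DataFrames
df1_indexed = df1.set_index("key")
df2_indexed = df2.set_index("key")
%timeit df1_indexed.join(df2_indexed, how="inner")
```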

With the join() method, we notice an improvement of over 4 times relative to the standard merge() method in Pandas.

Here, the join() method expects you to first change the index of each DataFrame and set it to the specific column on which you wish to execute the join between the tables. This is done using the set_index() method in Pandas, as shown above.

If you want to execute a join condition on multiple columns, you can do that too using the join() method. First, pass the columns you wish to execute the join condition on as a list to the set_index() method. Then, call the join() method as before. This is demonstrated below:
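A sketch, assuming both DataFrames contain two hypothetical key columns, key1 and key2:

```python
# Set a MultiIndex on the join columns, then join on the index
left = df1.set_index(["key1", "key2"])
right = df2.set_index(["key1", "key2"])

result = left.join(right, how="inner")
```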

Key Takeaways/Final Thoughts

  1. While performing joins, always change the index of both the DataFrames and set it to the column(s) you want to execute the join condition on.

#4 Value_counts() vs GroupBy()

We use value_counts() in Pandas to find the frequency of individual elements in a Series. For instance, consider the dummy employee DataFrame we used in Section 2.

The first five rows of the dummy dataset (Image by author)

We can find the number of employees belonging to each company in this dataset using the value_counts() method as follows:
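A sketch, assuming the Company Name column from the dummy dataset:

```python
df["Company Name"].value_counts()
```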

Similar frequency calculation can also be done using groupby(). The code below demonstrates that:
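One equivalent groupby() formulation (a sketch; the original notebook may differ):

```python
df.groupby("Company Name").size()
```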

The output of value_counts() is arranged in descending order of frequencies. On the other hand, the output of size() on groupby() is sorted on the index column, which in this case is Company Name.

Assuming we are not bothered with how the output is arranged or sorted, we can measure the difference in run-time of the two methods to obtain the desired frequency as follows:
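For instance, with the %timeit magic in a Jupyter Notebook:

```python
%timeit df["Company Name"].value_counts()
%timeit df.groupby("Company Name").size()
```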

Even though both methods essentially do the same thing (if we ignore the order of the output), there is a significant run-time difference between the two: groupby() is roughly 1.5 times slower than value_counts().

Things get even worse when you want to obtain normalized frequencies, which denote the percentage/fraction of individual elements in the series. The run-time, in this case, is compared below:
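A sketch of the two normalized variants (the groupby() version here divides the group sizes by the total row count; the original notebook may use a different formulation):

```python
%timeit df["Company Name"].value_counts(normalize=True)
%timeit df.groupby("Company Name").size() / len(df)
```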

Once again, although both methods do the same thing, there is a significant run-time difference between the two: groupby() is close to 2 times slower than value_counts().

Key Takeaways/Final Thoughts

  1. For frequency-based measures, prefer using value_counts() instead of groupby().
  2. value_counts() can be used on multiple columns at once. Therefore, if you want to compute frequencies over a combination of values from multiple columns, do that with value_counts() instead of groupby().

#5 Iterating over a DataFrame

Looping or iterating over a DataFrame is the process of visiting every row individually and performing some pre-defined operation on each record. Although the best thing in such cases is to avoid looping altogether and prefer a vectorized approach, there might be situations where looping is necessary.

There are three common methods in Pandas through which iteration is possible. Below, we’ll discuss them and compare their run-times on the employee dummy dataset used in the sections above. To revisit, the image below shows the first five rows of the DataFrame.

The first five rows of the dummy dataset (Image by author)

The three methods to loop over a DataFrame are:

  1. Iterate using range(len(df)).
  2. Iterate using iterrows().
  3. Iterate using itertuples().

I have implemented three functions in the code block below, each of which utilizes one of these methods. The objective of each function is to calculate the mean salary of all employees in the DataFrame. We also measure the run-time of each of these methods on the same DataFrame below.
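A sketch of the three functions, assuming the salary column is named Salary as in the dummy dataset above:

```python
def mean_salary_range(df):
    # Method 1: positional indexing with .iloc inside range(len(df))
    total = 0
    for i in range(len(df)):
        total += df.iloc[i]["Salary"]
    return total / len(df)

def mean_salary_iterrows(df):
    # Method 2: iterrows() yields (index, row-as-Series) pairs
    total = 0
    for _, row in df.iterrows():
        total += row["Salary"]
    return total / len(df)

def mean_salary_itertuples(df):
    # Method 3: itertuples() yields lightweight namedtuples
    total = 0
    for row in df.itertuples(index=False):
        total += row.Salary
    return total / len(df)

%timeit mean_salary_range(df)
%timeit mean_salary_iterrows(df)
%timeit mean_salary_itertuples(df)
```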

Method 1: Iterate using range(len(df))

The average run-time to iterate over 4 million records is 46.1 ms.

Method 2: Iterate using iterrows()

The iterrows() method provides a substantial improvement in the iteration process, reducing the run-time by 2.5 times from 46.1 ms to 18.2 ms.

Method 3: Iterate using itertuples()

The itertuples() method turns out to be even better than iterrows(), reducing the run-time further by over 23 times from 18.2 ms to 773 µs.

Key Takeaways/Final Thoughts

  1. First, you should avoid introducing for-loops in your code to iterate over a DataFrame. Think of a vectorized solution if possible.
  2. If vectorization is not possible, leverage the pre-implemented methods in Pandas for iteration, such as itertuples() and iterrows().

In this post, I discussed five incredible optimization techniques in Pandas, which you can directly leverage in your next data science project. In my opinion, the areas I have discussed are subtle ways to improve run-time that are often overlooked when seeking optimization. Nonetheless, I hope this post gave you an insightful understanding of these day-to-day Pandas functions.

If you enjoyed reading this post, I hope you would like the following posts too:

Thanks for reading.

