Never Worry About Optimization. Process GBs of Tabular Data 25x Faster With No-Code Pandas
by Avi Chawla, Nov 2022
No more run-time and memory optimization, let’s get straight to work
Pandas makes the task of analyzing tabular datasets a breeze. Its sleek API offers a wide range of functionality that covers almost every tabular data use case.
However, it's only when you scale up that you experience the profound limitations of Pandas. I have talked about this before in the blog below:
In a nutshell, almost all of Pandas' limitations stem from its single-core computational model.
In other words, even if your CPU has multiple cores available (or idle), Pandas always relies on a single core, which limits its performance.
Moreover, Pandas DataFrames are inherently bulky: Pandas does not automatically optimize the data types of a DataFrame's columns. Thus, if you have to deal with a large dataset (with size in the GBs):
- You should load only a chunk of data into memory at a time, process it, discard it, and load the next chunk, or,
- If, for some reason, you need the whole dataset in memory at once, you should adjust the column data types so it fits.
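As a quick illustration, here is a minimal sketch of both approaches in Pandas. I'm using a tiny in-memory CSV with made-up column names as a stand-in for a multi-GB file:

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for a multi-GB file (contents are made up).
csv_data = "user_id,country,score\n1,MY,10\n2,US,20\n3,MY,30\n4,IN,40\n"

# Approach 1: process the file in chunks instead of loading it all at once.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2):
    total_rows += len(chunk)  # replace with any per-chunk processing

# Approach 2: shrink the in-memory footprint by downcasting data types.
df = pd.read_csv(io.StringIO(csv_data))
df["score"] = pd.to_numeric(df["score"], downcast="integer")  # int64 -> int8
df["country"] = df["country"].astype("category")  # strings -> integer codes
```

With a real file, you would pass its path to read_csv and use a much larger chunksize.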
Also, data is often split across multiple CSV files. Thus, if you want to analyze the entire data together, you first have to merge the files and then process the combined data.
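The merging step itself is a one-liner in Pandas. A minimal sketch, again with tiny in-memory CSVs standing in for files on disk:

```python
import io

import pandas as pd

# Two small in-memory CSVs standing in for data split across files (made up).
part1 = io.StringIO("user_id,country\n1,MY\n2,US\n")
part2 = io.StringIO("user_id,country\n3,MY\n4,IN\n")

# Read each piece and stack them into one DataFrame before analysis.
df = pd.concat((pd.read_csv(p) for p in (part1, part2)), ignore_index=True)
```

With real files, you would iterate over the paths (e.g. from glob.glob) instead of StringIO objects.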
There are a few more approaches, which I have discussed in one of my previous blogs:
Unfortunately, you are out of luck if you need the whole dataset in memory but, even after optimizing the memory usage, still cannot load it.
Even if you somehow succeed in loading the dataset, there’s no guarantee that you will be able to process it.
As intermediate computations also require memory, this may eventually lead to an out-of-memory error.
Back to square one.
I am sure many of you have been there. I know it is frustrating, especially when your objective is to conduct a quick analysis, such as plotting some bar graphs, estimating group statistics, etc.
Also, honestly speaking, unless you are building a data science pipeline intended to serve an end-user, why should a Data Scientist spend hours optimizing memory usage? This, in my opinion, inhibits their productivity.
I hope by this point, you understand that the problem is real.
To this end, let’s discuss a potential solution — Gigasheet, which in my opinion, is a revolutionary alternative to Pandas, not just for large datasets but for small datasets too.
I covered a detailed overview of transitioning from Pandas to “No-code Pandas” using Gigasheet in the blog below.
The focus of this blog is, however, slightly different.
In this blog, I will briefly demonstrate how challenging it is to work with large datasets in Pandas and how Gigasheet makes it a piece of cake.
Additionally, the two biggest advantages that come with a tool like Gigasheet are:
- You don’t have to write any code.
- You don’t have to worry about run-time or memory optimization.
That’s amazing. Let’s begin 🚀!
For demonstration purposes, I will use a dummy dataset created using Faker. As shown below, it takes up a little over 9 GB of disk space.
The first five rows are shown below:
Also, as shown below, the dataset has just over 30M rows and 25 columns.
Next, let’s perform some common operations in Pandas, measure their run-time, and compare that to Gigasheet.
As Gigasheet is a no-code tool, measuring its run-time is a bit challenging. Therefore, for comparison, I will use the duration of the screen recordings embedded here to approximate Gigasheet's run-time.
#1 Filtering
Filtering is a common operation in tabular data analysis. Let's filter the DataFrame on the Country_Code column and select all rows where the value is MY.
→ Pandas
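The Pandas version is a standard boolean-mask filter. A minimal sketch on a tiny stand-in DataFrame (the values are made up; only the Country_Code column name comes from the dataset):

```python
import pandas as pd

# A tiny stand-in for the real 30M-row dataset (values are made up).
df = pd.DataFrame({
    "Name": ["Ann", "Ben", "Cho", "Dev"],
    "Country_Code": ["MY", "US", "MY", "IN"],
})

# Boolean-mask filtering: keep only the rows where Country_Code is "MY".
filtered = df[df["Country_Code"] == "MY"]
```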
→ Gigasheet
Filtering in Gigasheet took just over two seconds. Compared to Pandas, which took close to 25 seconds, Gigasheet is ~92% faster.
#2 Sorting
Next, let's sort the DataFrame on the BBAN column in ascending order.
→ Pandas
Here, we can use the sort_values() method.
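A minimal sketch on a tiny stand-in DataFrame (made-up values; only the BBAN column name comes from the dataset):

```python
import pandas as pd

# A tiny stand-in for the real dataset (values are made up).
df = pd.DataFrame({
    "Name": ["Ann", "Ben", "Cho"],
    "BBAN": ["ZXCV123", "ASDF456", "QWER789"],
})

# Sort the whole DataFrame by the BBAN column, ascending (the default).
df_sorted = df.sort_values("BBAN", ascending=True, ignore_index=True)
```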
→ Gigasheet
The steps are demonstrated below:
Sorting in Gigasheet took a little over 5 seconds. On the other hand, Pandas took close to 2 minutes — a performance gain of 96%.
#3 Grouping
Next, let's group the DataFrame on the Country column and find two things:
- The number of records in each group.
- The number of unique Zip_Codes in each group.
→ Pandas
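For reference, the operation itself looks like this in Pandas. I'm showing it on a tiny stand-in DataFrame with made-up values; on the full 30M-row dataset, this is the step that ran out of memory:

```python
import pandas as pd

# A tiny stand-in for the real dataset (values are made up).
df = pd.DataFrame({
    "Country": ["Malaysia", "Malaysia", "India", "India", "India"],
    "Zip_Codes": ["50000", "50000", "110001", "110002", "110001"],
})

# Per country: the number of records and the number of distinct zip codes.
stats = df.groupby("Country").agg(
    records=("Zip_Codes", "size"),
    unique_zips=("Zip_Codes", "nunique"),
)
```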
Recalling what we discussed in the Introduction:
Even if you somehow succeed in loading the dataset, there’s no guarantee that you will be able to process it.
→ Gigasheet
Gigasheet computes the required aggregations without any hassle.
Yet again, Gigasheet emerges as the clear winner.
In this blog, I demonstrated how you can leverage Gigasheet and perform three typical operations on a large dataset.
While these operations took a fair amount of time to execute in Pandas, the same operations in Gigasheet were highly efficient.
In fact, with Pandas, we never obtained the results for the grouping operation, as it crashed the Jupyter kernel. On the other hand, Gigasheet, doing justice to its name, handled everything with ease.
Before wrapping up, I want to add that I am not claiming Gigasheet is (or will be) the ultimate replacement for Pandas.
One can do many more things with Pandas or Excel, which aren’t possible with Gigasheet (yet). However, besides being no-code, its ability to easily handle large datasets makes it an absolute winner in such situations.
Also, the biggest issue with Excel is its maximum row limit (1,048,576 rows), which rules it out for large-scale data analysis.
In fact, even if you are working with moderately sized datasets, Gigasheet can come in handy there too. I wrote a detailed article recently on the transition from Pandas to Gigasheet, which you can read here:
As always, thanks for reading!