
Measuring The Speed of New Pandas 2.0 Against Polars and Datatable — Still Not Good Enough | by Bex T. | Mar, 2023



Image by author via Midjourney

People have been complaining about Pandas’ speed ever since they tried reading their first gigabyte-sized dataset with read_csv and realized they had to wait for – gasp – five seconds. And yes, I was one of those complainers.

Five seconds might not sound like a lot, but when loading the dataset alone takes that long, subsequent operations usually take about as long. And since speed is essential in quick and dirty data exploration, the waiting gets frustrating fast.

For this reason, folks at PyData recently announced the planned release of Pandas 2.0 with the freshly minted PyArrow backend. For those totally unaware, PyArrow, on its own, is a nifty little library designed for high-performance, memory-efficient manipulation of arrays.

People sincerely hope the new backend will bring considerable speed-ups over the vanilla Pandas. This article will test that glimmer of hope by comparing the PyArrow backend against two of the fastest DataFrame libraries: Datatable and Polars.

Haven’t people already done this?

What is the point of doing this benchmark when H2O currently runs the popular Database-like Ops Benchmark, which measures the computation speed of almost 15 libraries on three data manipulation operations over three different dataset sizes? My benchmark couldn’t possibly be as complete.

Well, for one, the benchmark didn’t include Pandas with the PyArrow backend and was last updated in 2021, which was ages ago.

Secondly, the benchmark was run on a monster of a machine with 40 CPU cores hopped up on 128 GB RAM and 20 GB GPU to boot (cuDF, anyone?). The general populace doesn’t usually have access to such machines, so it is important to see the differences between the libraries on everyday devices like mine. It features a modest CPU with a dozen cores and 32 gigs of RAM.

Lastly, I advocate for total transparency in the process, so I will explain the benchmark code in detail and present it as a GitHub Gist to run on your own machine.

Installation and setup

We start by installing the RC (release candidate) of Pandas 2.0 along with the latest versions of PyArrow, Datatable, and Polars.

pip install -U "pandas==2.0.0rc0" pyarrow datatable polars

import datatable as dt
import pandas as pd
import polars as pl

dt.__version__
'1.0.0'

pd.__version__
'2.0.0rc0'

pl.__version__
'0.16.14'

I created a synthetic dataset with NumPy and Faker libraries to simulate typical features in a census dataset and saved it in CSV and Parquet formats. Here are the paths to the files:

from pathlib import Path

data = Path("data")
data_csv = data / "census_data.csv"
data_parquet = data / "census_data.parquet"

Check out this GitHub gist to see the code that generated the data.

There are 50 million rows of seven features, bringing the file size to about 2.5 GB.
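The actual generator lives in the Gist. Purely as an illustration of the idea, here is a minimal sketch that writes a small census-like CSV. It swaps NumPy and Faker for the standard library so it runs anywhere, and every column name except "age" (which the benchmark sorts on later) is a hypothetical stand-in, not the real schema:

```python
import csv
import io
import random

# Hypothetical columns; the article only confirms that an "age" column exists.
COLUMNS = ["age", "income", "education", "state"]
EDUCATION = ["primary", "secondary", "bachelor", "master"]
STATES = ["CA", "NY", "TX", "WA"]


def make_rows(n_rows, seed=0):
    """Yield n_rows of random, census-like records."""
    rng = random.Random(seed)
    for _ in range(n_rows):
        yield [
            rng.randint(18, 90),
            round(rng.uniform(10_000, 200_000), 2),
            rng.choice(EDUCATION),
            rng.choice(STATES),
        ]


def write_census_csv(target, n_rows, seed=0):
    """Write a header plus n_rows of synthetic records to a file-like object."""
    writer = csv.writer(target)
    writer.writerow(COLUMNS)
    writer.writerows(make_rows(n_rows, seed))


# Write to an in-memory buffer here; pass an open file to write to disk instead.
buffer = io.StringIO()
write_census_csv(buffer, n_rows=1_000)
lines = buffer.getvalue().splitlines()
```

Generating rows lazily keeps memory flat, which matters once n_rows climbs toward tens of millions.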

Benchmark results

Before showing the code, let’s see the good stuff — the benchmark results:

[Bar chart: runtime in seconds for each function, grouped by library. Image by author]

Right off the bat, we can see that PyArrow Pandas comes in last (or second to last in groupby) across all categories.

Please, don’t mistake the nonexistent bars in reading and writing parquet categories for 0 runtimes. Those operations aren’t supported in Datatable.

In other categories, Datatable and Polars share the top spot, with Polars having a slight edge.

Writing to CSVs has always been a slow process for Pandas, and I guess a new backend isn’t enough to change that.

Should you switch?

So, time for the million-dollar question — should you switch to the faster Polars or Datatable?

And the answer is the one I so hate: “it depends.” Are you willing to sacrifice Pandas’ almost two decades of maturity and, let’s admit it, stupidly easy and familiar syntax for superior speed?

In that case, keep in mind that the time you spend learning the syntax of a new library may balance out its performance gains.

But, if all you do is work with massive datasets, learning either of these fast libraries may be well worth the effort in the long run.

If you decide to stick with Pandas, give the Enhancing Performance page of the Pandas user guide a thorough, attentive read. It outlines some tips and tricks to add extra fuel to the Pandas engine without resorting to third-party libraries.

Also, if you are stuck with a large CSV file and still want to use Pandas, you should memorize the following code snippet:

import datatable as dt
import pandas as pd

df = dt.fread("data.csv").to_pandas()

It reads the file with the speed of Datatable, and the conversion to a Pandas DataFrame is almost instantaneous.

Benchmark code

OK, let’s finally see the code.

The first thing to do after importing the libraries is to define a DataFrame to store the benchmark results. This will make things much easier during plotting:

import time

import datatable as dt
import pandas as pd
import polars as pl

# Define a DataFrame to store the results
results_df = pd.DataFrame(
    columns=["Function", "Library", "Runtime (s)"]
)

It has three columns, one for the task name, another for the library name, and another for storing the runtime.

Then, we define a timer decorator that performs the following tasks:

  1. Measures the runtime of the decorated function.
  2. Extracts the function’s name and the value of its library parameter.
  3. Stores the runtime, function name, and library name into the passed results DataFrame.

def timer(results: pd.DataFrame):
    """
    A decorator to measure the runtime of the passed function.
    It stores the runtime, the function name, and the passed
    function's "library" parameter into the `results` DataFrame
    as a single row.
    """
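Only the signature and docstring are shown above; the full body is in the Gist. As a rough sketch of how such a decorator could work, the version below follows the three steps listed earlier. It is not the author's implementation: it appends rows to a plain list instead of a DataFrame so that it runs with the standard library alone, and the read_csv body is a stand-in that sleeps instead of reading a file.

```python
import functools
import inspect
import time


def timer(results: list):
    """Decorator factory: time the wrapped call and append one row to `results`."""

    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = func(*args, **kwargs)
            runtime = time.perf_counter() - start

            # Resolve `library` whether it was passed positionally or by keyword.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            results.append(
                {
                    "Function": func.__name__,
                    "Library": bound.arguments.get("library"),
                    "Runtime (s)": runtime,
                }
            )
            return output

        return wrapper

    return decorator


rows = []  # assumption: a list of dicts stands in for the results DataFrame


@timer(rows)
def read_csv(path, library):
    # Stand-in body: sleep briefly instead of reading a real file.
    time.sleep(0.01)
    return f"{library}:{path}"


read_csv("data.csv", library="polars")
```

The collected rows can be turned into a DataFrame for plotting with pd.DataFrame(rows). Using time.perf_counter rather than time.time gives a monotonic, high-resolution clock, which is the usual choice for measuring elapsed time.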

The idea is to define a single general function like read_csv that reads CSV files with any of the three libraries, selected through a parameter like library:

# Task 1: Reading CSVs
@timer(results_df)
def read_csv(path, library):
    if library == "pandas":
        return pd.read_csv(path, engine="pyarrow")
    elif library == "polars":
        return pl.read_csv(path)
    elif library == "datatable":
        return dt.fread(str(path))

Notice how we are decorating the function with timer(results_df).

We define functions for the rest of the tasks in a similar way (see the function bodies from the Gist):

# Task 2: Writing to CSVs
@timer(results_df)
def write_to_csv(df, path, library):
    ...

# Task 3: Reading Parquet
@timer(results_df)
def read_parquet(path, library):
    ...

# Task 4: Writing to Parquet
@timer(results_df)
def write_to_parquet(df, path, library):
    ...

# Task 5: Sort
@timer(results_df)
def sort(df, column, library):
    ...

# Task 6: Groupby
@timer(results_df)
def groupby(df, library):
    ...

Then, we run the functions for each of the libraries:

from pathlib import Path

# Define the file paths
data = Path("data")
data_csv = data / "census_data.csv"
data_parquet = data / "census_data.parquet"

# libraries = ["pandas", "polars", "datatable"]
l = "datatable"

# Task 3/4
df = read_parquet(data_parquet, library=l)
write_to_parquet(df, data_parquet, library=l)

# Task 1/2
df = read_csv(data_csv, library=l)
write_to_csv(df, data_csv, library=l)

# Task 5/6
sort(df, "age", library=l)
groupby(df, library=l)

To escape memory errors, I avoided loops and ran the benchmark in a Jupyter Notebook three times, changing the l variable.

Then, we create the figure of the benchmark with the following simple bar chart in lovely Seaborn:

import seaborn as sns

g = sns.catplot(
    data=results_df,
    kind="bar",
    x="Function",
    y="Runtime (s)",
    hue="Library",
)

[Bar chart: runtime in seconds for each function, grouped by library. Image by author]

Things are changing

For years now, Pandas has stood on the shoulders of NumPy as it boomed in popularity. NumPy was kind enough to lend its features for fast computations and array manipulations.

But this approach was limited by NumPy’s terrible support for text and missing values. Pandas couldn’t fall back on native Python data types like lists and dictionaries either, because they would be laughably slow at massive scale.

So, Pandas has been moving away from NumPy on the sly for a few years now. For example, it introduced PyArrow datatypes for strings in 2020 already. It has been using extensions written in other languages, such as C++ and Rust, for other complex data types like dates with time zones or categoricals.

Now, Pandas 2.0 has a fully-fledged backend to support all data types with Apache Arrow’s PyArrow implementation. Apart from the apparent speed improvements, it provides much better support for missing values, interoperability, and a wider range of data types.

So, even though the backend will still be slower than other DataFrame libraries, I am eagerly awaiting its official release. Thank you for reading!

Here are a few pages to learn more about Pandas 2.0 and the PyArrow backend:

