Why Pandas-like Interfaces are Sub-optimal for Distributed Computing | by Kevin Kho | Jun, 2022

By Jessie Hobb On Jun 8, 2022

A deep look at the assumptions of the Pandas interface

This is a written version of our most recent PyCon talk.

Over the last year and a half, we’ve talked to data practitioners who want to move Pandas code to either Dask or Spark to take advantage of distributed computing resources. Their workloads were becoming too compute-intensive, or their datasets no longer fit in memory for Pandas, which only runs on a single machine.

One of the recurring themes in our conversations was tools like Koalas (since renamed to PySpark Pandas) and Modin. These aim to bring workloads to Dask, Ray, or Spark through the same Pandas interface, for the most part just by changing the import statement.

For example, the PySpark Pandas drop-in replacement would be:

# import pandas as pd
import pyspark.pandas as pd

and supposedly, everything should run on Spark. There are already some blogs that show this isn’t entirely true (as of May 2022). There are some hiccups here and there, but we’re not here to talk about slight discrepancies. This post is about fundamental differences that will always exist because of the nuances of distributed computing that Pandas isn’t compatible with.

Pandas-like frameworks are popular because a lot of data scientists are resistant to change (I’ve been there myself!). But just changing the import statement lets users avoid understanding what is really happening in the distributed system, and that lack of understanding leads to ineffective usage.

We’ll see that the attempt to achieve 1:1 parity with the Pandas API will require compromises on performance and functionality.

We created a DataFrame with the following structure: columns a and b are string columns, and columns c and d are numerical. The DataFrame has 1 million rows, though we change the size in some cases.
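
As a rough sketch of a frame with that shape in plain Pandas, something like the following would do. The helper name make_frame, the pool of letters, and the value ranges are our own placeholders, not the exact data used in the talk.

```python
import numpy as np
import pandas as pd


def make_frame(n_rows: int = 1_000_000, seed: int = 0) -> pd.DataFrame:
    """Build a frame with two string columns (a, b) and two numeric columns (c, d)."""
    rng = np.random.default_rng(seed)
    letters = np.array(list("abcdefghij"))  # hypothetical value pool for a and b
    return pd.DataFrame(
        {
            "a": rng.choice(letters, size=n_rows),   # string column
            "b": rng.choice(letters, size=n_rows),   # string column
            "c": rng.random(n_rows),                 # floats in [0, 1)
            "d": rng.integers(0, 100, size=n_rows),  # integers in [0, 100)
        }
    )


# A small frame for a quick look; the post uses 1 million rows.
df = make_frame(1_000)
print(df.head())
```

Passing a different n_rows is how the size would be changed for the cases that need a smaller or larger frame.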

We will create this DataFrame in Pandas, Modin (on Ray), PySpark Pandas, and Dask. For each backend, we will time the operations of different cases. This should be clearer after the first issue is discussed.

One of the most used Pandas methods is iloc. It relies on an implicit global ordering of the data, which is why Pandas can quickly retrieve the rows at a given set of positions: it knows where in memory each row it needs is stored.
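
To make the dependence on global ordering concrete, here is a toy sketch in plain Python. It is not how Dask, Spark, or Modin implement positional access; the function name locate and the partition lengths are made up for illustration. The point is that once rows are split across partitions, finding row i means first knowing how many rows sit in every earlier partition, and in a distributed setting those counts may themselves require a pass over the data.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(position, partition_lengths):
    """Map a global row position to (partition index, offset inside that partition)."""
    total = sum(partition_lengths)
    if not 0 <= position < total:
        raise IndexError(f"position {position} out of range for {total} rows")
    # Cumulative end positions: every partition length must be known up front.
    ends = list(accumulate(partition_lengths))
    part = bisect_right(ends, position)
    start = ends[part - 1] if part else 0
    return part, position - start


# Hypothetical layout: 12 rows split into partitions of 4, 3, and 5 rows.
lengths = [4, 3, 5]
print(locate(5, lengths))  # (1, 1): second partition, second row
```

On a single machine the equivalent of partition_lengths is trivial, since all rows live in one contiguous structure. With partitioned data, the bookkeeping in locate is an extra step before any row can be fetched.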

Consider five data access cases. We’ll evaluate the speed of each operation relative to Case 1. We do not compare across frameworks; we want to see the performance profile of each framework on its own. Cases 3–5 access rows and columns by location, and Case 5 specifically accesses the middle of the DataFrame. We will run these five cases on Pandas, Modin, PySpark Pandas (formerly Koalas), and Dask.
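
The "relative to Case 1" measurement can be sketched with a small timing helper. This is only an illustration of the method: the helper names time_call and relative_timings are ours, and the two workloads are stand-ins in plain Python, not the five cases from the talk.

```python
import time


def time_call(fn, repeats=5):
    """Return the best wall-clock time of fn() over several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def relative_timings(cases, baseline):
    """Time every case and express each result as a multiple of the baseline case."""
    timings = {name: time_call(fn) for name, fn in cases.items()}
    base = timings[baseline]
    return {name: t / base for name, t in timings.items()}


# Placeholder workloads standing in for real DataFrame operations.
cases = {
    "case_1": lambda: sum(range(10_000)),
    "case_2": lambda: sum(range(100_000)),
}
ratios = relative_timings(cases, baseline="case_1")
print(ratios)
```

Each framework would get its own set of ratios, so the numbers describe how a framework’s operations compare with each other rather than which framework is faster.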

Image by author — comparison of data access

Written by Kevin Kho and Han Wang
