Python Pandas to Polars: Data Filtering | by Soner Yıldırım | Apr, 2023


Photo by Daphné Be Frenchie on Unsplash

I admire Pandas. I have been using it since the first day I started learning data science. Pandas has been more than enough for most of my tasks in data cleaning, preprocessing, and analysis.

The only issue I have with pandas arises when working with large datasets. Pandas does in-memory analytics, so its performance degrades when the data size becomes very large.

Another downside related to data size is that some operations make intermediate copies. Hence, the dataset needs to be considerably smaller than the available memory for pandas to work efficiently.

There are different alternatives to Pandas for such large datasets. One of the alternatives that has gained significant popularity recently is Polars.

Dozens of articles focus on the speed of polars compared to pandas, but few cover the practical side and explain how to perform common data cleaning and manipulation operations with polars.

In this series of articles, I will show you the polars versions of frequently used pandas functions. The first topic is data filtering operations. Before we start the examples, let’s briefly mention what polars has to offer.

Polars is a DataFrame library for Rust and Python.

  • Polars utilizes all available cores on your machine whereas pandas uses a single CPU core to execute the operations.
  • Polars is more lightweight than pandas and has no required dependencies, which makes it much faster to import. It takes 70 ms to import polars whereas it takes 520 ms for pandas.
  • Polars does query optimization to reduce unnecessary memory allocations. It is also able to process queries partially or entirely in a streaming fashion. As a result, polars can handle datasets that are larger than the available RAM in your machine.

We will go through several examples to learn how to filter polars DataFrames. We will also see the pandas versions of the same operations to make the transition from pandas to polars easier.

Let’s first create a DataFrame to work on. We will be using a sample dataset that I prepared with mock data. You can download it from my datasets repo.

# pandas
import pandas as pd

# read csv
df_pd = pd.read_csv("datasets/sales_data_with_stores.csv")

# display the first 5 rows
df_pd.head()

The first 5 rows of the pandas DataFrame (image by author)

# polars
import polars as pl

# read_csv
df_pl = pl.read_csv("datasets/sales_data_with_stores.csv")

# display the first 5 rows
df_pl.head()

The first 5 rows of the polars DataFrame (image by author)

Pandas and polars use identically named functions (read_csv and head) to read a csv file and display the first 5 rows of the DataFrame. Polars also shows the data types of the columns and the shape of the output, which I think is an informative add-on.

Example 1: Filter by numeric values

Let’s filter rows in which the cost is higher than 750.

# pandas
df_pd[df_pd["cost"] > 750]

# polars
df_pl.filter(pl.col("cost") > 750)

Since the pandas and polars versions return the same rows, I will show the output only once for each example.

(image by author)

Example 2: Multiple conditions

Both pandas and polars support filtering by multiple conditions. We can combine the conditions with “and” and “or” logic.

Let’s filter rows with a cost of more than 750 and a store value of Violet.

# pandas
df_pd[(df_pd["cost"] > 750) & (df_pd["store"] == "Violet")]

# polars
df_pl.filter((pl.col("cost") > 750) & (pl.col("store") == "Violet"))

(image by author)

Example 3: The isin method

The isin method of pandas can be used for comparing the row value to a list of values. It is quite useful when the condition consists of multiple values. The polars version of this method is “is_in”.

We can select the rows for product groups PG1, PG2, and PG5 as follows:

# pandas
df_pd[df_pd["product_group"].isin(["PG1", "PG2", "PG5"])]

# polars
df_pl.filter(pl.col("product_group").is_in(["PG1", "PG2", "PG5"]))

The first 5 rows of the output:

(image by author)

Example 4: Select a subset of columns

To select a subset of columns, we can pass a list of columns to both pandas and polars DataFrames as follows:

cols = ["product_code", "cost", "price"]

# pandas (both of the following do the job)
df_pd[cols]
df_pd.loc[:, cols]

# polars
df_pl.select(pl.col(cols))

The first 5 rows of the output:

(image by author)

Example 5: Select a subset of rows

We can use the loc or iloc indexers to select a subset of rows in pandas. In polars, square-bracket indexing works in a very similar way.

Here is a simple example that selects the rows at positions 10 through 19 (the end of the slice is excluded):

# pandas
df_pd.iloc[10:20]

# polars
df_pl[10:20]

To select the same rows but only the first three columns:

# pandas
df_pd.iloc[10:20, :3]

# polars
df_pl[10:20, :3]

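If we want to select columns by names, we can use the loc method in pandas. Unlike iloc, loc slices by label and includes the end label, so we use 10:19 to get the same ten rows as the polars version (assuming the default integer index).

# pandas
df_pd.loc[10:19, ["store", "product_group", "price"]]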

# polars
df_pl[10:20, ["store", "product_group", "price"]]

Example 6: Select columns by data type

We can also select columns that are of a particular data type. Let’s do an example that selects the columns with the 64-bit integer (i.e. int64) data type.

# pandas
df_pd.select_dtypes(include="int64")

# polars
df_pl.select(pl.col(pl.Int64))

The first 5 rows of the output:

(image by author)

We have done several examples to compare filtering operations between pandas and polars. In general, polars is quite similar to pandas but follows a Spark SQL-like approach in some cases. If you are familiar with data cleaning and manipulation with Spark SQL, you will recognize the similarities.

With that being said, considering the efficiency of polars when working with large datasets, it may soon become a strong candidate to replace pandas in data cleaning and manipulation tasks.

You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you already are, don’t forget to subscribe if you’d like to get an email whenever I publish a new article.

Thank you for reading. Please let me know if you have any feedback.

