
Using DuckDB with Polars. Learn how to use SQL to query your… | by Wei-Meng Lee | Apr, 2023



Photo by Hans-Jurgen Mager on Unsplash

In my previous few articles on data analytics, I talked about two important up-and-coming libraries that are currently gaining a lot of traction in the industry:

  • DuckDB — where you can query your dataset in-memory using SQL statements.
  • Polars — a much more efficient DataFrame library compared to the venerable Pandas library.

What about combining the power of these two libraries?

In fact, you can directly query a Polars dataframe through DuckDB, using SQL statements.

So what are the benefits of querying your Polars dataframe using SQL? Despite its ease of use, manipulating Polars dataframes still requires some practice and a relatively steep learning curve. But since most developers are already familiar with SQL, it is often more convenient to manipulate the dataframes directly using SQL. With this approach, developers get the best of both worlds:

  • the ability to query Polars dataframes using the library's native functions, or
  • the option to use SQL for cases where it is more natural and easier to extract the data they want.

In this article, I will give you some examples of how you can make use of SQL through DuckDB to query your Polars dataframes.

For this article, I am using Jupyter Notebook. Ensure that you have installed Polars and DuckDB using the following commands:

!pip install polars
!pip install duckdb

To get started, let’s create a Polars DataFrame by hand:

import polars as pl

df = pl.DataFrame(
    {
        'Model': ['iPhone X', 'iPhone XS', 'iPhone 12', 'iPhone 13',
                  'Samsung S11', 'Samsung S12', 'Mi A1', 'Mi A2'],
        'Sales': [80, 170, 130, 205, 400, 30, 14, 8],
        'Company': ['Apple', 'Apple', 'Apple', 'Apple',
                    'Samsung', 'Samsung', 'Xiao Mi', 'Xiao Mi'],
    }
)
df

Here’s how the dataframe looks:

All images by author

Say you now want to find all phones from Apple that have sales of more than 80. You can use the filter() function in Polars, like this:

df.filter(
    (pl.col('Company') == 'Apple') &
    (pl.col('Sales') > 80)
)

And the result looks like this:

Let’s now perform the same query we did in the previous section, except that this time we will use DuckDB with a SQL statement. But first, let’s select all the rows in the dataframe:

import duckdb

result = duckdb.sql('SELECT * FROM df')
result

You can directly reference the df dataframe from your SQL statement.

Using DuckDB, you issue a SQL statement using the sql() function. Alternatively, the query() function also works:

result = duckdb.query('SELECT * FROM df')

The result variable is a duckdb.DuckDBPyRelation object. Using this object, you can perform quite a number of different tasks, such as:

  • Getting the mean of the Sales column:
result.mean('Sales')
  • Describing the dataframe:
result.describe()
  • Applying a scalar function to columns in the dataframe:
result.apply("max", 'Sales,Company')
  • Reordering the dataframe:
result.order('Sales DESC')

But the easiest way to query the Polars DataFrame is to use SQL directly.

For example, if you want to get all the rows with sales greater than 80, simply use the sql() function with the SQL statement below:

duckdb.sql('SELECT * FROM df WHERE Sales > 80').pl()

The pl() function converts the duckdb.DuckDBPyRelation object to a Polars DataFrame. If you want to convert it to a Pandas DataFrame instead, use the df() function.

If you want to get all the rows whose model name starts with “iPhone”, then use the following SQL statement:

duckdb.sql("SELECT * FROM df WHERE Model LIKE 'iPhone%'").pl()

If you want all devices from Apple and Xiao Mi, then use the following SQL statement:

duckdb.sql("SELECT * FROM df WHERE Company = 'Apple' OR Company ='Xiao Mi'").pl()

The real power of using DuckDB with Polars DataFrame is when you want to query from multiple dataframes. Consider the following three CSV files from the 2015 Flights Delay dataset:

2015 Flights Delay dataset — https://www.kaggle.com/datasets/usdot/flight-delays. Licensing: CC0 (Public Domain)

  • flights.csv
  • airlines.csv
  • airports.csv

Let’s load them up using Polars:

import polars as pl

df_flights = pl.scan_csv('flights.csv')
df_airlines = pl.scan_csv('airlines.csv')
df_airports = pl.scan_csv('airports.csv')

display(df_flights.collect().head())
display(df_airlines.collect().head())
display(df_airports.collect().head())

The above statements use lazy evaluation to load the three CSV files. This ensures that no queries on the dataframes are executed until they have been optimized. The collect() function forces Polars to execute the query and load the CSV files into dataframes.

Here is how the df_flights, df_airlines, and df_airports dataframes look:

Suppose you want to count the number of times each airline has a delay and, at the same time, display the name of each airline. Here is the SQL statement you can use with the df_airlines and df_flights dataframes:

duckdb.sql('''
    SELECT count(df_airlines.AIRLINE) AS Count,
           df_airlines.AIRLINE
    FROM df_flights, df_airlines
    WHERE df_airlines.IATA_CODE = df_flights.AIRLINE
      AND df_flights.ARRIVAL_DELAY > 0
    GROUP BY df_airlines.AIRLINE
    ORDER BY Count DESC
''')

And here is the result:

If you want to count the number of airports in each state and sort the count in descending order, you can use the following SQL statement:

duckdb.sql('''
    SELECT STATE, count(*) AS AIRPORT_COUNT
    FROM df_airports
    GROUP BY STATE
    ORDER BY AIRPORT_COUNT DESC
''')

Finally, suppose you want to know which airline has the highest average delay. You can use the following SQL statement to calculate the various statistics, such as minimum arrival delay, maximum arrival delay, mean arrival delay, and standard deviation of arrival delay:

duckdb.sql('''
    SELECT AIRLINE,
           min(ARRIVAL_DELAY), max(ARRIVAL_DELAY),
           mean(ARRIVAL_DELAY), stddev(ARRIVAL_DELAY)
    FROM df_flights
    GROUP BY AIRLINE
    ORDER BY mean(ARRIVAL_DELAY)
''')

Based on the mean arrival delay, we can see that the AS airline has the shortest delay (since the value is negative, it arrives early most of the time!) and the NK airline has the longest delay. Want to know which airline AS is? Try it out using what you have just learned! I will leave it as an exercise; the answer is at the end of this article.

If you like reading my articles and they have helped your career/study, please consider signing up as a Medium member. It is $5 a month, and it gives you unlimited access to all the articles (including mine) on Medium. If you sign up using the following link, I will earn a small commission (at no additional cost to you). Your support means that I will be able to devote more time to writing articles like this.

In this short article, I illustrated how DuckDB and Polars can be used together to query your dataframes. Using both libraries gives you the best of both worlds — a familiar query language (SQL) applied to an efficient dataframe library. Go ahead and try it out on your own dataset, and share with us how it has helped your data analytics process.

Answer to quiz:

duckdb.sql("SELECT AIRLINE from df_airlines WHERE IATA_CODE = 'AS'")

