4 Ways to Write Data To Parquet With Python: A Comparison

by Antonello Benedetto | Mar, 2023


Learn How To Efficiently Write Data To Parquet Format Using Pandas, FastParquet, PyArrow or PySpark


In today’s data-driven world, efficient storage and processing of large datasets is a crucial requirement for many businesses and organisations. This is where the Parquet file format comes into play.

Parquet is a columnar storage format that is designed to optimise data processing and querying performance while minimising storage space.

It is particularly well-suited for use cases where data needs to be analysed quickly and efficiently, such as in data warehousing, big data analytics, and machine learning applications.

In this article, I will demonstrate how to write data to Parquet files in Python using four different libraries: Pandas, FastParquet, PyArrow, and PySpark.

In particular, you will learn how to:

  • retrieve data from a database, convert it to a DataFrame, and use each one of these libraries to write records to a Parquet file.
  • write data to Parquet files in batches, to optimise performance and memory usage.

By the end of this article, you’ll have a thorough understanding of how to use Python to write Parquet files and unlock the full power of this efficient storage format.

The dataset used in this tutorial includes mock data about daily account balances in different currencies and for different companies.

The data has been generated using a recursive Python function and then inserted into a Snowflake DB table.

Then, a connection to the DB has been established using either the Python snowflake.connector package or the native PySpark connectivity tools (paired with the required jars) to retrieve the dataset and convert it to DF format.
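To give an idea of what that step looks like, here is a minimal sketch of the snowflake.connector route; the credentials and the table name are placeholders rather than values taken from the linked code:

# Minimal sketch: connect to Snowflake and pull the mock dataset into a pandas DF.
# Credentials and the table name (DAILY_BALANCES) are placeholders.
import snowflake.connector
import pandas as pd

conn = snowflake.connector.connect(
    user='<user>',
    password='<password>',
    account='<account>',
    warehouse='<warehouse>',
    database='<database>',
    schema='<schema>'
)

db_cursor_sf = conn.cursor()
db_cursor_sf.execute('SELECT * FROM DAILY_BALANCES')

# Build the DF by reusing the column names exposed by the cursor description
df = pd.DataFrame(db_cursor_sf.fetchall(),
                  columns=[desc[0] for desc in db_cursor_sf.description])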

The code to achieve the steps above is available here, whereas the first few rows of the DF are displayed below:

Mock data generated by the author, fetched and converted to DF.

Let’s now describe four different strategies to write this dataset to parquet format using Python.

As you can see, the common starting point for all methods is having data already converted to either a pandas or PySpark DF.

This will make the code much more concise and readable, and your life easier.

Probably the simplest way to write a dataset to parquet files is by using the to_parquet() method in the pandas module:

# METHOD 1 - USING PLAIN PANDAS
import logging
import pandas as pd

parquet_file = 'example_pd.parquet'

# df is the pandas DF prepared in the previous section
df.to_parquet(parquet_file, engine='pyarrow', compression='gzip')

logging.info('Parquet file named "%s" has been written to disk', parquet_file)
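As a quick sanity check (not part of the original snippet), the file can be read straight back with pandas:

# Optional check: read the parquet file back to confirm it round-trips
df_check = pd.read_parquet(parquet_file, engine='pyarrow')
print(df_check.head())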

In this case, engine = 'pyarrow' has been used as PyArrow is a high-performance Python library for working with Apache Arrow data. It provides a fast and memory-efficient implementation of the Parquet file format, which can improve the write performance of parquet files.

In addition, PyArrow supports a range of compression algorithms, including gzip, snappy and LZ4.

Throughout this tutorial, let’s pretend that the goal was to achieve high compression ratios (that deliver smaller Parquet files in size), even at the cost of slower compression and decompression speeds. This is why compression = 'gzip' has been used.
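If you wanted to verify that trade-off on your own data, a quick, hypothetical comparison could look like this (the output file names are placeholders, and df is the DF prepared earlier):

# Compare on-disk size for a few compression codecs supported by pyarrow
import os

for codec in ['gzip', 'snappy', None]:
    out_file = f'example_{codec or "uncompressed"}.parquet'
    df.to_parquet(out_file, engine='pyarrow', compression=codec)
    print(f'{codec or "none"}: {os.path.getsize(out_file) / 1024:.1f} KiB')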

Another popular way to write datasets to parquet files is by using the fastparquet package. In the simplest form, its write() method accepts a pandas DF as an input dataset and can compress it using a variety of algorithms:

# METHOD 2.1 - USING PANDAS & FASTPARQUET (WRITE IN FULL)
import logging
import pandas as pd
import fastparquet as fp

parquet_file = 'example_pd_fp_1.parquet'

# df is the pandas DF prepared in the previous section
fp.write(parquet_file, df, compression='GZIP')

logging.info('Parquet file named "%s" has been written to disk', parquet_file)

But then, if fastparquet works with pandas DFs anyway, why shouldn’t Method #1 be used instead?

Well, there are at least a couple of reasons to use the fastparquet package:

  • it is designed to be a lightweight package, known for its fast write performance and efficient memory usage.
  • it provides a pure Python implementation of the parquet format and therefore offers much more flexibility and options to developers.

For instance, by using the fp.write() method, you can specify the option append=True, something that is not yet possible through the to_parquet() method.

This option becomes particularly handy when the source dataset is too large to be held in memory in one go, so that, to avoid OOM errors, you decide to write it to the parquet file in batches.

The code below is a valid example of how to write large datasets to a parquet file in batches, by combining the power of pandas and fastparquet:

# METHOD 2.2 - USING PANDAS & FASTPARQUET (WRITE IN BATCHES)
import logging
import pandas as pd
import fastparquet as fp

# SETTING BATCH SIZE
batch_size = 250

# db_cursor_sf (Snowflake cursor) and schema (pandas dtype mapping) are
# defined in the connection code linked earlier
data = db_cursor_sf.fetchmany(batch_size)
columns = [desc[0] for desc in db_cursor_sf.description]

# CREATES INITIAL DF INCLUDING 1ST BATCH
df = pd.DataFrame(data, columns=columns)
df = df.astype(schema)

# WRITES TO PARQUET FILE
parquet_file = 'example_pd_fp_2.parquet'
fp.write(parquet_file, df, compression='GZIP')

total_rows = df.shape[0]
total_cols = df.shape[1]

# SEQUENTIALLY APPEND TO PARQUET FILE
while True:
    data = db_cursor_sf.fetchmany(batch_size)
    if not data:
        break
    df = pd.DataFrame(data, columns=columns)
    df = df.astype(schema)

    total_rows += df.shape[0]

    # Write the rows to the Parquet file
    fp.write(parquet_file, df, append=True, compression='GZIP')

logging.info('Full parquet file named "%s" has been written to disk '
             'with %s total rows', parquet_file, total_rows)

The strategy used in the code above is to:

  • Define a batch_size: for this tutorial it has been set to only 250 rows, but in production it can easily be increased to several million rows, depending on the memory available on the worker that will perform the job.
  • Fetch the very first batch of rows from the database using fetchmany() instead of fetchall(), use this dataset to create a pandas DF, and write that DF to a parquet file.
  • Instantiate a WHILE loop that keeps fetching data from the DB in batches, converting each batch to a pandas DF and appending it to the initial parquet file.

The WHILE loop will keep running until the last row has been written to the parquet file. Since batch_size is a variable that can be tuned to the use case, this should be considered a much more “controlled” and memory-efficient method to write files with Python.

Earlier in the tutorial, it was mentioned that pyarrow is a high-performance Python library that also provides a fast and memory-efficient implementation of the parquet format.

Its power can be used indirectly (by setting engine = 'pyarrow' like in Method #1) or directly by using some of its native methods.

For example, you could easily replicate the code that writes to a parquet file in batches (Method #2.2), by using the ParquetWriter() class:

# EXAMPLE 3 - USING PANDAS & PYARROW
import logging
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# SETTING BATCH SIZE
batch_size = 250

# DECLARING THE PYARROW SCHEMA
parquet_schema = pa.schema([('as_of_date', pa.timestamp('ns')),
                            ('company_code', pa.string()),
                            ('fc_balance', pa.float32()),
                            ('fc_currency_code', pa.string()),
                            ('gl_account_number', pa.string()),
                            ('gl_account_name', pa.string())
                            ])

parquet_file = 'example_pa.parquet'

total_rows = 0

logging.info('Writing to file %s in batches...', parquet_file)

# db_cursor_pg (DB cursor) and schema (pandas dtype mapping) are defined
# in the connection code linked earlier
with pq.ParquetWriter(parquet_file, parquet_schema, compression='gzip') as writer:
    while True:
        data = db_cursor_pg.fetchmany(batch_size)
        if not data:
            break
        df = pd.DataFrame(data, columns=list(parquet_schema.names))
        df = df.astype(schema)

        # Convert the pandas DF to an Arrow table before writing it
        table = pa.Table.from_pandas(df)
        total_rows += table.num_rows

        writer.write_table(table)

logging.info('Full parquet file named "%s" has been written to disk '
             'with %s total rows', parquet_file, total_rows)

Note that, since pyarrow takes advantage of the Apache Arrow format behind the scenes, the ParquetWriter requires a pyarrow schema as an argument (whose datatypes are fairly intuitive and somewhat similar to their pandas counterparts).
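If you prefer not to declare the schema by hand, one possible shortcut (an assumption on my part, not used in the snippet above) is to derive it from the first pandas batch:

# Derive the pyarrow schema from an existing pandas DF instead of typing it out;
# preserve_index=False keeps the pandas index out of the resulting schema
import pyarrow as pa

parquet_schema = pa.Schema.from_pandas(df, preserve_index=False)
print(parquet_schema)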

Moreover, while using this package, you cannot write a pandas DF directly: it should first be converted to a pyarrow.Table (using the from_pandas() method) before the dataset can be written to a file with the write_table() method.

Although Method #3 is a bit more verbose than the others, the Apache Arrow format is particularly recommended if declaring a schema and having column statistics available is paramount for your use case.
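For instance, once 'example_pa.parquet' has been written, the per-column statistics stored in its footer can be inspected roughly as follows (a minimal sketch reusing the file and column names from the example above):

# Inspect the min/max/null-count statistics stored for the fc_balance column
# (index 2 in the schema declared above) of the first row group
import pyarrow.parquet as pq

metadata = pq.ParquetFile('example_pa.parquet').metadata
stats = metadata.row_group(0).column(2).statistics
print(stats.min, stats.max, stats.null_count)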

The last and probably most flexible way to write to a parquet file is by using the native PySpark df.write.parquet() method.

Of course, the script below assumes that you are connected to a DB and have managed to load data into a DF, as shown here.
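For context, a minimal, hypothetical sketch of that connection step could look like the following; the credentials and the table name are placeholders rather than values from the linked code, and the Snowflake Spark connector jars are assumed to already be on the classpath:

# Hypothetical sketch: build a SparkSession and load the table into a Spark DF
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('parquet-writer').getOrCreate()

sf_options = {
    'sfURL': '<account>.snowflakecomputing.com',
    'sfUser': '<user>',
    'sfPassword': '<password>',
    'sfDatabase': '<database>',
    'sfSchema': '<schema>',
    'sfWarehouse': '<warehouse>'
}

df = (spark.read
      .format('net.snowflake.spark.snowflake')
      .options(**sf_options)
      .option('dbtable', 'DAILY_BALANCES')  # placeholder table name
      .load())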

Note that in this case mode('overwrite') has been used, but you could easily switch to mode('append') if you wished to write data in batches. PySpark also allows you to specify a large number of options, among which is the preferred 'compression' algorithm:

# EXAMPLE 4 - USING PYSPARK
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from pyspark import SparkConf

# CONNECT TO DB + LOAD DF

# WRITING TO PARQUET (switch to mode('append') to write in batches)
df.write.mode('overwrite') \
    .option('compression', 'gzip') \
    .parquet("example_pyspark_final.parquet")

df.show()
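To double-check the result (not part of the original snippet), the same SparkSession can read the Parquet output back:

# Optional check: read the parquet output back and display a few rows
spark.read.parquet("example_pyspark_final.parquet").show(5)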

In this article, we have explored four different Python libraries that allow you to write data to Parquet files, including Pandas, FastParquet, PyArrow, and PySpark.

Indeed, the Parquet file format is an essential tool for businesses and organisations that need to process and analyse large datasets quickly and efficiently.

You have also learned how to write data in batches, which can further optimise performance and memory usage. By mastering these skills, you will be able to leverage the full power of Parquet and take your data processing and analysis to the next level.

