
Simple Parquet Tutorial and Best Practices | by Arli | Dec, 2022



Photo by Jeriden Villegas on Unsplash

Parquet is a file format that changes the life of anyone concerned with the day-to-day manipulation of data shared between several data users, such as Data Engineers, Data Scientists, Analytics Engineers, and other technical roles.

The principle of Parquet lies in its column-oriented storage: data is usually more homogeneous along columns than along rows, which allows better compression and therefore smaller files and faster processing.

In this tutorial, we’ll outline some best practices to get you started with Parquet.

To begin, we will work with a publicly available dataset of credit card applications. The dataset is available on Kaggle: Credit Card Approval Prediction | Kaggle.

The easiest way to get this data is to install the Kaggle API in your environment, download the dataset, and unpack the archive into your working folder:

pip install kaggle
kaggle datasets download -d rikdifos/credit-card-approval-prediction
unzip credit-card-approval-prediction.zip

Let’s load the packages that are needed for the tutorial. Any version of pyarrow above 6.0.0 should work.

import pandas as pd
import numpy as np
import pyarrow as pa

In the zip archive, you will find credit_record.csv (the monthly credit status of the clients) and application_record.csv (information about the clients). For simplicity, we will only be interested in application_record.csv.

To make things interesting, we will replicate the data 10 times and reset the IDs, making the data frame about 4 million rows and 18 columns.

applications = pd.read_csv('application_record.csv')
# Replicate the data 10 times, drop the old IDs and create a fresh ID column
applications = (
    pd.concat(10 * [applications])
    .reset_index()
    .drop(columns=['ID', 'index'])
    .reset_index()
    .rename(columns={'index': 'ID'})
)

Below is an overview of the first 5 rows of the DataFrame (transposed, so columns are shown as rows):

applications.head(5).T

We notice that the variables named FLAG_X do not all share the same output type, so normalizing them to Booleans is a sound choice.
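A quick way to inspect them is to list the distinct values of each flag column (a small sketch; it assumes every flag column shares the FLAG_ prefix, which is the case in this dataset):

# Print the distinct values taken by each flag column
flag_cols = [c for c in applications.columns if c.startswith('FLAG_')]
for col in flag_cols:
    print(col, sorted(applications[col].unique()))

You should see that some flags come out as ‘Y’/‘N’ strings while others are 0/1 integers; the 0/1 ones are the flags we will convert to Booleans through the schema below.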

First, let’s build some simple features: a monthly version of the household income and the age of the applicant.

# AMT_INCOME_TOTAL is an annual income, so divide by 12 to get a monthly figure
applications['MONTH_INCOME_TOTAL'] = applications['AMT_INCOME_TOTAL']/12
# DAYS_BIRTH counts days backwards from the application date, hence the minus sign
applications['AGE'] = -np.floor(applications['DAYS_BIRTH']/365)

Now we want to share this data with other Data users. For that, we will save the data in CSV and parquet formats to some paths that other users can access. But before that, let’s look at the notion of schema.

For Parquet files, an invaluable practice is to define the schema of the dataset. It significantly increases the consistency and robustness of the data you are sharing, removing any type ambiguity on the columns when transferring data between users.

To get the schema of a pandas DataFrame, just use pyarrow.Schema.from_pandas. Internally, the function maps each DataFrame column type to a type that pyarrow can write to a Parquet file.

my_schema = pa.Schema.from_pandas(applications)
my_schema

From the schema above we notice that we will be better off doing two manipulations:

  • The flag variables are actually Booleans (1/0), and storing them as such saves storage in addition to avoiding any type ambiguity.
  • DAYS_BIRTH is redundant with AGE, so we can drop it.

The Schema object handles both operations with ease (type conversion and column removal). The set method replaces the i-th field with the pyarrow.field object passed as second argument, and remove deletes the i-th field.

my_schema = my_schema.set(12, pa.field('FLAG_MOBIL', 'bool'))
my_schema = my_schema.set(13, pa.field('FLAG_WORK_PHONE', 'bool'))
my_schema = my_schema.set(14, pa.field('FLAG_PHONE', 'bool'))
my_schema = my_schema.set(15, pa.field('FLAG_EMAIL', 'bool'))
my_schema = my_schema.remove(10)
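If you prefer not to hard-code the positions, you can look them up by name with get_field_index. The sketch below is an alternative to the hard-coded indices above (run one version or the other, not both):

# Convert the 0/1 flags to Booleans by looking up their position by name
for col in ['FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'FLAG_EMAIL']:
    my_schema = my_schema.set(my_schema.get_field_index(col), pa.field(col, 'bool'))

# Drop DAYS_BIRTH, which is redundant with AGE
my_schema = my_schema.remove(my_schema.get_field_index('DAYS_BIRTH'))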

Now let’s compare the execution times of saving Parquet and CSV files:

%%time
applications.to_parquet('applications_processed.parquet', schema = my_schema)

%%time
applications.to_csv('applications_processed.csv')

Unlike a Parquet schema, the CSV format does not allow any type declaration. There is also a significant difference in execution time: saving in Parquet is 5–6 times faster than saving in CSV.

You just witnessed the processing speed offered by Parquet files.

As for storage size, the Parquet file is roughly 16 times smaller than the CSV in this example (636 MB for the CSV versus 39 MB for the Parquet file).
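You can check the sizes on your own machine with a few lines (the exact figures will vary with your environment and pandas/pyarrow versions):

import os

# Compare the on-disk size of the two files, in megabytes
for path in ['applications_processed.csv', 'applications_processed.parquet']:
    print(path, round(os.path.getsize(path) / 1e6), 'MB')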

Overall, processing speed and storage reduction are the main advantages of Parquet files, but they are not the only ones.

Another very interesting point about Parquet is that you can split the data into partitions, meaning that rows sharing the same value of the partition column are stored together.

You can think of partitioning your data as arranging books of the same genre together in your library. It has many benefits, just as with arranging books:

  • the users of the data can access a specific group of data, which significantly increases loading speed and reduces RAM consumption
  • the producers of the data can parallelize the processing, which scales with the data size and keeps run-times under control

I will show you below how to produce Parquet partitioned data.

Looking at the column NAME_INCOME_TYPE, we observe that there are only 5 distinct values, each referring to a category of professional activity of the client.

Suppose now that we, as the producer of the data, want to save the Parquet files partitioned on this column, because the users of the data are interested in looking at these professional activities separately:

applications.to_parquet('APPLICATIONS_PROCESSED', schema = my_schema, partition_cols=['NAME_INCOME_TYPE'])

Notice that when we save to Parquet with one or more partition columns, we have to provide a folder path instead of a Parquet file path, because the method to_parquet will create all the subfolders and Parquet files needed with respect to the partition_cols.

The generated APPLICATIONS_PROCESSED folder now contains a folder for each category of credit applicants based on their NAME_INCOME_TYPE.
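You can verify the layout with a quick listing; the folder names follow the hive-style NAME_INCOME_TYPE=<value> convention (only ‘Working’ and ‘State servant’ are mentioned explicitly in this tutorial, the other values depend on the dataset):

import os

# One subfolder per partition value
sorted(os.listdir('APPLICATIONS_PROCESSED'))
# e.g. ['NAME_INCOME_TYPE=State servant', ..., 'NAME_INCOME_TYPE=Working']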

Now, end-users interested in analysis or decisions on a specific professional category, like ‘State servant’, can load the data at light speed, and only for this group of credit applicants.

We created the data as partitioned Parquet files.

But how can a user access them? That is what we will see next, taking the point of view of the data user.

There are several ways to read Parquet files.

If the data was generated by the producer without partition columns, and we, as a user, are interested in salaried applicants (NAME_INCOME_TYPE equal to ‘Working’), we have to write:

%%time
test = pd.read_parquet('applications_processed.parquet')
test[test.NAME_INCOME_TYPE=='Working']

This operation took nearly 5 seconds. It is still better than reading a CSV, but it is not optimal because the user loads all the data and only filters it afterwards, wasting precious RAM and computation time.

We just saw that partitions exist, and a skilled Data Engineer generated the data partitioned by NAME_INCOME_TYPE, so we can speed up the read and reduce RAM consumption by simply loading the partition of interest.

There are two ways to do that with similar execution times:

  • We can read the partition path for NAME_INCOME_TYPE = ‘Working’ directly
  • Or we can use the filters argument to reach the same partition. Instead of going straight to the path, the filters option scans all the partitions in the folder and picks the one(s) that satisfy your condition, here NAME_INCOME_TYPE = ‘Working’.

The 2 possibilities are listed below:

%%time
pd.read_parquet('APPLICATIONS_PROCESSED/NAME_INCOME_TYPE=Working/')
# OR (the run-time is similar for either way of reading)
pd.read_parquet('APPLICATIONS_PROCESSED', filters=[('NAME_INCOME_TYPE', '=', 'Working')])

Did you see the increase in speed? It is about 3 times faster, and you did not have to load the entire dataset, saving a lot of RAM on your machine.
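You can also quantify the RAM side of the claim by comparing the in-memory footprint of the full DataFrame with that of a single partition (a small sketch; the variable names are just for illustration):

# Compare the in-memory size, in MB, of the full data versus a single partition
full = pd.read_parquet('applications_processed.parquet')
working = pd.read_parquet('APPLICATIONS_PROCESSED', filters=[('NAME_INCOME_TYPE', '=', 'Working')])

print('Full data:', round(full.memory_usage(deep=True).sum() / 1e6), 'MB in RAM')
print('Working partition:', round(working.memory_usage(deep=True).sum() / 1e6), 'MB in RAM')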

There is one difference between the two ways of reading partitioned data, though it is nearly imperceptible in our example: the cost of reading with filters depends on the number of partitions in the folder. Here we only have 5 partitions on NAME_INCOME_TYPE, so the run-times of the path and filters approaches are the same. With, say, 1,000 partitions, the difference would become significant, since Parquet has to discover all the partitions and return the ones that match your filters.

That said, reading with filters is much more powerful in terms of flexibility, and I strongly encourage you to try it if the trade-off in run-time is acceptable.

A simple example: if you want to read two or more specific partitions (but not all of them), you cannot do that efficiently by reading partition paths, but you can with filters.

For instance, suppose we are interested not only in the ‘Working’ applicants but also in the ‘State servant’ applicants:

filters = [('NAME_INCOME_TYPE', 'in', ['Working', 'State servant'])]
pd.read_parquet('APPLICATIONS_PROCESSED', filters=filters)

You might also want to load the entire dataset, even though it is partitioned on NAME_INCOME_TYPE, by simply giving the folder path:

pd.read_parquet('APPLICATIONS_PROCESSED')

Notice that in the DataFrame generated this way, the column NAME_INCOME_TYPE is present, whereas it was absent when we read a single partition folder directly.

This is normal behavior: the partition value is stored in the folder name rather than inside the Parquet files. When you point directly at a partition path, Parquet assumes you already know which value you are reading (you picked the NAME_INCOME_TYPE folder yourself), so it does not add a column that would contain a single repeated value. Reading from the root folder, with or without filters, keeps the column.
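You can see this behavior directly by checking the columns of the two results (a minimal sketch using the folders created above):

# Reading from the root folder keeps the partition column...
full = pd.read_parquet('APPLICATIONS_PROCESSED')
print('NAME_INCOME_TYPE' in full.columns)   # True

# ...while reading a partition folder directly drops it
one_partition = pd.read_parquet('APPLICATIONS_PROCESSED/NAME_INCOME_TYPE=Working/')
print('NAME_INCOME_TYPE' in one_partition.columns)   # False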

To sum up, we outlined best practices for using Parquet, including defining a schema and partitioning data. We also highlighted the advantages of Parquet files in terms of processing speed and storage efficiency, both on disk and in RAM. Finally, we took the perspective of the data user and saw how simple and relevant it is to work with Parquet files.

Thanks for reading and see you in another story!

