Optimize your costs and speed up your queries
Partitioning and clustering are two very effective techniques to minimize query costs and increase performance (using fewer resources while improving speed).
The idea behind these techniques is to limit the amount of data that needs to be read when running a query.
Essentially, instead of reading through every record in a table, we only need to read smaller sections of it.
If you are unsure whether to use partitions or clusters, I would still recommend setting up both of them, as it will provide speed and cost benefits. You can use both techniques independently or combined to get optimal results.
Partitions
Table partitioning is a technique for splitting large tables into smaller ones.
Here’s an example of converting a classic table to a table partitioned by date.
BigQuery stores the different partitions separately at a physical level (meaning the data will be stored on different servers).
When you query a partitioned table, BigQuery automatically determines which partitions to access, minimizing the data that must be read.
You can create a partitioned table based on a column, also known as a partitioning key. In BigQuery, you can partition your table using different keys:
- Time-unit column: Tables are partitioned based on a time value such as timestamps or dates.
- Ingestion time: Tables are partitioned based on the timestamp when BigQuery ingests the data.
- Integer range: Tables are partitioned based on a number.
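Each of the three key types can be sketched in BigQuery DDL roughly as follows (the dataset, table, and column names here are hypothetical, chosen only to illustrate the syntax):

```sql
-- Time-unit column: partition on a DATE/TIMESTAMP column of the table.
CREATE TABLE mydataset.events_by_day
PARTITION BY DATE(event_timestamp)
AS SELECT * FROM mydataset.events;

-- Ingestion time: partition on the _PARTITIONDATE pseudo-column,
-- filled in by BigQuery when the data is ingested.
CREATE TABLE mydataset.events_by_ingestion (event_id STRING)
PARTITION BY _PARTITIONDATE;

-- Integer range: buckets of width 10 over user_id values 0–99.
CREATE TABLE mydataset.users_by_id (user_id INT64)
PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 100, 10));
```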
BigQuery has a limit of 4,000 partitions per table.
Our starting table
For our experiment, we have a table called stackoverflow.question, which is 35 GB in size and has 22 million rows.
Before we do any partitioning or clustering, we want a benchmark of speed (how long our query takes to run) and volume (how much data is read) on a classic table.
The following query, which we will refer to as our base query, will be used to compare performance.
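The original base query is not reproduced here; a hypothetical query in the same spirit, filtering on the tags and creation_date columns used later in this article, might look like:

```sql
-- Hypothetical base query: count questions for a given tag over one year.
-- Column names (tags, creation_date) are taken from the rest of the article.
SELECT tags, COUNT(*) AS n_questions
FROM stackoverflow.question
WHERE tags LIKE '%python%'
  AND creation_date BETWEEN '2018-01-01' AND '2019-01-01'
GROUP BY tags;
```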
With cached results disabled, our query took 12.7 seconds to run and scanned 35 GB of data. In this case, we have to scan all the rows in our table, including all the tags and all the dates.
A concrete example of partitioning a table
We will use a time-unit column (in our case a date column) to make our partitioning.
Now let’s create a partitioned table using our creation_date column as the partitioning key.
Instead of partitioning by day, we truncate the creation date to the month with DATE_TRUNC(creation_date, MONTH), because it reduces the number of partitions needed for this table.
If we tried to partition by day, BigQuery would return an error:
Too many partitions produced by query, allowed 4000, query produces at least 4967 partitions
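A monthly-partitioned copy of the table can be created like this (the new table name is an assumption for illustration):

```sql
-- Create a copy of the table, partitioned by month of creation_date.
-- DATE_TRUNC in PARTITION BY assumes creation_date is a DATE column;
-- for a TIMESTAMP column you would use TIMESTAMP_TRUNC instead.
CREATE TABLE stackoverflow.question_partitioned
PARTITION BY DATE_TRUNC(creation_date, MONTH)
AS
SELECT * FROM stackoverflow.question;
```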
Let’s run our base query again, but this time using our partitioned table.
It took 9.7 seconds to run this query and scanned only 7.1 GB of data. The speed did not improve much, but the amount of data scanned is far smaller, leading to cheaper queries.
Hint: you can add extra options when partitioning, for example specifying an expiration time for partitions or asking BigQuery to ensure that users always use a partition filter (and don’t query the entire table by mistake):
- partition_expiration_days: BigQuery deletes the data in a partition when it expires, i.e. when the partition is older than the number of days specified here.
- require_partition_filter: Users can’t query without filtering (WHERE clause) on your partition key.
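These options can be set in the OPTIONS clause at creation time; a sketch, reusing the hypothetical partitioned table name from above and example values for both options:

```sql
CREATE TABLE stackoverflow.question_partitioned
PARTITION BY DATE_TRUNC(creation_date, MONTH)
OPTIONS (
  partition_expiration_days = 365,  -- partitions older than a year are deleted
  require_partition_filter = true   -- queries must filter on creation_date
)
AS
SELECT * FROM stackoverflow.question;
```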
You can always change the values of the options by altering the table.
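For example, a later ALTER TABLE statement could adjust both options (values here are illustrative, and the table name is assumed):

```sql
ALTER TABLE stackoverflow.question_partitioned
SET OPTIONS (
  partition_expiration_days = 90,
  require_partition_filter = false
);
```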
All the information about options, partitioning, and clustering can be found under the Details section of your table.
Clusters
Clustering allows BigQuery to keep similar data close together, so a query scans less data.
Here’s an example of converting a classic table to a table clustered by name.
Based on the values in the column you chose for clustering, BigQuery automatically sorts these values and decides how to store them in optimal storage blocks.
Clustering works best on columns with high cardinality, meaning columns with many distinct values such as emails, user IDs, names, or product categories.
BigQuery lets you cluster on multiple columns and across different data types (STRING, DATE, NUMERIC, etc.).
BigQuery has a limit of 4 cluster columns per table.
A concrete example of clustering a table
We will use string values (in our case the tags column from our StackOverflow table) for our clustering.
To create a clustered table, the steps are the same as for partitioning, but the syntax uses a CLUSTER BY clause.
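A sketch of the statement (again, the new table name is an assumption):

```sql
-- Create a copy of the table, clustered on the tags column.
CREATE TABLE stackoverflow.question_clustered
CLUSTER BY tags
AS
SELECT * FROM stackoverflow.question;
```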
We can now verify that our table is clustered on the values contained in the tags column.
To evaluate the performance, we run the same query as before, now using the clustered table.
It took 2.9 seconds to run this query, but we had to scan 35.8 GB of data. In this case the speed improved, but not the amount of data scanned.
Partitioning + Clustering
We can now easily combine both techniques to optimize even more.
Here’s an example of converting a classic table to a table clustered by name and partitioned by date.
As you can see, it is very similar to a partitioned table, except that similar values are now sorted together inside each partition.
Now to create a table that is using both techniques, it is as simple as writing both clauses when creating the table.
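Combining the two earlier sketches (with another assumed table name):

```sql
-- Create a copy of the table, partitioned by month and clustered on tags.
CREATE TABLE stackoverflow.question_partitioned_clustered
PARTITION BY DATE_TRUNC(creation_date, MONTH)
CLUSTER BY tags
AS
SELECT * FROM stackoverflow.question;
```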
We can observe in the table details that this table is using both partitioning and clustering.
To analyze the performance once more, we execute the same query as before, but this time we use our most recently generated table, which employs both clustering and partitioning.
It took 3.1 seconds to run this query and scanned only 7.1 GB of data. This combines what we saw with each technique on its own: the speed gain from clustering and the cost reduction from partitioning.
In general, it is good practice to combine and implement both techniques. When you ingest data from external sources, with a cron job or via streaming, it can be interesting to partition by ingestion time.
I hope you found this article useful and that you’ll now be able to create and implement clusters and partitions in your projects!
Additional resources
Resources from the Google Cloud Platform documentation on Partitioned Tables.
Resources from the Google Cloud Platform documentation on Clustered Tables.