Optimize your costs and speed up your queries
Partitioning and clustering are two very effective techniques to minimize query costs and increase performance (using fewer resources while improving speed).
The idea behind these techniques is to limit the amount of data that needs to be read when running a query.
Essentially, instead of reading through every record in a table, we only need to read smaller sections of it.
If you are unsure whether to use partitions or clusters, I would still recommend setting up both of them, as it will provide speed and cost benefits. You can use both techniques independently or combined to get optimal results.
Partitions
Table partitioning is a technique for splitting large tables into smaller ones.
Here’s an example of converting a classic table to a table partitioned by date.
BigQuery stores the different partitions separately at a physical level (meaning the data will be stored on different servers).
When you query a partitioned table, BigQuery automatically determines which partitions to access, minimizing the data that must be read.
You can create a partitioned table based on a column, also known as a partitioning key. In BigQuery, you can partition your table using different keys:
- Time-unit column: Tables are partitioned based on a time value such as timestamps or dates.
- Ingestion time: Tables are partitioned based on the timestamp when BigQuery ingests the data.
- Integer range: Tables are partitioned based on a number.
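Each of the three key types can be sketched in BigQuery DDL roughly as follows (the dataset, table, and column names here are hypothetical, chosen only to illustrate the syntax):

```sql
-- Time-unit column: partition on a DATE/TIMESTAMP column of the table.
CREATE TABLE mydataset.events_by_day
PARTITION BY DATE(event_timestamp)
AS SELECT * FROM mydataset.events;

-- Ingestion time: partition on the _PARTITIONDATE pseudo-column,
-- filled in by BigQuery when the data is ingested.
CREATE TABLE mydataset.events_by_ingestion (event_id STRING)
PARTITION BY _PARTITIONDATE;

-- Integer range: buckets of width 10 over user_id values 0–99.
CREATE TABLE mydataset.users_by_id (user_id INT64)
PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 100, 10));
```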
BigQuery has a limit of 4,000 partitions per table.
Our starting table
For our experiment, we have a table called stackoverflow.question, which is 35 GB in size and has 22 million rows.
Before we do any partitioning or clustering, we want a benchmark of speed (how long our query takes to run) and volume (how much data is read) on a classic table.
The following query, which we will refer to as our base query, will be used to compare performance.
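The original base query is not reproduced here; a hypothetical query in the same spirit, filtering on the tags and creation_date columns used later in this article, might look like:

```sql
-- Hypothetical base query: count questions for a given tag over one year.
-- Column names (tags, creation_date) are taken from the rest of the article.
SELECT tags, COUNT(*) AS n_questions
FROM stackoverflow.question
WHERE tags LIKE '%python%'
  AND creation_date BETWEEN '2018-01-01' AND '2019-01-01'
GROUP BY tags;
```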
With cached results disabled, our query took 12.7 seconds to run and scanned 35 GB of data. In this case, we have to scan all the rows in our table, including all the tags and all the dates.
A concrete example of partitioning a table
We will use a time-unit column (in our case a date column) to make our partitioning.
Now let’s create a partitioned table using our creation_date column as the partitioning key.
Instead of partitioning by day, we truncate the creation date to the month with DATE_TRUNC(creation_date, MONTH), because it reduces the number of partitions needed for this table.
If we tried to partition by day, BigQuery would return an error:
Too many partitions produced by query, allowed 4000, query produces at least 4967 partitions
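A monthly-partitioned copy of the table can be created like this (the new table name is an assumption for illustration):

```sql
-- Create a copy of the table, partitioned by month of creation_date.
-- DATE_TRUNC in PARTITION BY assumes creation_date is a DATE column;
-- for a TIMESTAMP column you would use TIMESTAMP_TRUNC instead.
CREATE TABLE stackoverflow.question_partitioned
PARTITION BY DATE_TRUNC(creation_date, MONTH)
AS
SELECT * FROM stackoverflow.question;
```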
Let’s run our base query again, but this time using our partitioned table.
It took 9.7 seconds to run this query and scanned only 7.1 GB of data. The speed did not improve much, but the amount of data scanned is far smaller, leading to cheaper queries.
Hint: you can add extra options when partitioning, for example specifying an expiration time for partitions or asking BigQuery to ensure that users always use a partition filter (and don’t query the entire table by mistake):
- partition_expiration_days: BigQuery deletes the data in a partition when it expires, i.e. when the partition is older than the number of days specified here.
- require_partition_filter: Users can’t query without filtering (WHERE clause) on your partition key.
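These options can be set in the OPTIONS clause at creation time; a sketch, reusing the hypothetical partitioned table name from above and example values for both options:

```sql
CREATE TABLE stackoverflow.question_partitioned
PARTITION BY DATE_TRUNC(creation_date, MONTH)
OPTIONS (
  partition_expiration_days = 365,  -- partitions older than a year are deleted
  require_partition_filter = true   -- queries must filter on creation_date
)
AS
SELECT * FROM stackoverflow.question;
```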
You can always change the values of the options by altering the table.
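For example, a later ALTER TABLE statement could adjust both options (values here are illustrative, and the table name is assumed):

```sql
ALTER TABLE stackoverflow.question_partitioned
SET OPTIONS (
  partition_expiration_days = 90,
  require_partition_filter = false
);
```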
All the information about options, partitioning, and clustering can be found under the Details section of your table.
Clusters
Clustering allows BigQuery to keep similar data close together, so a query scans less data.
Here’s an example of converting a classic table to a table clustered by name.
Based on the values in the column you chose for clustering, BigQuery automatically sorts these values and decides how to store them in optimal storage blocks.
Clustering works best on columns with high cardinality, meaning columns with many distinct values such as emails, user IDs, names, or product categories.
BigQuery lets you cluster on multiple columns and across different data types (STRING, DATE, NUMERIC, etc.).
BigQuery has a limit of 4 cluster columns per table.
A concrete example of clustering a table
We will use string values (in our case the tags column from our StackOverflow table) for our clustering.
To create a clustered table, the steps are the same as for partitioning, but the syntax uses a CLUSTER BY clause.
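A sketch of the statement (again, the new table name is an assumption):

```sql
-- Create a copy of the table, clustered on the tags column.
CREATE TABLE stackoverflow.question_clustered
CLUSTER BY tags
AS
SELECT * FROM stackoverflow.question;
```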
We can now verify that our table is clustered on the values contained in the tags column.
To evaluate the performance, we run the same query as before, now using the clustered table.
It took 2.9 seconds to run this query, but we had to scan 35.8 GB of data. In this case the speed improved, but not the amount of data scanned.
Partitioning + Clustering
We can now easily combine both techniques to optimize even more.
Here’s an example of converting a classic table to a table clustered by name and partitioned by date.
As you can see, it is very similar to a partitioned table, except that similar values are now sorted together inside each partition.
Now to create a table that is using both techniques, it is as simple as writing both clauses when creating the table.
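Combining the two earlier sketches (with another assumed table name):

```sql
-- Create a copy of the table, partitioned by month and clustered on tags.
CREATE TABLE stackoverflow.question_partitioned_clustered
PARTITION BY DATE_TRUNC(creation_date, MONTH)
CLUSTER BY tags
AS
SELECT * FROM stackoverflow.question;
```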
We can observe in the table details that this table is using both partitioning and clustering.
To analyze the performance once more, we execute the same query as before, but this time we use our most recently generated table, which employs both clustering and partitioning.
It took 3.1 seconds to run this query and scanned only 7.1 GB of data. This combines what we saw with each technique on its own: the speed gain from clustering and the cost reduction from partitioning.
In general, it is good practice to combine and implement both techniques. When you ingest data from external sources, with a cron job or via streaming, it can be interesting to partition by ingestion time.
I hope you found this article useful and that you’ll now be able to create and implement clusters and partitions in your projects!
Additional resources
Resources from the Google Cloud Platform documentation on Partitioned Tables.
Resources from the Google Cloud Platform documentation on Clustered Tables.