
SQL Anti-Patterns for BigQuery



Best practices and things to avoid when running SQL on Google Cloud BigQuery


BigQuery is the managed data warehouse service on Google Cloud Platform, and like most services and technologies it comes with a set of principles you need to keep in mind while using it.

In the next few sections we outline a set of best practices for avoiding common anti-patterns that tend to hurt performance in BigQuery. Applying these best practices is important mainly for two reasons: they will help you write more efficient queries and, applied correctly, they will also reduce your costs.

Avoid SELECT *

Selecting every field from a table is a very common anti-pattern that should be avoided whenever possible. SELECT * results in a full scan of every column in the table, which makes it an expensive operation to execute.

Query only the columns you need

Also remember that LIMIT does not reduce the volume of bytes read, so you will still pay for the full scan over every single column. Ensure that you query only the columns you actually need. If you still need to run SELECT *, consider partitioning your table so that you can restrict the query to one or a few partitions, as sketched below.
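
As a minimal sketch, the sales table, its columns, and the project.dataset path below are assumptions made for illustration; the point is simply to select explicit columns and restrict the scan with a partition filter:

  -- NOTE: table and column names are placeholders
  -- Anti-pattern: scans every column; LIMIT does not reduce bytes billed
  SELECT *
  FROM `project.dataset.sales`
  LIMIT 10;

  -- Better: read only the columns you need and filter on the partition column
  SELECT order_id, customer_id, amount
  FROM `project.dataset.sales`
  WHERE order_date = DATE '2023-01-01';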

Avoid Self-Joins

Performing a self-join on a table is another thing you should avoid. Naturally, you might ask what the difference is between joining two separate tables and joining a table to itself.

Mechanically, the answer is none; it is pretty much the same operation. The point is that whenever you are about to perform a self-join, chances are you can achieve the same result with a window function, which is the more elegant approach.

Avoid self-joins and use window functions instead

A self-join can increase the number of output rows, which degrades query performance, and it also increases the number of bytes processed, which raises the cost of running such queries.
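
For instance, pairing each row with the previous row for the same user is often written as a self-join. The sketch below assumes a hypothetical events table with user_id and event_ts columns; the window-function version does the same work in a single pass:

  -- NOTE: table and column names are placeholders
  -- Self-join anti-pattern: look up each user's previous event timestamp
  SELECT a.user_id, a.event_ts, MAX(b.event_ts) AS previous_event_ts
  FROM `project.dataset.events` AS a
  LEFT JOIN `project.dataset.events` AS b
    ON b.user_id = a.user_id AND b.event_ts < a.event_ts
  GROUP BY a.user_id, a.event_ts;

  -- Window-function alternative: no join, a single scan of the table
  SELECT
    user_id,
    event_ts,
    LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) AS previous_event_ts
  FROM `project.dataset.events`;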

Dealing with data skewness

Data skewness occurs when your data is divided into unevenly sized partitions. Behind the scenes, BigQuery distributes these partitions across slots, the virtual CPUs it uses to execute SQL queries in a distributed fashion.

A single partition cannot be shared across slots, so if you have created imbalanced partitions, some slots will end up with significantly more work than others, and in extreme cases an oversized partition can even crash the slot processing it.

When you partition your table on a key whose values occur with very different frequencies, you will most probably end up with unequally sized partitions. In such cases, applying filters early on helps reduce this kind of imbalance.

If your data is skewed, apply filtering as early as possible

You may also have to reconsider the partitioning key itself. For example, avoid partitioning a table on a key with many NULL values, since all of those rows will land in one huge partition. A commonly used partitioning key is a date field, which gives a reasonably even distribution of data across partitions (assuming you ingest roughly the same amount of data per day, month, or year).
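
As a sketch, assuming a hypothetical table partitioned on an event_date column, the partition filter goes straight into the WHERE clause so skewed or oversized partitions are pruned before the heavier work begins:

  -- NOTE: table and column names are placeholders
  -- Apply the partition filter as early as possible so only the relevant
  -- partitions are scanned before the GROUP BY does its work
  SELECT device_type, COUNT(*) AS events
  FROM `project.dataset.events`
  WHERE event_date BETWEEN DATE '2023-01-01' AND DATE '2023-01-31'
  GROUP BY device_type;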

Cross-Joins

Cross-joins generate the Cartesian product of two tables, that is, a result consisting of every possible combination of records from the tables involved. In simpler terms, every row of the first table is joined to every single row of the second table, so the result contains M × N rows, where M and N are the respective table sizes.

Avoid joins that produce more output rows than input rows

This means that a cross-join typically returns more output rows than input rows, which is something you usually want to avoid. As general advice, in such cases consider two potential workarounds (the second is sketched after the list below):

  1. Evaluate whether a window function — which is way more efficient than a cross join — can help you get the result you are looking for
  2. Perform a pre-aggregation using GROUP BY prior to the join
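
A minimal sketch of the second workaround, using assumed orders and customers tables: aggregating first means the join touches one row per customer instead of every matching combination of rows:

  -- NOTE: table and column names are placeholders
  -- Pre-aggregate with GROUP BY so the join operates on one row per customer
  WITH order_totals AS (
    SELECT customer_id, SUM(amount) AS total_amount
    FROM `project.dataset.orders`
    GROUP BY customer_id
  )
  SELECT c.customer_id, c.country, t.total_amount
  FROM `project.dataset.customers` AS c
  JOIN order_totals AS t
    ON t.customer_id = c.customer_id;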

Prefer table partitioning over sharding

Table sharding is an approach that stores data in multiple separate tables whose names follow a pattern such as [PREFIX]_YYYYMMDD. Many users consider this technique equivalent to partitioning, but in reality it is not.

Table sharding requires BigQuery to maintain metadata and a schema for every single table. Additionally, whenever a query runs, the platform has to verify permissions for each individual table, which has a significant performance impact.

Table partitioning is more efficient than table sharding

In general, partitioned tables perform better, so you should prefer them over sharded tables. Partitioned tables are also easier to handle when it comes to filtering and cost reduction.
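
For comparison, here is a sketch with hypothetical table names: the first query shows how sharded tables are typically read through a wildcard and _TABLE_SUFFIX, while the CREATE TABLE statement that follows defines a single date-partitioned table whose partition column can be filtered directly:

  -- NOTE: table and column names are placeholders
  -- Sharded approach: one table per day, read through a wildcard and _TABLE_SUFFIX
  SELECT user_id, amount
  FROM `project.dataset.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20230131';

  -- Partitioned alternative: a single table partitioned by a DATE column
  CREATE TABLE `project.dataset.events_partitioned`
  (
    event_date DATE,
    user_id STRING,
    amount NUMERIC
  )
  PARTITION BY event_date;

  SELECT user_id, amount
  FROM `project.dataset.events_partitioned`
  WHERE event_date BETWEEN DATE '2023-01-01' AND DATE '2023-01-31';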

Don’t treat BigQuery as an OLTP system

Like most data warehouse solutions, BigQuery is an OLAP (Online Analytical Processing) system, which means it is designed to work efficiently with extremely large volumes of data using table scans. Therefore, DML statements on BigQuery are meant to be used for bulk updates.

BigQuery is an OLAP system and needs to be treated as such

Using DML statements to perform small, row-level changes means you are trying to treat BigQuery as an OLTP (Online Transaction Processing) system. If that is the case, you should reconsider your design, or even the tools you are using; there is a good chance that an OLTP system (such as Cloud SQL on Google Cloud Platform) is more suitable. Alternatively, if your design involves regular small inserts, consider other approaches such as streaming ingestion instead.
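
As a sketch with assumed users and user_status_updates tables, prefer a single set-based statement over many single-row statements:

  -- NOTE: table and column names are placeholders
  -- OLTP-style anti-pattern: issuing many single-row updates one at a time
  -- UPDATE `project.dataset.users` SET status = 'active'   WHERE user_id = '123';
  -- UPDATE `project.dataset.users` SET status = 'inactive' WHERE user_id = '456';

  -- OLAP-friendly alternative: one bulk MERGE driven by a staging table
  MERGE `project.dataset.users` AS u
  USING `project.dataset.user_status_updates` AS s
  ON u.user_id = s.user_id
  WHEN MATCHED THEN
    UPDATE SET status = s.status;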

For more details regarding the main differences between OLAP and OLTP systems you can read one of my latest articles.

Final Thoughts

Applying best practices and avoiding common anti-patterns in BigQuery is extremely important, as these principles will help you improve the performance of your system as well as reduce your costs.

To summarise,

  • Avoid SELECT * and instead, make sure you query only the fields you need
  • Prefer window functions over self-joins whenever possible (e.g. if what you need to compute is row-dependent)
  • Pick partitioning keys wisely in order to avoid data skewness. Whenever this is not possible, make sure you apply filters as early as possible
  • Avoid joins that will generate more outputs than inputs
  • Prefer table partitioning over table sharding as the former is more efficient and cost-effective
  • Avoid small, row-level DML statements; BigQuery is an OLAP system and needs to be treated as such
