The World of SQL Query Optimization | by Kovid Rathee | Mar, 2023

By Jessie Hobb On Mar 28, 2023

DATA ENGINEERING

A peek into the different query optimizers and how they work

SQL is a simple language with very few rules, which is what makes it so popular. It is also enriched with a great number of keywords and features that allow you to interact with your data in all kinds of ways. With this flexibility comes a lot of variance in query writing styles and choices.

Once you issue a query to your database, it has to parse your query to understand its flow, but that’s not where the work of the database ends. The database engine also has a component to look at your query and, in a sense, rewrite it without changing what it does at all to allow for better performance and response times. This mighty component, unsurprisingly, is called the query optimizer.

But how would the optimizer rewrite the query? What source of extra information does it have to act on the query? These are the questions that when answered, classify different types of query optimizers. Broadly speaking, there are four kinds of query optimization techniques, as mentioned below:

Heuristics-based optimization acts based on pre-defined rules in the database, for instance, giving priority to a clustered index.
Cost-based optimization works based on query cost estimation, including estimating the cost of resources the query might use.
Hybrid optimization uses both heuristics-based and cost-based optimization techniques. Most relational databases use this technique.
Adaptive optimization is distinct from all the above techniques in that it changes the query execution plan while the query is executing. This is quite useful in distributed systems.

In the second year of my computer sciences degree, back in 2008, when I was studying database management systems, although I did have an affinity towards databases, I didn’t actually realize till much later how much query optimization and database performance tuning techniques will come in handy during my work. I can attest to the fact that understanding database-specific query optimizers has been the most fun and most useful skill for writing good, efficient SQL queries. It makes a world of difference.

Let’s look at the different types of query optimizers in a bit more detail.

Heuristics-based query optimization, also called rule-based query optimization, uses a set of rules of thumb and core principles to determine the most efficient way to execute a query by rearranging the relational algebra operations that make a query tree (an internal representation of the query).

I’m sure you’ve heard of the popular discussion among bugging SQL engineers: does the order in which I join tables matter? Well, the short answer to this question is both yes and no. Yes, because the order matters, but no, because the optimizer fixes the query after parsing it.

Let’s take the most popular example of optimizing queries — using the WHERE clause. Now, it’s a rule of thumb that you should make the selection as early as possible. This is because if you filter out the data that you don’t need early in the process, you’ll also limit the data that you need to join, order, and group.

Several other rules around certain types of joins, CTEs, sorting, etc., can be called heuristics, hence the name heuristics-based optimization. Optimizing a SQL query is not just about relational algebra and heuristics. It matters which database you are using and what flavor of SQL you are using, among other things.

Note that this method of query optimization is purely theoretical and doesn’t consider a lot of the real-world information from database statistics, which is why you won’t see databases using this type of optimization alone to optimize your SQL queries. They’ll also look into cost-based query optimization, which we’ll discuss in the next section.

This method takes a more practical approach to optimization. It takes into consideration the constraints of memory, CPU, network, etc., along with the data statistics that the database maintains, such as row count, average row size, table size, and so on.

While heuristics-based optimization reorders the query tree based on rules of thumb, cost-based optimization involves evaluating several query execution plans and choosing the best out of them based on the estimated cost of running the query. This cost includes everything from the estimated time, the number of rows scanned, the number of indexes used, and more.

As cost-based optimization evaluates many query execution plans, it might take more time to come up with the right execution plan. The optimizer might also be slowed down because of the overhead of collecting and analyzing statistics from the database. Still, it usually turns out to be more efficient than heuristics-based optimization.

Most databases don’t use a cost-based optimizer alone. They usually go for a hybrid approach, with most of the optimization based on cost, while the rest is based on heuristics, user hints, etc. Jiaqi Yan of Snowflake talks about this in one of the CMU DB Quarantine Tech Talks sessions. In the next section, with the help of a few examples, let’s see how hybrid query optimization works.

A hybrid query optimization approach just takes both heuristics-based optimization and cost-based optimization into consideration and comes up with a balanced query optimization plan. We won’t go into the details of how hybrid query optimization works because it wildly differs from one database to another.

Till now, we discussed query optimization techniques that only looked at and improved the query execution plan before the start of query execution. Adaptive query optimization is the type of query optimization that tries to improve the plan as the query executes. The use case for this might be obvious to you — distributed data processing systems and query engines that deal with large amounts of data and a high level of parallelism.

Adaptive query optimization pays off really well when you are dealing with long-running queries, the processing tasks for which have been divided into multiple jobs allocated to different compute nodes in your cluster (whatever the tech might be — Redshift, Snowflake, Databricks, and so on). A significant skew in the data, bad partitioning, and file formats, among other things, can slow down your queries.

Adaptive query optimization looks at some of these problems and more while your query is executing. It also looks at how it was supposed to be performing and how it is actually performing with the help of a number of metrics. With such a system in place, the query engine can individual query plan workflows or the whole query plan while the query is executing.

More and more databases, especially the ones that have distributed data processing capabilities, will end up using some version of adaptive query optimization because it solves some of the most time-consuming performance optimization problems like ensuring even data distribution, ensuring the optimal partition size based on your memory and compute availability, and so on. I’d highly recommend going through the session from Databricks Data+AI 2020 I’ve shared in the references. It has some practical insights into how adaptive query optimization is implemented internally.

This article explored the different types of query optimizers used in various databases, data warehouses, and data processing engines. The point of being aware of the kind of optimizer is that it is central to the query optimization and performance tuning aspect of data engineering. The more you scale, the harder it becomes to tune. With an in-depth knowledge of the optimizer, you will be able to understand how it works and how to manipulate its output to suit your needs and requirements.

Think of the ideas around any other type of space and time optimization in applications, such as garbage collection, caching, etc. Understanding query optimizers bring the same benefit to data engineers that understanding the aforementioned concepts brings to software developers.

If you found this article useful, please subscribe and check out all my writings on 🌲 Linktree. You can also consider supporting me by buying a Medium Membership using my referral link. Cheers!

DATA ENGINEERING

A peek into the different query optimizers and how they work

Heuristics-based optimization acts based on pre-defined rules in the database, for instance, giving priority to a clustered index.
Cost-based optimization works based on query cost estimation, including estimating the cost of resources the query might use.
Hybrid optimization uses both heuristics-based and cost-based optimization techniques. Most relational databases use this technique.
Adaptive optimization is distinct from all the above techniques in that it changes the query execution plan while the query is executing. This is quite useful in distributed systems.

Let’s look at the different types of query optimizers in a bit more detail.

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.