Explaining SQL Queries for Better Performance | by Kovid Rathee | Jun, 2022


Photo by Hanna Morris on Unsplash

DATA ENGINEERING

Peeking into the database query execution engine

One of the most common problems that data analysts and data engineers face is non-performant queries, often referred to as slow queries. Often, these queries are slow not because there’s a shortage of resources to process them, but because they are written inefficiently and use far more resources than they should.

Most data analysts and some data engineers don’t know much about database internals. How does one fix and optimize slow queries, then? It turns out that you don’t need to be a database guru to fix slow queries. Most database systems give you a way to peek into their internal workings by exposing how they execute queries. These descriptions are called query plans.

Query optimizers create query plans. An optimizer comes up with alternative plans for executing your query and picks the one that makes the best use of your resources. I’ll discuss the different types of optimizers in detail in another post. Irrespective of the type of optimizer your database uses, it will respect the logical order of execution that most databases subscribe to: FROM and JOIN first, then WHERE, GROUP BY, HAVING, SELECT, ORDER BY, and finally LIMIT.

The optimizer looks at predefined rules and at table and column statistics to figure out a better way to run the query. Some advanced optimizers, such as the adaptive query execution in Spark 3.0, can also change the query plan at runtime. This adaptive way of executing queries is most useful in distributed systems, where query execution can be affected by different nodes finishing their work at different times.

Most databases expose the plan through a simple SQL keyword: EXPLAIN. If you execute EXPLAIN <SQL query>, your database will generate a plan and print it in your GUI or console. Every database uses its own terminology for the different steps in the query execution process. The query plan usually includes the following things for every step:

  • Estimated total cost — the cost of retrieving all available records
  • Estimated number of rows scanned — number of rows scanned/examined in this step (this is where indexes, partitions, etc. come in)
  • Estimated number of rows retrieved — number of rows in the output (of the step, not the whole query)
  • Estimated average width of rows — average width of rows in the output (again, for this step only)

Keep in mind that the cost is expressed in arbitrary units. Some databases map it to the number of database pages fetched; others do things differently. The idea is to read a plan as a whole, taking the total cost, the number of rows fetched, and the number of rows scanned together.
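To make the EXPLAIN call concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table and the explain helper are made up for illustration. Note that SQLite spells the keyword EXPLAIN QUERY PLAN and reports only a description of each step, without the cost and row estimates that other databases include.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)


def explain(conn, query, params=()):
    """Return the plan rows SQLite reports for `query` without running it."""
    # SQLite's readable plan comes from EXPLAIN QUERY PLAN; many other
    # databases use plain EXPLAIN for the same purpose.
    return conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()


plan = explain(conn, "SELECT * FROM orders WHERE customer_id = ?", (42,))
for row in plan:
    # The last column of each row is the human-readable step description.
    print(row[3])
```

With no index on customer_id, the single step reported is a scan of the whole orders table. The exact wording depends on your SQLite version.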

Looking at the query execution plan, you can quickly identify if your query is doing one of the following things (this list is non-exhaustive):

  • Joining tables on the wrong conditions: do you have cross joins or inequality joins where none are needed? A database optimizer can only analyze what it sees in your SQL query; it cannot understand intent. So, when you use a cross join, the plan will show it, and it is up to you to check whether you intended it or not.
  • Missing or unused indexes: some databases state clearly in the plan that you are not using an index. In others, you have to infer the same information by comparing the number of rows scanned with the number of rows retrieved.
  • Alternate indexes: some databases also list alternate indexes you can use, although this is rare. Most of the time, the query optimizer makes the right decision in choosing indexes, but it has a few blind spots where it cannot make the best decision.
  • Unused partitions: just as the plan reveals unused indexes, it can also indicate that you’re not using the partitions you’ve created. Not using partitions can be even costlier, as the database has to go to every partition to search for data that probably lies in only one or a few of them.
  • Specific optimizations: all databases have different internals. They read, optimize, and execute queries differently. There might be some database-specific optimizations applied to a query that enable it to run faster. MySQL, for example, historically exposed these through the EXTENDED keyword; in recent versions, the extended output is produced by default and can be displayed by running SHOW WARNINGS after EXPLAIN. This information can certainly be helpful for more advanced users.

Once you identify the problem, you can take the necessary steps to resolve it.
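As a rough illustration of spotting and fixing a missing index, the sketch below compares the plan for the same query before and after an index is created. It uses SQLite through Python's sqlite3 module; the orders table, the idx_orders_customer index, and the plan_details helper are all assumptions made for the example, and other databases will word their plans differently.

```python
import sqlite3

# Hypothetical table with some sample rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

QUERY = "SELECT * FROM orders WHERE customer_id = 42"


def plan_details(conn, query):
    """Return just the step descriptions from SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]


# Without an index, the filter forces a scan of the whole table.
before = plan_details(conn, QUERY)

# After adding an index on the filtered column, the planner can search it.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan_details(conn, QUERY)

print("before:", before)
print("after: ", after)
```

The first plan should describe a scan of orders and the second a search that uses idx_orders_customer. In databases that report row estimates, the same fix shows up as a sharp drop in the number of rows scanned.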

Most databases provide a way to look at the query execution details (estimated, if not actual), but the details differ from database to database, and so do the options. Some databases only show the estimated execution plan, while others can also show the actual execution plan. How? By actually executing the query and recording what happened at each step alongside the optimizer’s decisions.

EXPLAIN plans also come in different levels of verbosity and different formats. The verbosity is usually controlled by the EXTENDED or VERBOSE keyword, and the exact usage differs from one database to another.
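To give a feel for how the syntax varies, here is a small sketch that builds EXPLAIN statements for a few engines. The EXPLAIN_PREFIXES table and the explain_statement helper are hypothetical names for this example, the list of variants is far from complete, and availability depends on the version you run, so treat it as a starting point and check your database's documentation.

```python
# A few EXPLAIN variants per engine. This list is illustrative, not complete,
# and some variants only exist in newer versions of each engine.
EXPLAIN_PREFIXES = {
    "postgresql": {
        "estimated": "EXPLAIN",
        "verbose": "EXPLAIN (VERBOSE)",
        "actual": "EXPLAIN (ANALYZE)",  # runs the query to collect real numbers
        "json": "EXPLAIN (FORMAT JSON)",
    },
    "mysql": {
        "estimated": "EXPLAIN",
        "actual": "EXPLAIN ANALYZE",  # runs the query to collect real numbers
        "json": "EXPLAIN FORMAT=JSON",
    },
    "spark": {
        "estimated": "EXPLAIN",
        "verbose": "EXPLAIN EXTENDED",
        "formatted": "EXPLAIN FORMATTED",
    },
    "sqlite": {
        "estimated": "EXPLAIN QUERY PLAN",
    },
}


def explain_statement(engine, query, mode="estimated"):
    """Build the EXPLAIN statement for `query` on the given engine."""
    try:
        prefix = EXPLAIN_PREFIXES[engine][mode]
    except KeyError:
        raise ValueError(
            f"no {mode!r} EXPLAIN variant listed for {engine!r}"
        ) from None
    return f"{prefix} {query}"


print(explain_statement("postgresql", "SELECT * FROM orders", "json"))
# prints: EXPLAIN (FORMAT JSON) SELECT * FROM orders
```

Be careful with the "actual" variants: because they execute the query, running them against a statement that modifies data will apply those changes.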

Query execution plans are central to understanding how your queries work and to fixing their performance issues. EXPLAIN helps you achieve both goals by exposing the execution plans. How you make the best use of these plans is up to you: you can read them unformatted, visualize them, or store them as JSON documents for later analysis. See where this takes you. If you’re a data analyst or a data engineer, you’ve definitely got some EXPLAINing to do.
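If you want to keep plans around for later analysis, one possible approach is to serialize them yourself. The sketch below does this for SQLite with Python's sqlite3 and json modules; the orders table, the plan_as_json helper, and the shape of the JSON document are assumptions for illustration. Databases such as PostgreSQL and MySQL can emit JSON plans directly, so there you would store their output instead.

```python
import json
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)


def plan_as_json(conn, query):
    """Capture SQLite's query plan for `query` as a JSON document."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    # Keep the step id, its parent step, and the readable description.
    steps = [{"id": r[0], "parent": r[1], "detail": r[3]} for r in rows]
    return json.dumps({"query": query, "steps": steps}, indent=2)


doc = plan_as_json(
    conn, "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
)
print(doc)
```

Storing the query text next to its steps makes it possible to compare plans for the same query over time, for example before and after a schema change.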

If you find my writings useful, please subscribe and check out my writings on 🌲 Linktree. You can also consider supporting me by buying a Medium Membership using my referral link.

