Analyze performance when aggregating data in Power BI and DAX Queries | by Salvatore Cagliari | May, 2023



Have you ever asked yourself:

What happens behind the scenes of a Power BI Visual?

Or

How can I write a query to get the result shown in a Power BI Visual?

OK, you can capture the query with Performance Analyzer and copy it into a text editor or, even better, into DAX Studio.

But do you understand what happens in the Query?

When you look at the DAX function documentation, either in the Microsoft DAX function reference or on DAX.Guide, you will find at least five functions to generate tables in a query: SELECTCOLUMNS(), ADDCOLUMNS(), SUMMARIZE(), SUMMARIZECOLUMNS(), and CALCULATETABLE().

In this article, I will set the scene with a Base Query. Then I will use the different functions to rebuild the Query from scratch and look at the differences between these functions.

I will look at the functional differences and the differences regarding efficiency and performance.

Let’s start with the base query.

Look at the following Matrix in Power BI:

Figure 1 — Starting Visual (Figure by the Author)

I extracted the query with Performance Analyzer. After removing the subtotal logic that Power BI needs to calculate the totals at the Country and Continent level, the following query remains:

DEFINE
VAR __DS0FilterTable =
FILTER(
KEEPFILTERS(VALUES('Date'[YearIndex]))
,AND('Date'[YearIndex] >= -3, 'Date'[YearIndex] <= 0)
)

VAR __DS0Core =
SUMMARIZECOLUMNS(
'Geography'[ContinentName]
,'Geography'[RegionCountryName]
,__DS0FilterTable
,"Sum_Online_Sales", 'All Measures'[Sum Online Sales]
)

EVALUATE
__DS0Core

ORDER BY
'Geography'[ContinentName]
,'Geography'[RegionCountryName]

The Key function here is SUMMARIZECOLUMNS().

This function gets the distinct values from the two columns [ContinentName] and [RegionCountryName] and executes the Measure [Sum Online Sales] for each row while applying the filter defined in the Variable __DS0FilterTable.

In all the following examples, I will (try to) keep the definition of the __DS0FilterTable as shown above.

With SELECTCOLUMNS(), I can add calculated columns to an input table, for example, with a Measure.

The input table can be an existing table or the result of a table function.

Let’s try this form:

DEFINE
VAR __DS0FilterTable =
FILTER (
KEEPFILTERS ( VALUES ( 'Date'[YearIndex] ) ),
AND ( 'Date'[YearIndex] >= -3, 'Date'[YearIndex] <= 0 )
)

EVALUATE
SELECTCOLUMNS ('Geography'
,'Geography'[ContinentName]
,'Geography'[RegionCountryName]
,"Sum_Online_Sales", [Sum Online Sales]
)

This is the result of this Query:

Figure 2 — Part of the result for ADDCOLUMNS() (Figure by the Author)

As you can see, I get every row from the Geography table, including those without results, even though I selected only two columns for the query.

That is not what I want.

Another problem is that SELECTCOLUMNS() has no filter argument, so I cannot pass the filter defined above to it directly.

Anyway, when looking at the Server Timings, this query doesn’t look that bad:

Figure 3 — Server Timings for ADDCOLUMNS() (Figure by the Author)

Most of the work is done in the Storage Engine, and the parallelism is excellent at almost 7.5.

When copying this result to Excel, we can remove the Empty Rows without problems.
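The empty rows can also be removed in the query itself rather than in Excel. The following is a minimal sketch, assuming the same Geography table and [Sum Online Sales] measure as above; the variable name __GeoWithSales and the explicit column names are my own choices for illustration and are not part of the query captured from Power BI.

```dax
DEFINE
    -- Hypothetical helper variable: the SELECTCOLUMNS() result from above,
    -- with each output column named explicitly
    VAR __GeoWithSales =
        SELECTCOLUMNS (
            'Geography',
            "ContinentName", 'Geography'[ContinentName],
            "RegionCountryName", 'Geography'[RegionCountryName],
            "Sum_Online_Sales", [Sum Online Sales]
        )

EVALUATE
FILTER (
    __GeoWithSales,
    NOT ISBLANK ( [Sum_Online_Sales] ) -- keep only rows where the measure returned a value
)
ORDER BY
    [ContinentName],
    [RegionCountryName]
```

Note that this only hides the rows without a value. The table can still contain several rows per country, because SELECTCOLUMNS() does not summarize its input.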

SELECTCOLUMNS() is very similar to ADDCOLUMNS().

According to DAX.Guide, the difference is that SELECTCOLUMNS() starts with an empty table and then adds the given columns, while ADDCOLUMNS() starts with all columns of the input table.

When we try this query:

EVALUATE
ADDCOLUMNS('Geography'
,"Sum_Online_Sales", [Sum Online Sales]
)

We get a table with all columns of the Geography table, and for each row, the result of the Measure.
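The difference between the two functions is easier to see on a tiny table. The sketch below does not use the Contoso model at all: the inline table __Demo, its two columns, and the column DoubleSales are made up for illustration, so the query should run in DAX Studio against any model.

```dax
DEFINE
    -- Hypothetical two-row table, defined inline for the demonstration
    VAR __Demo =
        DATATABLE (
            "Country", STRING,
            "Sales", INTEGER,
            { { "A", 10 }, { "B", 20 } }
        )

-- SELECTCOLUMNS(): only the columns listed explicitly are returned
EVALUATE
SELECTCOLUMNS ( __Demo, "Country", [Country], "DoubleSales", [Sales] * 2 )

-- ADDCOLUMNS(): all columns of the input table are kept and the new one is appended
EVALUATE
ADDCOLUMNS ( __Demo, "DoubleSales", [Sales] * 2 )
```

The first result contains the columns Country and DoubleSales, while the second contains Country, Sales, and DoubleSales.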

Because both functions accept only one input table, I need other functions to build the table I am after.

I will come back to this issue later in this article.

The function SUMMARIZE() allows me to get a table summarizing the given columns and adding computed columns, for example, with a Measure.

Based on the example above, the query will look like this:

EVALUATE
SUMMARIZE( 'Geography'
,'Geography'[ContinentName]
,'Geography'[RegionCountryName]
,"Sum_Online_Sales", [Sum Online Sales]
)

Again, we cannot add a Filter to this query.

So, we will get the result for all years:

Figure 4 — Result with SUMMARIZE() (Figure by the Author)

But, again, I will get a list of all countries, including those without value.

The Server Timings look good as well:

Figure 5 — Server Timings for SUMMARIZE() (Figure by the Author)

But there are some issues with the SUMMARIZE() function.

You can find the articles describing these issues in the References section below.

Now, I will show you how to complete the job with the correct form.

The SUMMARIZECOLUMNS() function combines the strengths of ADDCOLUMNS() and SUMMARIZE() into one powerful function.

I can pass multiple columns to the function to be used as summarization columns, and I can add calculated columns. And I can pass a filter to the function.

Look at this example:

DEFINE
VAR __DS0FilterTable =
FILTER (
KEEPFILTERS ( VALUES ( 'Date'[YearIndex] ) ),
AND ( 'Date'[YearIndex] >= -3, 'Date'[YearIndex] <= 0 )
)

EVALUATE
SUMMARIZECOLUMNS(
'Geography'[ContinentName]
,'Geography'[RegionCountryName]
,__DS0FilterTable
,"Sum_Online_Sales", [Sum Online Sales]
)
ORDER BY 'Geography'[ContinentName]
,'Geography'[RegionCountryName]

When you look at the query at the beginning of this article, you will find precisely this query.

The result is the following:

Figure 6 — Result with SUMMARIZECOLUMNS() (Figure by the Author)

This is the result we expected from our query.

The Server Timings are impressive:

Figure 7 — Server Timings for SUMMARIZECOLUMNS() (Figure by the Author)

Done, isn’t it?


No, wait a moment.

What if I try adding a simple column filter to the above query to restrict my data to one year?

It turns out that I cannot do that directly, because SUMMARIZECOLUMNS() accepts only tables as filter arguments.
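Since the filter argument has to be a table, one option is to express the single-year condition as a one-row table. The sketch below uses TREATAS() for this. It assumes that 'Date'[Year] is a numeric column in the model; this is only one possible way to phrase the filter and not the form used in the rest of this article.

```dax
EVALUATE
SUMMARIZECOLUMNS (
    'Geography'[ContinentName],
    'Geography'[RegionCountryName],
    -- A one-row table with the value 2023, mapped onto the Year column
    -- (assumes 'Date'[Year] is numeric)
    TREATAS ( { 2023 }, 'Date'[Year] ),
    "Sum_Online_Sales", [Sum Online Sales]
)
ORDER BY
    'Geography'[ContinentName],
    'Geography'[RegionCountryName]
```

Whether this form or the CALCULATETABLE() wrapper is preferable depends on how many filters are involved and on readability, so it is worth comparing both in the Server Timings.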

CALCULATETABLE() is different from the other three functions.

I can use CALCULATETABLE() in the same way as I use CALCULATE(). But I use a table as the first parameter instead of an aggregation function or another Measure. Then I can add filters as additional parameters.

So, let’s try to restrict the result from the last query to one year:

EVALUATE
CALCULATETABLE(SUMMARIZECOLUMNS(
'Geography'[ContinentName]
,'Geography'[RegionCountryName]
,"Sum_Online_Sales", [Sum Online Sales]
)
,'Date'[Year] = 2023
)
ORDER BY 'Geography'[ContinentName]
,'Geography'[RegionCountryName]

As you can see, I used the SUMMARIZECOLUMNS() function as the Input to CALCULATETABLE() and added a column filter to the query.

The result is the following:

Figure 8 — Result with CALCULATETABLE() (Figure by the Author)

And the Server Timings are highly efficient with only one SE Query:

Figure 9 — Server Timings for CALCULATETABLE() (Figure by the Author)

CALCULATETABLE() can combine the entire DAX Query into one SE Query, making it very efficient.

But don’t expect CALCULATETABLE() to always improve efficiency. Later, we will see an example where this function does not have the same effect.

Another way to generate the needed result is to combine the functions ADDCOLUMNS() and SUMMARIZE() as described in the Article published by SQLBI (See the Reference section below).

DEFINE
VAR __DS0FilterTable =
FILTER (
KEEPFILTERS ( VALUES ( 'Date'[YearIndex] ) ),
AND ( 'Date'[YearIndex] >= -3, 'Date'[YearIndex] <= 0 )
)

EVALUATE
ADDCOLUMNS(
SUMMARIZE('Geography'
,'Geography'[ContinentName]
,'Geography'[RegionCountryName]
)
,"Sum_Online_Sales", CALCULATE([Sum Online Sales]
,KEEPFILTERS(__DS0FilterTable)
)
)
ORDER BY 'Geography'[ContinentName]
,'Geography'[RegionCountryName]

Note how I add the Measure to the result: I use CALCULATE() to apply the filter table, wrapped in KEEPFILTERS(). I must do it that way, as the result will be wrong without it.

Again, please read the SQLBI Article below regarding ADDCOLUMNS() and SUMMARIZE() for the exact explanation of why this is necessary.

The values in the result are correct, but again, we see all Countries instead of only the Countries with a result:

Figure 10 — Result from combining ADDCOLUMNS with SUMMARIZE (Figure by the Author)

And when we look at the Server Timings, we see that DAX needs three SE queries to complete this query:

Figure 11 — Server Timings from combining ADDCOLUMNS() with SUMMARIZE() (Figure by the Author)

Another way to do this is to use CALCULATETABLE() to introduce the filter:

DEFINE
VAR __DS0FilterTable =
FILTER (
KEEPFILTERS ( VALUES ( 'Date'[YearIndex] ) ),
AND ( 'Date'[YearIndex] >= -3, 'Date'[YearIndex] <= 0 )
)

EVALUATE
CALCULATETABLE(
ADDCOLUMNS(
SUMMARIZE('Geography'
,'Geography'[ContinentName]
,'Geography'[RegionCountryName]
)
,"Sum_Online_Sales", [Sum Online Sales]
)
,__DS0FilterTable
)

ORDER BY 'Geography'[ContinentName]
,'Geography'[RegionCountryName]

The result is still the same, and the Server Timings have not been improved.

This shows that CALCULATETABLE() does not always improve efficiency. But it can make the query more readable than the variant with KEEPFILTERS(), whose effects I still struggle to understand fully.

SELECTCOLUMNS()/ADDCOLUMNS() is a good starting point when adding calculated columns to a table.

But I need SUMMARIZE()/SUMMARIZECOLUMNS() to summarize only selected columns and be able to add calculated columns to the result.

But SUMMARIZE() has limited capabilities when we want to add a filter to a table expression.

In this case, SUMMARIZECOLUMNS() is the correct function to use, even though I still need CALCULATETABLE() to add certain filter types to the query (e.g., a filter on a single column).

In my work, I regularly need to write queries to validate results by comparing them with the data from the source system.

And it is much easier to document a validation with a query and its result than with a screenshot of all the filters set in Power BI plus an export of the data from the visual.
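A validation query of this kind can be very small. The following sketch returns a single row with the total of the measure for one year, which can then be compared with a total from the source system. The year 2023 and the column 'Date'[Year] are taken from the CALCULATETABLE() example above; the column names of the result are chosen only for illustration.

```dax
EVALUATE
ROW (
    "Year", 2023,
    -- Total of the measure restricted to the year being validated
    "Sum_Online_Sales", CALCULATE ( [Sum Online Sales], 'Date'[Year] = 2023 )
)
```

Saving the query text together with its result documents exactly which filter was applied.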

A query is also helpful when you want to automate a report that is executed and sent to a user automatically.

There are plenty of use cases where writing a query is a better choice than building the result in Power BI.

I hope I have inspired you to explore the possibilities of DAX queries.


In case you want to learn more about measuring performance in DAX Studio, read the following article:

On Articles — SQLBI, you can find more in-depth articles on these functions and on why to use one function over another.

For example, the issues with the SUMMARIZE() function are documented here:

I use the Contoso sample dataset, like in my previous articles. You can download the ContosoRetailDW Dataset for free from Microsoft here.

The Contoso Data can be freely used under the MIT License, as described here.

I enlarged the dataset to make the DAX engine work harder.
The Online Sales table contains 71 million rows (instead of 12.6 million rows), and the Retail Sales table contains 18.5 million rows (instead of 3.4 million rows).


