
BigQuery Optimization Strategies 3: Table Flattening | by Martin Weitzmann | Feb, 2023



Arrays are one of the coolest features an analytics database can offer, because they let you store additional information right at the place and time it happened. Let’s explore some basic examples and then have a look at arrays in Google Analytics 4.

Photo by Torsten Dederichs on Unsplash

For storing sales history, for example, we can store the products bought in an array together with the purchase event instead of in a separate table — saving us all the SQL join hassle later in the analyses.

And while arrays are not super intuitive, I would argue that SQL joins are worse (mostly because people use the wrong mental model to explain them).

In terms of query optimization, there is a simple rule for dealing with arrays in SQL:

  • If you can aggregate the array — do it!

Let me explain — when you need information from an array you have two choices:

  • aggregate the array / reduce it to one value
  • lateral join the array to the table / “flatten” the table

If you need only one value, then aggregate — if you really need multiple values then filter, filter and filter again, then do the expensive lateral join.

By the way — if you want a more elaborate refresher on querying nested data, please have a read here:

Since we want to focus on optimization, here is only a short version:

This means reducing/imploding your array to one value, e.g.

  • select one value from the array: my_array[OFFSET(2)]
  • use a function that aggregates: ARRAY_LENGTH(my_array)
  • use a sub-query for custom aggregation (select sum(x) from unnest([3, 6, 7, 2]) as x)
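To make these three options concrete, here is a minimal sketch that runs all of them against one literal array (the column aliases are made up for illustration):

```sql
SELECT
my_array[OFFSET(2)] AS third_value,                        -- single element (0-based index)
ARRAY_LENGTH(my_array) AS qty_elements,                    -- built-in array function
(select sum(x) from unnest(my_array) as x) AS sum_elements -- custom sub-query aggregation
FROM (SELECT [3, 6, 7, 2] AS my_array)
```

All three expressions collapse the array to a single value per row: no join, no extra rows.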

We’ll look at more examples later!

These joins are lateral because they combine a table with something that changes and is unique for every row. In our case it’s an array — it usually contains different values in every row.

Unnesting an array just means taking every value in the array and adding it to a copy of the parent row. So we go from two rows …

  1. [3, 6, 8]
  2. [2, 7]

to 5 rows …

  1. 3, [3, 6, 8]
  2. 6, [3, 6, 8]
  3. 8, [3, 6, 8]
  4. 2, [2, 7]
  5. 7, [2, 7]

We got the elements out of their array, but we also got from 2 rows to 5, a 150% increase!
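This explosion is easy to reproduce in BigQuery; here is a minimal sketch with the two rows from above:

```sql
WITH numbers AS (
SELECT [3, 6, 8] AS arr
UNION ALL
SELECT [2, 7] AS arr
)

SELECT
x, arr
FROM
numbers cross join unnest(arr) AS x
```

Running it returns the five rows listed above, each element paired with a copy of its parent array.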


Now imagine what happens in a table with more than two rows — it explodes in size and someone has to pay for this intense calculation!

We should really try to avoid these kinds of lateral joins if we can! And that is what you need to check in your use-case: “Can I avoid a lateral join with an array?” Or, maybe better: “Can I avoid a table explosion?”

Comparing an avoidable lateral cross join implementation with its array-aggregation counterpart, the difference is simply that we push pre-aggregation down to the array level while dropping the join operation entirely. This saves resources both in terms of computation and of cache needed (which might spill to disk and further slow down query execution).

Let’s start with a table with two food items containing the name and ingredients:

  1. Pasta al Pomodoro, [pasta, tomatoes, olive oil, garlic, basil, salt]
  2. Kimchi, [Napa cabbage, Korean pepper flakes, daikon radish, garlic, ginger, salt]

Here are our prompts:

  • Find the number of daikon radishes in all foods!
  • Find all ingredients that contain both letters ‘a’ and ‘i’
  • Find the number of ingredients that contain both letters ‘a’ and ‘i’

All three cases can be solved by just flattening the table and then doing the grouping and/or counting. But let’s actually have a closer look — here is the table as a CTE:

WITH foods AS (
SELECT 'Pasta al Pomodoro' as name, ['pasta', 'tomatoes', 'olive oil', 'garlic', 'basil', 'salt'] as ingredients
UNION ALL
SELECT 'Kimchi' as name, ['Napa cabbage', 'Korean pepper flakes', 'daikon radish', 'garlic', 'ginger', 'salt'] as ingredients
)

SELECT
*
FROM
foods

Yummy Query Result (Screenshot by author)

Just paste the query into your console and follow along.

Number of daikons in all foods

Can we aggregate?

Sure — in theory we just need one piece of information from the array: how many daikons does it contain? Then we just sum up that result over all rows.

Let’s start with adding the information to each row — I put the comment numbering to show the order of execution:

SELECT 
*,
( --3. aggregate with count and feed to parent row using 'select'
select count(1)
--1. turn array into table format using unnest()
from unnest(ingredients) as ingr
--2. filter for right conditions
where ingr like '%daikon%'

) AS qty_daikon_ingredients
FROM
foods

Amazingly Aggregated Arrays! (Screenshot by Author)

Now that we have this information in the 3rd column, we can just sum it up in the parent query! So we basically wrap the sub-query in a SUM() and remove the wildcard *.

SELECT 
SUM( --4. aggregate sub-query results over the whole table
( select count(1) --3. aggregate with count and feed to parent row
from unnest(ingredients) as ingr --1. turn array into table format using unnest
where ingr like '%daikon%' --2. filter for right conditions
)
) AS qty_daikon_ingredients
FROM
foods

… and we end up with the result 1. Try it with garlic and other ingredients!

And yes, the most confusing thing about SQL is that the order in which you read a query doesn’t match the order of execution at all. But at least it’s consistent, and you know exactly what a query’s output looks like.

Find all ingredients that contain both letters ‘a’ and ‘i’

Can we aggregate? Hm, difficult, because there might be more than one matching ingredient per array, and we might want to group by all ingredients containing ‘a’ and ‘i’ — so we need to “flatten” the table first …

SELECT 
*
FROM
foods cross join unnest(ingredients) as ingr
-- or if you want the same FROM even shorter:
-- FROM foods, foods.ingredients as ingr
-- comma = cross join; <table>.<array> = contextually implied unnest operation

Nice and short query … but the table just grew from 2 to 12 entries!

Table Explosion Action a.k.a. Lateral Cross Join (Screenshot by author)

Luckily, BigQuery is a columnar database, and we don’t need the array anymore. We can also filter for the actual condition, and so we end up with:

SELECT 
name, ingr -- 3. smooth output
FROM
foods cross join unnest(ingredients) as ingr -- 1. nasty, nasty
WHERE
ingr like '%a%' and ingr like '%i%' -- 2. filter, filter!
Nice output — the nasty cross join is hidden in the back-end (Screenshot by author)

Now we can use this output to group and summarize and get more information — but we couldn’t have done it without the cross join! Note, though, that we changed the meaning of the table:

  • Before: every row represents a food
  • After: every row represents an ingredient

If you do “flatten” your table, be aware of how the meaning of your table changed!
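As an example of such a follow-up grouping (a sketch reusing the foods CTE from above), we can count in how many foods each matching ingredient appears:

```sql
SELECT
ingr,
count(1) AS qty_foods
FROM
foods cross join unnest(ingredients) as ingr
WHERE
ingr like '%a%' and ingr like '%i%'
GROUP BY ingr
ORDER BY qty_foods DESC
```

Since every row now represents an ingredient, the GROUP BY counts foods per ingredient, e.g. garlic appears in both dishes.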

Find the number of ingredients that contain both letters ‘a’ and ‘i’

Now this is awfully similar to the one we just had, no? You guessed it, we don’t need the cross join for this one anymore! Because we can aggregate the array, right? We just do a pre-count inside the array and sum up the results!

SELECT 
name,
(select count(1)
from unnest(ingredients) as ingr
where ingr like '%a%'
and ingr like '%i%'
) as qty_ingredients_a_i
FROM
foods
Nice output — no nasty cross join needed! (Screenshot by author)

Of course, you can again wrap a SUM() around the sub-query and get a table aggregate. But let’s look at some real-world scenarios.

With the GA4 export schema we’re confronted with here, we get two kinds of arrays:

  • Store custom variables and their value (non-hierarchical): event_params, user_properties
  • Provide hierarchical information: items

Custom variables are semantically on the same level as the parent row; we just can’t introduce new columns, so we store them in an array. Items are semantically under the event: in the view_item event, for example, we want to know the concrete items that were viewed by a user.

For the custom variables you will probably almost always use a sub-query because you only want one variable at a time. It’s more of a challenge with the second case: items. Let’s have a look at this exploratory query:

Let’s find events that contain items so we understand better what data is stored in the wild …

SELECT 
event_name,
-- classic example to get one variable from a set of custom variables
( select value.string_value
from unnest(event_params)
where key='page_title'
) as page_title,
items -- just show the whole plain array
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`
WHERE
ARRAY_LENGTH(items)>0 -- simple but effective aggregation
LIMIT 1000
(Screenshot by author)

Let’s assume we want to count the number of viewed items that had the color ‘Charcoal’ in their name.

SELECT 
-- find parameter with name page_title and return its string value
(select value.string_value from unnest(event_params) where key='page_title') as page_title,

-- find count of products with 'Charcoal' in their name
(select count(1) from unnest(items) where item_name like '%Charcoal%') as charcoal_items,

items -- just to check if we're doing it right

FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`
WHERE
ARRAY_LENGTH(items)>0
AND event_name='view_item' -- put the event name as a condition
LIMIT 1000

Sub-query count and visual count seem to match (screenshot by author)

Looks good! Let’s wrap it up into a sum of the count and group by page_title!

-- data preparation CTE
WITH prep AS (
SELECT
-- find parameter with name page_title and return its string value
(select value.string_value from unnest(event_params) where key='page_title') as page_title,

-- find count of products with 'Charcoal' in their name
(select count(1) from unnest(items) where item_name like '%Charcoal%') as charcoal_items,

-- to build a share we need a count of all items
ARRAY_LENGTH(items) as all_items

FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`
WHERE
ARRAY_LENGTH(items)>0
AND event_name='view_item'
)

SELECT
page_title,
-- build the share
ROUND(
SUM(charcoal_items) / SUM(all_items) * 100
, 2) AS share_viewed_charcoal_items
FROM prep
GROUP BY 1
ORDER BY 2 DESC

Charcoal for the Kids! (screenshot by author)

The kids section provides the most Charcoal item views … that must be important information for someone!

Anyway, I think we will be happy to have used only sub-queries instead of cross joins here, once we scale this analysis up to a year-on-year comparison for the last 6 months globally, while we continue to add more really important metrics that management urgently needs to make very important decisions.

I’ll leave it to you to develop good queries for lateral cross joins in GA4, but the only use-case I can think of is some item dimension that you need to group by — like item_category.
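Such a grouping could look like this; a sketch against the public GA4 sample table, assuming the standard item_category field inside items:

```sql
SELECT
item.item_category,
count(1) AS qty_item_views
FROM
`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131` cross join unnest(items) as item
WHERE
event_name = 'view_item'
GROUP BY 1
ORDER BY 2 DESC
```

Here the lateral join is unavoidable, because we group by a dimension that only exists inside the array.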


