Techno Blender
Digitally Yours.

Slowly Changing Dimension Type 2 with Google BigQuery | by Cristian Saavedra Desmoineaux | Nov, 2022

0 54


Step-by-Step implementation of SCD Type 2 using SQL MERGE

Source Image from unsplash.com

It is important to model data in a way that allows managing changes to have a quick answer for questions like:

  • When did the change happen?
  • Is it still valid?
  • How was it before?
  • What is the final state?

For example, if you want to track your spending for the last few years and then you begin to notice that each year you change the classification if you do the report by month, it is not going to match with the real spending by year but if you have a proper classification dimension were split the changes you can do it.

In this post, we will learn how to hack the missing OPTION in the MERGE BigQuery implementation to create a Slowly Changing Dimension Type 2, and I am going to explain in more detail the following:

  1. The Set-Up
  2. A brief introduction to SCD type 2
  3. Step-by-Step Execution
  4. Understanding the solution

And I am finally adding Conclusions, Thanks, Troubleshooting, and Helpful Resources.

If you are working with Google Query, I recommend you check my previous posts about Run BigQuery SQL using Python API Client and Connecting DBeaver to Google BigQuery because they have detailed step by step how to create and configure Service Account and how to avoid the BigQuery Web Interface.

We will use two tables, HR_INPUT, as the input that it is changing over time and EmployeeDim for Dimension Type 2:

CREATE OR REPLACE TABLE SCD.HR_INPUT (
ID STRING NOT NULL
,EMPLOYEE STRING NOT NULL
,JOB_TITLE STRING NOT NULL
,COMPANY STRING NOT NULL
,START_YEAR INTEGER NOT NULL
);
CREATE OR REPLACE TABLE SCD.EmployeeDim (
SKEY STRING NOT NULL
,ID STRING NOT NULL
,EMPLOYEE STRING NOT NULL
,JOB_TITLE STRING NOT NULL
,COMPANY STRING NOT NULL
,START_YEAR INTEGER NOT NULL
,END_YEAR INTEGER NOT NULL
,ACTIVE STRING NOT NULL
,CREATED TIMESTAMP NOT NULL
,UPDATED TIMESTAMP NOT NULL
);

In the HR Input Table, you will get the current employer, and we will simulate that we run three times in different years:

Example from the Author

The final table for Employee Dim will look like this:

Final SCD.EmployeeDim result query by the Author using Google BigQuery Web UI

Also, we are going to use the Storage Procedure SP_EmployeeDim_SCD2 to process the data using MERGE:

CREATE OR REPLACE PROCEDURE SCD.SP_EmployeeDim_SCD2()
BEGIN
MERGE SCD.EmployeeDim AS output
USING (
SELECT src.ID as PSEUDO_ID, src.*
FROM SCD.HR_INPUT AS src
UNION ALL
SELECT NULL as PSEUDO_ID, dup.*
FROM SCD.HR_INPUT AS dup
INNER JOIN SCD.EmployeeDim AS trget ON dup.ID = trget.ID
WHERE trget.END_YEAR = 9999
AND trget.START_YEAR <> dup.START_YEAR
) AS input
ON input.PSEUDO_ID = output.ID
WHEN NOT MATCHED THEN INSERT (SKEY,ID,EMPLOYEE,JOB_TITLE,COMPANY,START_YEAR
,END_YEAR,ACTIVE,CREATED,UPDATED)
VALUES ( GENERATE_UUID()
,input.ID
,input.EMPLOYEE
,input.JOB_TITLE
,input.COMPANY
,input.START_YEAR
,9999
,'Y'
,CURRENT_TIMESTAMP()
,CURRENT_TIMESTAMP()
)
WHEN MATCHED
AND output.END_YEAR = 9999
AND output.START_YEAR <> input.START_YEAR THEN
UPDATE SET ACTIVE = 'N'
,END_YEAR = input.START_YEAR
,UPDATED = CURRENT_TIMESTAMP()
;
END;

In Data Modelling, the Slowly Changing Dimensions are an essential part of implementing the tracking of the historical changes in a Dimension table.

The beauty of SCD Type 2 is that it allows us to see the data as It was when it happened and see it as currently active.

For example, suppose you have financial categorization and the spending related to each transaction. In that case, it is easy to fall into a war between departments if we don’t track the changes properly.

Using SCD Type 2, we can see the full year spent with the last version of categorization, but also we can check how it was defined month by month, which helps us to explain why an area is spending more money.

First, let’s delete all the data in both tables for a clean start:

TRUNCATE TABLE SCD.EmployeeDim;
TRUNCATE TABLE SCD.HR_INPUT;

Second, we are going to assume we were in 2008, and the data source reflects my current job as Data Warehouse Architect using the following INSERT statement:

INSERT INTO SCD.HR_INPUT VALUES ('123456','Cristian Saavedra Desmoineaux','Data Warehouse Architect','MILLICOM/TIGO',2008);SELECT * FROM SCD.HR_INPUT;
SCD.HR_INPUT result query by the Author using Google BigQuery Web UI

Then we will call the Stored Procedure SP_EmployeeDim_SCD2 to insert the first row in the Employee Dim table starting in 2008 and ending in 9999:

CALL SCD.SP_EmployeeDim_SCD2();SELECT * FROM SCD.EmployeeDim;
SCD.EmployeeDim result query by the Author using Google BigQuery Web UI

Third, we will assume we were in 2009, and the input data didn’t change it:

CALL SCD.SP_EmployeeDim_SCD2();

As you see, the EmployeeDim table didn’t change too:

SCD.EmployeeDim result query after running the Procedure for the second time by the Author using Google BigQuery Web UI

Fourth, we will assume we were in 2021, and the Data Source shows my current job when I was working in REEF Technologies, and we run the Stored procedure SP_EmployeeDim_SCD2:

TRUNCATE TABLE SCD.HR_INPUT;INSERT INTO SCD.HR_INPUT VALUES ('123456','Cristian Saavedra Desmoineaux','Senior Business Intelligence Engineer - Manager II','REEF Technology',2021);SELECT * FROM SCD.HR_INPUT;CALL SCD.SP_EmployeeDim_SCD2();SELECT * FROM SCD.EmployeeDim;
SCD.HR_INPUT result query after the second change by the Author using Google BigQuery Web UI

Now we have two records. The previous record changed to be valid from 2008 to 2021, and It is inactive:

SCD.EmployeeDim result query after running the Procedure for the third time by the Author using Google BigQuery Web UI

Fifth, now we are in 2022, and the HR Input table reflects my current role:

TRUNCATE TABLE SCD.HR_INPUT;INSERT INTO SCD.HR_INPUT VALUES ('123456','Cristian Saavedra Desmoineaux','Manager, Data & Analytics','Cardinal Health',2022);SELECT * FROM SCD.HR_INPUT;CALL SCD.SP_EmployeeDim_SCD2();SELECT * FROM SCD.EmployeeDim;

We see only one record in the HR Input table:

SCD.HR_INPUT result query after the third change by the Author using Google BigQuery Web UI

And three records in the Employee Dimension, showing my current one as Active and the rest as inactive:

SCD.EmployeeDim result query after running the Procedure for the ford time by the Author using Google BigQuery Web UI

First, let’s take a look at the SP_EmployeeDim_SCD2 Stored Procedure:

Stored Procedure created by the Author using Google BigQuery Web UI

As you can see, there is a MERGE statement that says:

Try to MERGE the Dimension with the Data Source and If the pseudo ID doesn’t match then Insert a New row else Update as inactive and add the end of the year

MERGE SCD.EmployeeDim AS output 
USING (<... subquery ...>) AS input
ON input.PSEUDO_ID = output.ID
WHEN NOT MATCHED THEN
INSERT (<... columns ...>)
VALUES (GENERATE_UUID(),<... all data ...>,9999,'Y'
,CURRENT_TIMESTAMP(),CURRENT_TIMESTAMP())
WHEN MATCHED AND output.END_YEAR = 9999
AND output.START_YEAR <> input.START_YEAR THEN
UPDATE SET ACTIVE = 'N', END_YEAR = input.START_YEAR, UPDATED = CURRENT_TIMESTAMP()

When It doesn’t match, we bring all the data from the HR Input Table, add a Surrogate Key using the GENERATE_UUID() function, add an infinity ending year, mark it as active, and set the timestamp.

WHEN NOT MATCHED THEN
INSERT(SKEY,ID,EMPLOYEE,JOB_TITLE,COMPANY,START_YEAR,END_YEAR,ACTIVE
,CREATED,UPDATED) VALUES (GENERATE_UUID(),input.ID,input.EMPLOYEE
,input.JOB_TITLE,input.COMPANY,input.START_YEAR,9999,'Y'
,CURRENT_TIMESTAMP(),CURRENT_TIMESTAMP())

When we found it, we updated the ones that were still open for the future and had different starting years, and we set them as inactive, adding an ending year and updating the timestamp.

WHEN MATCHED AND output.END_YEAR = 9999 
AND output.START_YEAR <> input.START_YEAR THEN
UPDATE SET ACTIVE = 'N', END_YEAR = input.START_YEAR
, UPDATED = CURRENT_TIMESTAMP()

Initially, I was trying to implement the Kimball Design Tip #107 when I discovered that Google BigQuery hadn’t implemented OPTION as I saw it more than 14 years ago. Then I found It is possible changing the source:

USING (
SELECT src.ID as PSEUDO_ID, src.*
FROM SCD.HR_INPUT AS src
UNION ALL
SELECT NULL as PSEUDO_ID, dup.*
FROM SCD.HR_INPUT AS dup
INNER JOIN SCD.EmployeeDim AS trget ON dup.ID = trget.ID
WHERE trget.END_YEAR = 9999 AND trget.START_YEAR <> dup.START_YEAR
) AS input

I am bringing all the data, duplicating the ID column, and calling it Pseudo ID, where new records and the values that split existing IDs by year will come true.

SELECT src.ID as PSEUDO_ID, src.*
FROM SCD.HR_INPUT AS src

And I add all the records from the Input table that will add it for the existing IDs as active, adding a NULL in the Pseudo ID to the MERGE statement categorized as the ones to Insert.

SELECT NULL as PSEUDO_ID, dup.*
FROM SCD.HR_INPUT AS dup
INNER JOIN SCD.EmployeeDim AS trget ON dup.ID = trget.ID
WHERE trget.END_YEAR = 9999 AND trget.START_YEAR <> dup.START_YEAR

Slowly Changing Dimension Type 2 allows us to see the data as It was when it happened and see it as currently active, and the MERGE Statement can simplify the creation of SCD type 2 but implement it in Google BigQuery; we need to make a change in the Source query to replace OPTION that is currently not available.

If you are working with Google Query, I recommend you check my previous post about Run BigQuery SQL using Python API Client and Connecting DBeaver to Google BigQuery because they have detailed step by step how to create and configure Service Account and how to avoid the BigQuery Web Interface.

I dedicate this post to the memory of Warren Thornthwaite (Kimball Group), a fantastic person and knowledgeable professional who, until his last days, was smiling.

His passion for Data Modelling showed me the path to growth in my professional career when we worked together.

To my friend, I am still the Data magician as you call me.

Error validating procedure body (add OPTIONS(strict_mode=false) to suppress): Query error: Billing has not been enabled for this project. Enable billing at https://console.cloud.google.com/billing. DML queries are not allowed in the free tier. Set up a billing account to remove this restriction. at [3:5]

MERGE Error message Google BigQuery Free Tier using Google BigQuery Web UI

If you are working with a Free Tier, you need to change to a billable project by going to https://console.cloud.google.com/billing and creating an account, setting up a payment method, and then going to my projects, selecting your project, and changing billing.

Remember to go back after you test and lock or disable billing again.

Console Google Cloud Billing to create an account. Screenshot by the Author using Google BigQuery Web UI


Step-by-Step implementation of SCD Type 2 using SQL MERGE

Source Image from unsplash.com

It is important to model data in a way that allows managing changes to have a quick answer for questions like:

  • When did the change happen?
  • Is it still valid?
  • How was it before?
  • What is the final state?

For example, if you want to track your spending for the last few years and then you begin to notice that each year you change the classification if you do the report by month, it is not going to match with the real spending by year but if you have a proper classification dimension were split the changes you can do it.

In this post, we will learn how to hack the missing OPTION in the MERGE BigQuery implementation to create a Slowly Changing Dimension Type 2, and I am going to explain in more detail the following:

  1. The Set-Up
  2. A brief introduction to SCD type 2
  3. Step-by-Step Execution
  4. Understanding the solution

And I am finally adding Conclusions, Thanks, Troubleshooting, and Helpful Resources.

If you are working with Google Query, I recommend you check my previous posts about Run BigQuery SQL using Python API Client and Connecting DBeaver to Google BigQuery because they have detailed step by step how to create and configure Service Account and how to avoid the BigQuery Web Interface.

We will use two tables, HR_INPUT, as the input that it is changing over time and EmployeeDim for Dimension Type 2:

CREATE OR REPLACE TABLE SCD.HR_INPUT (
ID STRING NOT NULL
,EMPLOYEE STRING NOT NULL
,JOB_TITLE STRING NOT NULL
,COMPANY STRING NOT NULL
,START_YEAR INTEGER NOT NULL
);
CREATE OR REPLACE TABLE SCD.EmployeeDim (
SKEY STRING NOT NULL
,ID STRING NOT NULL
,EMPLOYEE STRING NOT NULL
,JOB_TITLE STRING NOT NULL
,COMPANY STRING NOT NULL
,START_YEAR INTEGER NOT NULL
,END_YEAR INTEGER NOT NULL
,ACTIVE STRING NOT NULL
,CREATED TIMESTAMP NOT NULL
,UPDATED TIMESTAMP NOT NULL
);

In the HR Input Table, you will get the current employer, and we will simulate that we run three times in different years:

Example from the Author

The final table for Employee Dim will look like this:

Final SCD.EmployeeDim result query by the Author using Google BigQuery Web UI

Also, we are going to use the Storage Procedure SP_EmployeeDim_SCD2 to process the data using MERGE:

CREATE OR REPLACE PROCEDURE SCD.SP_EmployeeDim_SCD2()
BEGIN
MERGE SCD.EmployeeDim AS output
USING (
SELECT src.ID as PSEUDO_ID, src.*
FROM SCD.HR_INPUT AS src
UNION ALL
SELECT NULL as PSEUDO_ID, dup.*
FROM SCD.HR_INPUT AS dup
INNER JOIN SCD.EmployeeDim AS trget ON dup.ID = trget.ID
WHERE trget.END_YEAR = 9999
AND trget.START_YEAR <> dup.START_YEAR
) AS input
ON input.PSEUDO_ID = output.ID
WHEN NOT MATCHED THEN INSERT (SKEY,ID,EMPLOYEE,JOB_TITLE,COMPANY,START_YEAR
,END_YEAR,ACTIVE,CREATED,UPDATED)
VALUES ( GENERATE_UUID()
,input.ID
,input.EMPLOYEE
,input.JOB_TITLE
,input.COMPANY
,input.START_YEAR
,9999
,'Y'
,CURRENT_TIMESTAMP()
,CURRENT_TIMESTAMP()
)
WHEN MATCHED
AND output.END_YEAR = 9999
AND output.START_YEAR <> input.START_YEAR THEN
UPDATE SET ACTIVE = 'N'
,END_YEAR = input.START_YEAR
,UPDATED = CURRENT_TIMESTAMP()
;
END;

In Data Modelling, the Slowly Changing Dimensions are an essential part of implementing the tracking of the historical changes in a Dimension table.

The beauty of SCD Type 2 is that it allows us to see the data as It was when it happened and see it as currently active.

For example, suppose you have financial categorization and the spending related to each transaction. In that case, it is easy to fall into a war between departments if we don’t track the changes properly.

Using SCD Type 2, we can see the full year spent with the last version of categorization, but also we can check how it was defined month by month, which helps us to explain why an area is spending more money.

First, let’s delete all the data in both tables for a clean start:

TRUNCATE TABLE SCD.EmployeeDim;
TRUNCATE TABLE SCD.HR_INPUT;

Second, we are going to assume we were in 2008, and the data source reflects my current job as Data Warehouse Architect using the following INSERT statement:

INSERT INTO SCD.HR_INPUT VALUES ('123456','Cristian Saavedra Desmoineaux','Data Warehouse Architect','MILLICOM/TIGO',2008);SELECT * FROM SCD.HR_INPUT;
SCD.HR_INPUT result query by the Author using Google BigQuery Web UI

Then we will call the Stored Procedure SP_EmployeeDim_SCD2 to insert the first row in the Employee Dim table starting in 2008 and ending in 9999:

CALL SCD.SP_EmployeeDim_SCD2();SELECT * FROM SCD.EmployeeDim;
SCD.EmployeeDim result query by the Author using Google BigQuery Web UI

Third, we will assume we were in 2009, and the input data didn’t change it:

CALL SCD.SP_EmployeeDim_SCD2();

As you see, the EmployeeDim table didn’t change too:

SCD.EmployeeDim result query after running the Procedure for the second time by the Author using Google BigQuery Web UI

Fourth, we will assume we were in 2021, and the Data Source shows my current job when I was working in REEF Technologies, and we run the Stored procedure SP_EmployeeDim_SCD2:

TRUNCATE TABLE SCD.HR_INPUT;INSERT INTO SCD.HR_INPUT VALUES ('123456','Cristian Saavedra Desmoineaux','Senior Business Intelligence Engineer - Manager II','REEF Technology',2021);SELECT * FROM SCD.HR_INPUT;CALL SCD.SP_EmployeeDim_SCD2();SELECT * FROM SCD.EmployeeDim;
SCD.HR_INPUT result query after the second change by the Author using Google BigQuery Web UI

Now we have two records. The previous record changed to be valid from 2008 to 2021, and It is inactive:

SCD.EmployeeDim result query after running the Procedure for the third time by the Author using Google BigQuery Web UI

Fifth, now we are in 2022, and the HR Input table reflects my current role:

TRUNCATE TABLE SCD.HR_INPUT;INSERT INTO SCD.HR_INPUT VALUES ('123456','Cristian Saavedra Desmoineaux','Manager, Data & Analytics','Cardinal Health',2022);SELECT * FROM SCD.HR_INPUT;CALL SCD.SP_EmployeeDim_SCD2();SELECT * FROM SCD.EmployeeDim;

We see only one record in the HR Input table:

SCD.HR_INPUT result query after the third change by the Author using Google BigQuery Web UI

And three records in the Employee Dimension, showing my current one as Active and the rest as inactive:

SCD.EmployeeDim result query after running the Procedure for the ford time by the Author using Google BigQuery Web UI

First, let’s take a look at the SP_EmployeeDim_SCD2 Stored Procedure:

Stored Procedure created by the Author using Google BigQuery Web UI

As you can see, there is a MERGE statement that says:

Try to MERGE the Dimension with the Data Source and If the pseudo ID doesn’t match then Insert a New row else Update as inactive and add the end of the year

MERGE SCD.EmployeeDim AS output 
USING (<... subquery ...>) AS input
ON input.PSEUDO_ID = output.ID
WHEN NOT MATCHED THEN
INSERT (<... columns ...>)
VALUES (GENERATE_UUID(),<... all data ...>,9999,'Y'
,CURRENT_TIMESTAMP(),CURRENT_TIMESTAMP())
WHEN MATCHED AND output.END_YEAR = 9999
AND output.START_YEAR <> input.START_YEAR THEN
UPDATE SET ACTIVE = 'N', END_YEAR = input.START_YEAR, UPDATED = CURRENT_TIMESTAMP()

When It doesn’t match, we bring all the data from the HR Input Table, add a Surrogate Key using the GENERATE_UUID() function, add an infinity ending year, mark it as active, and set the timestamp.

WHEN NOT MATCHED THEN
INSERT(SKEY,ID,EMPLOYEE,JOB_TITLE,COMPANY,START_YEAR,END_YEAR,ACTIVE
,CREATED,UPDATED) VALUES (GENERATE_UUID(),input.ID,input.EMPLOYEE
,input.JOB_TITLE,input.COMPANY,input.START_YEAR,9999,'Y'
,CURRENT_TIMESTAMP(),CURRENT_TIMESTAMP())

When we found it, we updated the ones that were still open for the future and had different starting years, and we set them as inactive, adding an ending year and updating the timestamp.

WHEN MATCHED AND output.END_YEAR = 9999 
AND output.START_YEAR <> input.START_YEAR THEN
UPDATE SET ACTIVE = 'N', END_YEAR = input.START_YEAR
, UPDATED = CURRENT_TIMESTAMP()

Initially, I was trying to implement the Kimball Design Tip #107 when I discovered that Google BigQuery hadn’t implemented OPTION as I saw it more than 14 years ago. Then I found It is possible changing the source:

USING (
SELECT src.ID as PSEUDO_ID, src.*
FROM SCD.HR_INPUT AS src
UNION ALL
SELECT NULL as PSEUDO_ID, dup.*
FROM SCD.HR_INPUT AS dup
INNER JOIN SCD.EmployeeDim AS trget ON dup.ID = trget.ID
WHERE trget.END_YEAR = 9999 AND trget.START_YEAR <> dup.START_YEAR
) AS input

I am bringing all the data, duplicating the ID column, and calling it Pseudo ID, where new records and the values that split existing IDs by year will come true.

SELECT src.ID as PSEUDO_ID, src.*
FROM SCD.HR_INPUT AS src

And I add all the records from the Input table that will add it for the existing IDs as active, adding a NULL in the Pseudo ID to the MERGE statement categorized as the ones to Insert.

SELECT NULL as PSEUDO_ID, dup.*
FROM SCD.HR_INPUT AS dup
INNER JOIN SCD.EmployeeDim AS trget ON dup.ID = trget.ID
WHERE trget.END_YEAR = 9999 AND trget.START_YEAR <> dup.START_YEAR

Slowly Changing Dimension Type 2 allows us to see the data as It was when it happened and see it as currently active, and the MERGE Statement can simplify the creation of SCD type 2 but implement it in Google BigQuery; we need to make a change in the Source query to replace OPTION that is currently not available.

If you are working with Google Query, I recommend you check my previous post about Run BigQuery SQL using Python API Client and Connecting DBeaver to Google BigQuery because they have detailed step by step how to create and configure Service Account and how to avoid the BigQuery Web Interface.

I dedicate this post to the memory of Warren Thornthwaite (Kimball Group), a fantastic person and knowledgeable professional who, until his last days, was smiling.

His passion for Data Modelling showed me the path to growth in my professional career when we worked together.

To my friend, I am still the Data magician as you call me.

Error validating procedure body (add OPTIONS(strict_mode=false) to suppress): Query error: Billing has not been enabled for this project. Enable billing at https://console.cloud.google.com/billing. DML queries are not allowed in the free tier. Set up a billing account to remove this restriction. at [3:5]

MERGE Error message Google BigQuery Free Tier using Google BigQuery Web UI

If you are working with a Free Tier, you need to change to a billable project by going to https://console.cloud.google.com/billing and creating an account, setting up a payment method, and then going to my projects, selecting your project, and changing billing.

Remember to go back after you test and lock or disable billing again.

Console Google Cloud Billing to create an account. Screenshot by the Author using Google BigQuery Web UI

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.
Leave a comment