How to do Unit Testing in dbt


Solve the challenge of testing data pipelines


In recent years, we’ve seen tremendous growth in the technology and tools for building a data stack. It’s now easier than ever to build a modern data pipeline without massive investment. A standard modern data stack includes but is not limited to:

  • Cloud-first data warehouse — a highly scalable and distributed cloud data warehouse that allows terabytes of data to be transformed using SQL. BigQuery, Snowflake, and Amazon Redshift are currently leading the market.
  • Workflow orchestration — a platform to programmatically schedule and monitor the data transformation. A data warehouse is only a central repository of integrated data from disparate sources; we need platforms like Airflow and dbt to transform the raw data into clean data that makes sense for the business. Unlike data warehouses, many orchestration tools are open source and their infrastructure is maintained by the data team, so engineers have developed many best practices around them, such as building CI/CD pipelines, testing, monitoring, and alerting. This article is one of those and focuses on testing in dbt.
  • Business Intelligence — a tool to allow everyone to explore the data and self-serve their data requests by creating dashboards. Example tools are Tableau and Looker. Employees should be able to make use of the tool for their decision-making.

As the title says, in this article we will be answering one question: how do we perform unit testing in dbt? Before jumping into the solution, let's get a bit of context on dbt, its testing strategy, and why unit testing is a challenge in dbt.

dbt

dbt is a SQL-based data transformation tool that enables data analysts and engineers to transform, test, and document data in the cloud data warehouse. Engineers create reusable data models using their domain expertise, and anyone in the company can use these models to support their decision-making.

Data test in dbt

To ensure the correctness of the data models, we need to apply different types of testing. A must-have test in dbt is the data test. It is an integrated step in the pipeline that ensures data quality. Engineers often make assumptions about the pipeline: the source data is valid, the model is sound, and the transformed data is accurate. But that's usually not the reality, so a series of tests should be in place to ensure data quality. Because the source data changes every day, these tests should run at runtime as part of the production pipeline. Depending on the severity, a failing test might block the pipeline so that it doesn't pollute downstream models. Here are a few common data quality checks, with a sketch of how they map to dbt tests after the list:

  • Null values — check columns that shouldn’t have an empty value.
  • Uniqueness check — the combination of a few columns should be unique to avoid duplication.
  • Freshness check — does the model contain the most recent records?
  • Relationship check — is all the required data from the source table transformed into the destination table?
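
As an illustration, here is a minimal sketch of how these checks might map to dbt's built-in and dbt-utils generic tests; the source, table, and column names (e.g. finance, dim_city) are made up for the example:

# schema.yml (illustrative sketch; names are assumptions)
version: 2

sources:
  - name: finance
    tables:
      - name: revenue
        loaded_at_field: updated_at
        freshness:                                   # freshness check
          error_after: {count: 24, period: hour}
        columns:
          - name: amount_net_booking
            tests:
              - not_null                             # null values
          - name: city
            tests:
              - relationships:                       # relationship check
                  to: ref('dim_city')
                  field: city

models:
  - name: revenue
    tests:
      - dbt_utils.unique_combination_of_columns:     # uniqueness check
          combination_of_columns: [date, city]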

Packages like dbt-utils provide many such generic tests out of the box, covering most needs. To better understand the role of the data test, let's visualize it:

Diagram: the role of the data test in the pipeline (created by Xiaoxu Gao)

The horizontal path is the flow of data through the data pipeline. Data in each environment can be different, so data quality tests should run in the production flow to ensure that the assumptions about the source data always hold.

Is it enough?

Unit test in dbt

Data tests can discover data quality issues, but they can't guarantee the correctness of the SQL code. Here is an example of SQL that joins the revenue table with the vat table to calculate the net booking and VAT amounts. From a data test alone, it is difficult to know whether the formula or the join logic is correct.

SELECT
    date
    , city
    , SUM(amount_net_booking) AS amount_net_booking
    -- 1 - 1/(1 + vat_rate) extracts the VAT share of a VAT-inclusive amount
    -- (e.g. 121.0 at a 21% rate yields 21.0 of VAT)
    , SUM(amount_net_booking * (1 - 1 / (1 + vat_rate))) AS amount_vat
FROM revenue
LEFT JOIN vat USING (city)
GROUP BY 1, 2

What we are missing here is the ability to test SQL logic (aka unit testing), just like we test a regular application. If I were building an application, I would apply TDD (Test-Driven Development), a common software engineering practice that helps us reach a good level of confidence in the correctness of the code. It makes sure that the logic is correct before it's promoted to production.

Although dbt is a powerful framework, we still write a lot of SQL code from scratch to build models. As long as it is written by humans, it can go wrong; it is no different from writing a Python or Java program. We can think of a dbt model as a function, where the inputs are source tables and the output is the result of its SQL. A unit test in dbt is a test where we provide mock inputs and then check the results against an expected model. A big advantage of unit testing is that we can provide mock inputs for all kinds of test cases, including edge cases. It is especially useful for SQL code involving complicated business logic. If it's a chain of models, we can provide mock inputs to the first model and assert the results on the final model and the intermediate models along the way; this chain of testing is called an integration test.

Unlike data tests, it isn't necessary to integrate unit tests into the data pipeline because the code won't change after being promoted to a higher environment. We don't need to make any assumptions about the logic because we have full control over it. As long as the logic is verified in the lower environment, we should trust it and release the code.

Data test in data pipeline vs. unit test in application

Diagram: data test vs. unit test (created by Xiaoxu Gao)

Although the data test in a data pipeline and the unit test in a regular application are two different concepts, they do share common ground. For example, when building a REST API, the application receives data from external sources, which can be unreliable. It must validate the data (i.e. perform data quality checks) before processing it. Although most software engineers don't consider it a standard test, this type of runtime validation is very similar to the data test in a data pipeline.

In other cases, a test can be considered both a data test and a unit test, such as a relationship check (e.g. the sum of fields a and b equals c). As an engineer, it's important to understand the goal of your test: does it test data, which is dynamic, or logic, which is static? If it's about logic, it should be a unit test that executes only once, in the CI/CD pipeline. If it's about data, it should be part of the production pipeline because the data varies. A few examples of data tests:

  • Is there any NULL value in the source data?
  • Does the data contain the most recent dates?
  • Is every record from the source being transformed into the destination?
  • Is there any value beyond the accepted range?
  • Is there any duplication in the data?

A few examples of unit test:

  • The model calculates revenue for the company: is the calculation correct?
  • The model contains a few IF conditions to calculate a field: am I missing any edge case?
  • The model aggregates a few columns: am I missing any column?

A mature data team leverages both automated data quality checks and automated logic tests, ensuring the data delivers the highest possible value to the company.

Implementation of Unit test in dbt

To demonstrate how unit testing works in dbt, I created the following data pipeline, which reads two source tables, transaction and vat, and transforms the data into the final model revenue. As a data engineer, I don't necessarily know every detail of the logic in the initial phase, so I created this model based on a set of assumptions.

Lineage graph (Created by Xiaoxu Gao)

As it's a critical and complicated model, I want to test it with a couple of pre-defined scenarios, and I create mock data for that. The mock data consists of CSV files in my dbt project, where each CSV file represents one source table. We can store the CSV files in the data folder and use the dbt seed command to load them into the data warehouse.

folder structure (Created by Xiaoxu Gao)
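
Roughly, such a layout could look like this (the file names are illustrative, not the author's exact project):

data/
  unit_testing/
    revenue/
      transaction.csv         # mock rows for the transaction source
      vat.csv                 # mock rows for the vat source
      revenue_expected.csv    # expected output of the revenue model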

A few words on who should create the mock data and how. To ensure the completeness of the test cases, the people who write the code shouldn't be the ones who also provide the test data. In many companies, analysts or business stakeholders are the ones who provide the test cases or test data.

As a common practice, the mock data should never be mixed up with the real source tables. Preferably, it should be stored in a separate schema (e.g. unit_testing). We can define this in the dbt_project.yml file: every file in the unit_testing/revenue folder will be loaded into the unit_testing schema. Tags also play a critical role when selecting models. When running dbt build -s +tag:unit_testing, dbt will run all the seeds/models/tests/snapshots tagged with unit_testing, plus their upstream dependencies.

seeds:
  unit_testing:
    revenue:
      schema: unit_testing
      +tags:
        - unit_testing

For demo purposes, my dbt model is simpler than it would be in reality.

revenue model (Created by Xiaoxu Gao)

Are you seeing a problem here?

The problem is the source table. As always, I use {{ source() }} to refer to the actual source table. But my mock data sits in a separate schema and, because the mock tables are seeds, they should be referenced with the ref function. How am I going to switch between the two? To solve this problem, I created a custom macro that switches the source table based on the environment or some kind of trigger.

select_table macro (Created by Xiaoxu Gao)

This macro has two input parameters: source_table, the table that holds the real data, and test_table, the table that holds the mock data. The macro returns one of the two based on a variable passed to the dbt command. If the command doesn't provide the unit_testing variable, or its value is false, the macro returns source_table; otherwise it returns test_table. You can adapt this macro to your situation, for example by switching the table based on the target instead of a variable.
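
A minimal sketch of such a macro, assuming the variable is called unit_testing and defaults to false:

-- macros/select_table.sql (a minimal sketch, not the author's exact macro)
{% macro select_table(source_table, test_table) %}
    {% if var('unit_testing', false) %}
        {{ return(test_table) }}
    {% else %}
        {{ return(source_table) }}
    {% endif %}
{% endmacro %}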

The dbt command to run the model and test using mock data:

dbt build -s +tag:unit_testing --vars 'unit_testing: true'

The dbt command to run the model using real data (I will explain later why we need the --exclude flag here):

dbt build -s +tag:revenue --exclude tag:unit_testing

The updated model now looks like this. We use {{ import_* }} variables to reference the correct source tables.

Updated revenue model (Created by Xiaoxu Gao)
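
As a rough sketch of how the macro could be wired in (the source, seed, and column names here are my assumptions, not the author's exact code):

-- models/revenue.sql (a sketch; source, seed, and column names are assumptions)
{% set import_transaction = select_table(source('finance', 'transaction'), ref('transaction')) %}
{% set import_vat = select_table(source('finance', 'vat'), ref('vat')) %}

SELECT
    date
    , city
    , SUM(amount_net_booking) AS amount_net_booking
    , SUM(amount_net_booking * (1 - 1 / (1 + vat_rate))) AS amount_vat
FROM {{ import_transaction }}
LEFT JOIN {{ import_vat }} USING (city)
GROUP BY 1, 2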

So far, we've seen how to select the correct source table and run the model on top of the mock inputs. Next, we compare the actual output with the expected mock output. We can use the equality test from dbt-utils to compare two models; in compare_model, we refer to the expected output model.

It's worth noting that we shouldn't forget tags: ['unit_testing'] on the test. As we said previously, unit tests shouldn't run as part of the production pipeline. Running this equality test in production would mean comparing the production model with the mock output, which will never match. Adding the unit_testing tag and running dbt build -s +tag:revenue --exclude tag:unit_testing makes sure the test is skipped in production.
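
A sketch of what the test definition might look like, assuming the expected output is loaded as a seed named revenue_expected:

# models/revenue/schema.yml (a sketch; the seed name is an assumption)
version: 2

models:
  - name: revenue
    tests:
      - dbt_utils.equality:
          compare_model: ref('revenue_expected')
          tags: ['unit_testing']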

Wow, I hope you are still following me 🙂 Let's check out different ways to compare models. The dbt_utils.equality test does an exact match of the two models, but this gets tricky when comparing numeric values because of precision issues. A way around this is to create a custom macro that first rounds the numeric columns and then makes the comparison.

Custom macro to compare models (Created by Xiaoxu Gao)
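
As a rough sketch of that idea (the test name, arguments, and EXCEPT-based diff are my assumptions and may need adapting to your warehouse, e.g. EXCEPT DISTINCT on BigQuery):

-- tests/generic/equality_rounded.sql (a sketch, not the author's macro)
{% test equality_rounded(model, compare_model, group_by_columns, numeric_columns, precision=2) %}

{# Build one select list that keeps the key columns and rounds the numeric ones. #}
{% set select_list %}
    {{ group_by_columns | join(', ') }}
    {% for col in numeric_columns %}
    , round({{ col }}, {{ precision }}) as {{ col }}
    {% endfor %}
{% endset %}

with actual as (
    select {{ select_list }} from {{ model }}
),
expected as (
    select {{ select_list }} from {{ compare_model }}
),
diff as (
    (select * from actual except select * from expected)
    union all
    (select * from expected except select * from actual)
)

-- the test fails when any row still differs after rounding
select * from diff

{% endtest %}

In schema.yml, you would then reference equality_rounded instead of dbt_utils.equality and pass the column lists as arguments.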

It’s up to you to go creative with this part as long as it helps you spot the difference.

Are we fully safe?

Unfortunately, no. Unit testing gives us the ability to test logic with custom scenarios. But the question is: do these scenarios cover everything? Probably not. There are two issues:

Hard to provide a perfect data set that covers all the permutations — unknown knowns

The source data is often highly varied, and it's incredibly time-consuming to figure out all the possible combinations and create test data for each of them. One strategy is to create a separate layer upstream that filters out invalid data and performs a few sanity checks. It limits the number of permutations, so we are more confident about what's going to arrive and can get away with a relatively small test dataset.
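
For illustration, a hedged sketch of what such an upstream layer could look like (the filters and names are assumptions):

-- models/staging/stg_transaction.sql (a sketch; filters and names are assumptions)
SELECT
    date
    , city
    , amount_net_booking
FROM {{ source('finance', 'transaction') }}
WHERE amount_net_booking IS NOT NULL    -- drop rows that would break the downstream math
  AND date <= CURRENT_DATE              -- sanity check: no future-dated records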

Oracle Problem — (un)known unknowns

In my opinion, it’s a philosophical problem. Every test is a comparison of the actual output and the expected output.

The oracle problem refers to the situation where it's challenging to determine the expected output. Put simply, humans are never fully equipped to test their programs because reality is always more complicated than its representation in a test suite.

As mentioned previously, source data is complicated. There are many scenarios that we don't yet know how to handle. A simple strategy is to apply one piece of logic to all the unknown cases and take that as the answer; think of the role of else in an if condition.

If that's still not enough, then we just have to live with it. Humans acquire new knowledge every day, and so do our programs. As we find more issues, more unknowns will be transformed into knowns, so we will capture more and more and understand the world a little better every day.

Conclusion

I hope you find this article useful. If you are convinced and start applying unit tests in your data pipeline, that's perfect! If you are not immediately convinced but still want a takeaway from this article, let it be an understanding of the roles of the unit test and the data test in a data pipeline, because you will need it one day. If you have any thoughts about testing in data pipelines, please let us know in the comments. Cheers!

