
What is dbt (data build tool) and When should you use it? | by Khuyen Tran | Apr, 2023



Discover the Hidden Benefits and Drawbacks of dbt

Image by Author

If your organization is looking to create a data-driven product, you should consider having efficient data pipelines to:

  1. Stay competitive: Quick access to data and its analysis through efficient data pipelines accelerates decision-making and keeps you ahead of your competition.
  2. Reduce costs: Efficient data pipelines cut the time and effort needed to collect and transform data, reducing costs and freeing employees to focus on higher-level tasks that require human intelligence.
Image by Author

One tool that has gained popularity in recent years for managing data pipelines is dbt (data build tool).

While dbt can provide significant benefits for managing and modeling data, it may not be the best tool for every situation. In this article, we’ll explore the use cases of dbt to determine whether dbt is a good fit for your organization.

Feel free to play with and fork the source code of this article here:

dbt is an open-source tool for transforming data in a data warehouse. With dbt, users can write SQL queries that transform data and create repeatable workflows that can be easily tested and automated.

Image by Author

You should consider using dbt when:

  1. You have a data warehouse: dbt is an effective tool for organizing, transforming, and testing data in a data warehouse environment.
  2. Your data pipeline is complex: dbt’s modular approach can help break down complex pipelines into smaller pieces, making it easier to debug, test, and make changes to your pipeline.
  3. Your data changes frequently: dbt’s snapshot feature allows you to track changes in data over time.
  4. Transparency is essential to your organization: dbt can automatically generate documentation for your pipeline. It also provides insights into the data transformation process through data lineage.
  5. Data quality is essential to your organization: dbt can test your models to help you catch errors early and ensure that your data is accurate.

Before diving into some helpful features of dbt, we need to first set up the dbt environment. This will allow you to easily follow along with the discussion.

Install dbt

This article uses BigQuery as the data warehouse for dbt. To install dbt’s BigQuery adapter, type:

pip install dbt-bigquery

Follow these instructions if you want to use another dbt adapter.

Create a project

To initialize a dbt project called dbt_demo, type the following:

dbt init dbt_demo

The following is the project directory.

Image by Author
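For reference, a freshly initialized project typically looks something like the tree below; the exact files vary slightly between dbt versions:

dbt_demo/
├── dbt_project.yml
├── README.md
├── analyses/
├── macros/
├── models/
│   └── example/
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── seeds/
├── snapshots/
└── tests/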

dbt’s modular approach allows you to break down a complex pipeline into smaller pieces.

To demonstrate this, we will use the austin_bikeshare dataset from BigQuery public datasets.

Create a new SQL file in the models directory, named models/trips_per_property.sql with the following code:

-- models/trips_per_property.sql

WITH stations AS (
    SELECT
        station_id,
        council_district,
        CASE
            WHEN property_type IN ('parkland', 'sidewalk', 'nonmetered_parking') THEN 'free_parking'
            ELSE property_type
        END AS property_type
    FROM
        `bigquery-public-data.austin_bikeshare.bikeshare_stations`
    WHERE
        property_type IN (
            'parkland',
            'sidewalk',
            'nonmetered_parking',
            'paid_parking'
        )
),

trips AS (
    SELECT
        start_station_id
    FROM
        `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    WHERE
        start_station_id IS NOT NULL
)

SELECT
    stations.property_type,
    COUNT(*) AS trips
FROM
    trips
    JOIN stations ON trips.start_station_id = stations.station_id
GROUP BY
    stations.property_type
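To build this model as a table or view in your warehouse, run dbt run. To build only this model, you can select it by name (the --select flag is available in recent dbt versions):

dbt run --select trips_per_property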

Since this model contains multiple data transformations, it can be difficult to maintain and test.

We could separate the model into different models by doing the following:

  • Create two files stg_stations.sql and stg_trips.sql under the models directory.
  • Replace the existing stations and trips CTEs in trips_per_property.sql with references to the new models (a sketch of the refactored model appears below).
Image by Author
  • In the stg_trips.sql file, write the SQL code from the trips CTE in the original model.
-- models/stg_trips.sql

SELECT start_station_id
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_station_id IS NOT NULL

  • In the stg_stations.sql file, write the SQL code from the stations CTE in the original model.
-- models/stg_stations.sql

SELECT
    station_id,
    council_district,
    CASE
        WHEN property_type IN ('parkland', 'sidewalk', 'nonmetered_parking') THEN 'free_parking'
        ELSE property_type
    END AS property_type
FROM
    `bigquery-public-data.austin_bikeshare.bikeshare_stations`
WHERE
    property_type IN ('parkland', 'sidewalk', 'nonmetered_parking', 'paid_parking')

By breaking down the model into multiple models, you can now reuse the data models in other models:

Image by Author
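For example, the refactored trips_per_property.sql might look like this minimal sketch, which uses dbt’s ref function to reference the staging models and lets dbt infer the dependencies between them:

-- models/trips_per_property.sql (refactored sketch)

SELECT
    stations.property_type,
    COUNT(*) AS trips
FROM
    {{ ref('stg_trips') }} AS trips
    JOIN {{ ref('stg_stations') }} AS stations
        ON trips.start_station_id = stations.station_id
GROUP BY
    stations.property_type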

Your team members can also work on the different parts of the project simultaneously.

Image by Author

Because each module is self-contained, it can be easily tested and validated.

Image by Author

Using macros in dbt simplifies the process of reusing SQL code fragments across various models.

To demonstrate, we will create a macro called get_end_time in the macros/get_end_time.sql file. This macro computes an end time from a start time and a duration in minutes.

-- macros/get_end_time.sql
{% macro get_end_time(start_time, duration_minutes) %}
(SELECT TIMESTAMP_ADD({{ start_time }}, INTERVAL {{ duration_minutes }} MINUTE) AS end_time)
{% endmacro %}

Think of the get_end_time macro as a Python function that can be used repeatedly to generate a value based on the inputs provided.

Image by Author

Another model can now use this macro with specific values for start_time and duration_minutes:

SELECT
    start_station_id,
    start_time,
    end_station_id,
    {{ get_end_time('start_time', 'duration_minutes') }} AS end_time
FROM
    `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
    start_station_id IS NOT NULL
    AND end_station_id IS NOT NULL
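When dbt compiles this model, the macro call is rendered inline. The compiled SQL should look roughly like the following; this is a sketch, not verbatim compiler output:

SELECT
    start_station_id,
    start_time,
    end_station_id,
    (SELECT TIMESTAMP_ADD(start_time, INTERVAL duration_minutes MINUTE) AS end_time) AS end_time
FROM
    `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
    start_station_id IS NOT NULL
    AND end_station_id IS NOT NULL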

With dbt, you can easily verify the correctness of your models. For instance, to ensure the stg_stations model is correct, you may want to check that:

  • The station_id column is non-null and has unique values.
  • The property_type column only has the values ‘free_parking’ or ‘paid_parking’.

To test the stg_stations model, create a new YAML file in the models directory, named models/schema.yml with the following content.

version: 2

models:
  - name: stg_stations
    columns:
      - name: station_id
        tests:
          # Should not contain null values.
          - not_null
          # Should be unique.
          - unique
      - name: property_type
        tests:
          # Should be either 'free_parking' or 'paid_parking'.
          - accepted_values:
              values: ['free_parking', 'paid_parking']

Then run dbt test to run all of the tests.
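You can also limit testing to a single model by selecting it:

dbt test --select stg_stations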

By catching potential errors early in the transformation process, tests can reduce the risk of errors in downstream applications and analyses.

Image by Author

In addition, tests provide a way to track changes in the data over time and to ensure that the transformation logic remains correct as the underlying data changes.

Image by Author

With dbt, you can easily document your models and share the documentation with your team.

To create documentation for your models, follow these steps:

  • Add descriptions to your models/schema.yml file (a sketch of what this can look like appears after these steps).
Image by Author

View full file.

  • Run dbt docs generate to generate the documentation for your project.
  • Run dbt docs serve to launch the documentation.
  • Access the documentation by navigating to http://localhost:8080 in your web browser.
Image by Author
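For reference, the descriptions added in the first step might look something like the sketch below; the wording of each description is illustrative:

version: 2

models:
  - name: stg_stations
    description: Bikeshare stations, with property types grouped into free and paid parking.
    columns:
      - name: station_id
        description: Unique identifier of the station.
        tests:
          - not_null
          - unique
      - name: property_type
        description: Either 'free_parking' or 'paid_parking'.
        tests:
          - accepted_values:
              values: ['free_parking', 'paid_parking']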

Documenting your data models helps to ensure that everyone who works with the models can comprehend their functionality and purpose.

You can also examine data lineage in the documentation by clicking “View lineage graph.”

Image by Author

With dbt, you can track data changes over time through snapshots.

To see why this is useful, imagine you have a subscriber_type table where the subscriber_type field is frequently overwritten as subscribers modify their membership status.

If a user changes their membership from “trial” to “premium”, the record of when the user had the “trial” membership will be lost.

Image by Author

Losing this record can be problematic if we want to use it for other evaluations, such as calculating how long, on average, users take to change their subscription.

dbt can snapshot these changes to help you understand how values in a row change over time. Here’s an example of a snapshot table for the previous example:

Image by Author

View this tutorial on how to create snapshots.
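As a rough illustration, a snapshot definition could look like the sketch below. The subscriber_type source table, its id column, and the target schema are hypothetical; the check strategy tells dbt to record a new row whenever the subscriber_type value changes:

-- snapshots/subscriber_type_snapshot.sql

{% snapshot subscriber_type_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='check',
        check_cols=['subscriber_type']
    )
}}

-- Hypothetical source table; replace with your own.
SELECT * FROM `your-project.your_dataset.subscriber_type`

{% endsnapshot %}

Running dbt snapshot then builds the snapshot table, adding dbt_valid_from and dbt_valid_to columns that record when each version of a row was current.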

While dbt can provide significant benefits for managing and modeling data, you should not use dbt when:

  1. You don’t have a data warehouse: dbt is designed to work specifically with data warehouses, such as Snowflake, Redshift, or BigQuery. If you don’t have a data warehouse, then dbt may not be the right tool for you.
  2. You want a one-stop shop for ETL: While dbt is an excellent tool for transforming data, it does not offer a complete end-to-end solution for ETL (Extract, Transform, Load) tasks. Other tools are needed for tasks such as data extraction, data cleansing, and data loading.
  3. You want to visualize your data: dbt is not a data visualization tool. If you want to create visualizations, you will need to use a separate tool such as Tableau, Looker, or Power BI.
  4. You want advanced feature engineering beyond the limitations of SQL grammar: As SQL is dbt’s primary language, there are some limitations when it comes to feature engineering. If you need to perform advanced feature engineering tasks that go beyond the capabilities of SQL, you may need to use other tools or languages such as Python or R.
Image by Author

In this article, we covered some helpful features of dbt and discussed when to use it and when not to. I hope this article gives you the knowledge needed to decide whether dbt is the right solution for your data pipelines.

