How to Use dbt Seeds. What they are and when to use them | by Madison Schott | Apr, 2023


What they are and when to use them


I’ll be honest, I haven’t used dbt seeds in a long time. I actually just rediscovered them a few days ago when trying to validate a change I made to one of our data models.

My manager sent me a CSV file with true duplicate values found in one of our data models. I was working on removing these duplicates in one of our sources’ staging models. After making the change, I needed to validate that these duplicates were no longer in the model.

Instead of manually uploading the CSV file into our data warehouse, which is near impossible with Redshift, I used dbt seeds to import the CSV file into our warehouse. I was then able to quickly run a validation query in our warehouse to confirm the duplicates were removed.

It was so easy! And, as an analytics engineer, I LOVE when things are easy. Because let’s be honest, they rarely are. It got me thinking about how we can further use dbt seeds to make our lives as analytics engineers easier.

Not to mention, dbt seeds can also be used by data scientists and machine learning engineers to test their models. You can use them to ingest both test data and training data, so the two exist as separate reference tables in your data warehouse.

In dbt, there are three main types of objects: sources, models, and seeds. You can test and document seeds just as you can sources and models. The difference is that seeds aren’t built from SQL code in your dbt project, so they aren’t data models. They’re CSV files that dbt loads into your warehouse as tables, which you can then reference just like a model.

In order to use dbt seeds, you need to have a CSV file that you want to ingest into your data warehouse. First, make sure you can find the location of your CSV file. Then, move it to the seeds directory within your dbt project. If you are already in your dbt project, you can run the following command to move it to the seeds directory:

mv <CSV file path> seeds

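Projects created with dbt init include this directory by default, so you usually don’t need to create it yourself. (The location is set by seed-paths in dbt_project.yml; versions before dbt 1.0 used a data directory instead.)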

Note: make sure you name this CSV file something descriptive! It’s best to set a standard in your dbt style guide for how your seeds should be named.
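If you would rather script the move and enforce a naming rule at the same time, something like the following could work. This is only a sketch: the stage_seed helper and the snake_case rule are assumptions made for illustration, not part of dbt or of any particular style guide.

```python
import re
import shutil
from pathlib import Path

# Hypothetical naming rule for this sketch: lowercase snake_case, e.g. "duplicate_users".
SEED_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")


def stage_seed(csv_path, project_dir, seed_name):
    """Move a CSV into <project_dir>/seeds as <seed_name>.csv and return the new path."""
    if not SEED_NAME_PATTERN.match(seed_name):
        raise ValueError(f"seed name {seed_name!r} is not snake_case")
    seeds_dir = Path(project_dir) / "seeds"
    # Create the directory only if the project does not already have it.
    seeds_dir.mkdir(parents=True, exist_ok=True)
    destination = seeds_dir / f"{seed_name}.csv"
    shutil.move(str(csv_path), str(destination))
    return destination
```

Calling stage_seed("export.csv", ".", "duplicate_users") from the project root would leave the file at seeds/duplicate_users.csv, which is the name the resulting table takes.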

Then, simply run the dbt seed command:

dbt seed

This will create a table in your data warehouse’s target schema by the name of your CSV file. Now, you can reference this table just like you would a dbt model that you created!

If you’re not familiar, the syntax for selecting from a dbt model is:

{{ ref('duplicate_users') }}

If my CSV file was named duplicate_users.csv then the table in my target schema would be named duplicate_users and I would reference this table in another model as I did above.

dbt seeds can be used for many different reasons. Whenever you need to query a static CSV file, you can load it as a seed and use dbt to transform that data!

Validation

I mentioned an example of using dbt seeds for validation in my introduction. It’s how I most recently used seeds in my own work as an analytics engineer. I had data that someone else pulled for me and I used it to check the output of some changes I made to a data model.

This is great to do when a business team comes to you for help, providing you with a CSV or Excel file (exported to CSV first) that points to some data problem. Instead of recreating the problem yourself, you can use this data directly within dbt to problem solve.
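To make the validation idea concrete, here is a rough sketch of the kind of check I mean. It uses Python’s built-in sqlite3 as a stand-in for the warehouse, and the stg_users table, the user_id and email columns, and the sample rows are all made up for illustration. In a real project, dbt seed does the loading and the query would run in your warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical contents of seeds/duplicate_users.csv.
DUPLICATE_USERS_CSV = "user_id,email\n101,a@example.com\n102,b@example.com\n"


def load_seed(conn, table, csv_text):
    """Create a table named after the CSV and load its rows (a rough stand-in for `dbt seed`)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


def remaining_duplicates(conn):
    """Return (user_id, row_count) for seeded users that still appear more than once in stg_users."""
    query = """
        SELECT user_id, COUNT(*) AS row_count
        FROM stg_users
        WHERE user_id IN (SELECT user_id FROM duplicate_users)
        GROUP BY user_id
        HAVING COUNT(*) > 1
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
load_seed(conn, "duplicate_users", DUPLICATE_USERS_CSV)
# Hypothetical staging model output after the deduplication fix.
conn.execute("CREATE TABLE stg_users (user_id TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO stg_users VALUES (?, ?)",
    [("101", "a@example.com"), ("102", "b@example.com"), ("103", "c@example.com")],
)
print(remaining_duplicates(conn))  # an empty list means no seeded user is still duplicated
```

The query only looks at users listed in the seed, so it answers the narrow question "did my change fix the rows I was told about?" rather than scanning the whole model for duplicates.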

Static Reference Tables

I see most people using dbt seeds to create static reference tables. Because these tables rarely, if ever, need to change, it makes sense to use seeds to ingest them into your data warehouse.

Some great examples of this are dim_date tables, country_code tables, or zipcode mapping tables. These are used often in analysis but change rarely, if at all. Instead of manually creating them as a dbt data model, it makes sense to reuse a publicly available file and ingest it into your warehouse. This way, you aren’t making more work for yourself!
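If no suitable public file is at hand, a small script can produce one. The sketch below writes a minimal dim_date CSV; the column names and the write_dim_date helper are my own assumptions for illustration, not a standard layout.

```python
import csv
from datetime import date, timedelta


def write_dim_date(path, start, end):
    """Write one row per calendar day from start to end (inclusive) and return the row count."""
    count = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # Hypothetical column set; adjust to whatever your analyses need.
        writer.writerow(["date_day", "year", "month", "day_of_week", "is_weekend"])
        current = start
        while current <= end:
            weekday = current.isoweekday()  # Monday = 1 ... Sunday = 7
            writer.writerow([
                current.isoformat(),
                current.year,
                current.month,
                weekday,
                str(weekday >= 6).lower(),
            ])
            current += timedelta(days=1)
            count += 1
    return count
```

Running write_dim_date("seeds/dim_date.csv", date(2020, 1, 1), date(2030, 12, 31)) from the project root and then dbt seed would give you a dim_date table to reference like any other seed.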

Testing and Training Models

Instead of dealing with large datasets, you can use dbt seeds to ingest small CSV files to use as your test and training datasets. This will save you time running the model and space storing large amounts of data in your data warehouse, therefore saving you money.

Transferring Business Knowledge to Your Warehouse

Lastly, you can use seeds to bring in data that business teams have only ever kept in spreadsheets. Before a company has a data warehouse, it’s common for most of its data to be stored in Excel spreadsheets.

Once these spreadsheets are exported to CSV, dbt seeds let you easily load them as tables in your warehouse, in hopes of creating a single source of truth. Just keep in mind that you don’t want to have to keep seeding and reseeding spreadsheets that business teams are still actively using. This approach is better suited to capturing static or historic data that is no longer being updated.
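Spreadsheet exports often come with headers like "Customer Name" that make awkward column names. A small cleanup step before seeding can help. The following is a sketch under my own assumptions: the to_snake_case and clean_headers helpers are hypothetical, and the snake_case rule is a preference, not something dbt requires.

```python
import csv
import io
import re


def to_snake_case(header):
    """Lowercase a column header and collapse anything non-alphanumeric into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", header.strip().lower()).strip("_")


def clean_headers(csv_text):
    """Return the CSV text with only its header row rewritten to snake_case."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    rows[0] = [to_snake_case(name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

Only the first row is touched, so the data itself reaches the warehouse exactly as the business team recorded it.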

dbt seeds are powerful when used correctly. If you are already a dbt user, they are an easy way to move a CSV file into your data warehouse. They allow for quick validation and static data table creation. So, the next time you need to use a CSV for analysis, consider seeding the file!


