
Engineering Best Practices to Apply to Your Analytics Workflow | by Madison Schott | May, 2022



How to use GitHub, versioning, and data quality testing to your advantage

Photo by Estúdio Bloom on Unsplash

Whether you started your career as a data engineer, data scientist, or data analyst, best practices from each role can be applied to your analytics workflow. As a former data engineer, I learned GitHub best practices such as creating new branches, reviewing pull requests, and versioning my code.

But, I’m not going to lie: when you go from a corporate role to a smaller company that is just getting its data stack off the ground, other priorities tend to come first. It can be easy to focus on speed and immediate impact rather than on best practices for writing clean code.

It is definitely a delicate balance. You want to make as much of an impact on the business as possible right off the bat, but you also want to put high standards in place from the very beginning. When you don’t start with high standards, you leave yourself a bigger mess to clean up once you decide it’s time to prioritize quality.

So, what are some best practices we can take from traditional software engineering roles and apply to analytics?

In my data engineering role, it was a best practice to create your own branch from main (or master) and make your changes there. This ensured incorrect code wasn’t pushed straight to the main branch, where it could break the codebase. It acted as a check to minimize the chances of anything going wrong.

In order for that code to make it to the main branch, it had to be reviewed and approved by two other people on my team. We’ll get more into this later, but every team should be doing this. If you have more than one person on your team and are using GitHub, it’s a necessity.

Sure, it can slow things down if you’re trying to quickly test your code and push it to production. But in the long run, it helps more than it hurts. When creating branches, make a new branch for each ticket you’re working on or each data model. Things get sloppy when you put all of your code changes on one branch.

For example, let’s say you’re working on a customer acquisition data model. You would create a GitHub branch named customer_acquisition and write your model on that branch. Then, when your task is finished, you can open a pull request to merge it into main so someone can review it.

If you were also making changes to the revenue model while working on customer_acquisition, you would create a separate branch off of main called revenue for those changes. This way, your changes for the different models aren’t all being written on the same branch.

Doing this makes it easy to track changes across data models and ensure everything is correctly merged into the main codebase.
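To make that concrete, here is a minimal sketch of the branching workflow described above, using the hypothetical customer_acquisition and revenue branches from the example:

```
# Start from an up-to-date main branch
git checkout main
git pull origin main

# Cut a dedicated branch for the customer acquisition model
git checkout -b customer_acquisition
# ...edit the model, then commit and push the branch for review
git add models/customer_acquisition.sql
git commit -m "Add customer acquisition model"
git push -u origin customer_acquisition

# The revenue work gets its own branch, also cut from main
git checkout main
git checkout -b revenue
```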

When you want to merge your branch’s code into the main codebase, you create a pull request. A pull request on GitHub highlights all of the differences between the main branch’s code and your branch’s code. Seeing these changes highlighted is a great way to ensure you’ve made the changes you intended to make.

A lot of the time, this is where I find small typos I didn’t mean to make while editing my data models. It’s a great way to catch them!

In order to increase code quality, you should enforce a mandatory code review by someone on your team before your code can be merged.

How to add branch protection

To do this, navigate to your repository’s branch protection settings and check “Require pull request reviews before merging”. You can then select the number of reviewers you wish to require; if you’re working on a small team, requiring just one is usually enough.
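If you’d rather script this than click through the UI, GitHub’s REST API exposes the same setting. The sketch below uses the gh CLI against a hypothetical owner and repository; the field names follow GitHub’s branch protection endpoint, but double-check the current API documentation before relying on it:

```
# Require one approving review before merging into main (owner/repo are placeholders)
gh api -X PUT repos/my-org/analytics/branches/main/protection --input - <<'EOF'
{
  "required_pull_request_reviews": { "required_approving_review_count": 1 },
  "required_status_checks": null,
  "enforce_admins": false,
  "restrictions": null
}
EOF
```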

You always want someone else’s eyes on your code. Two pairs of eyes are always better than one! Your teammate may catch something that you didn’t. And even if they don’t, you can feel confident that your code won’t break something in production.

As the only analytics engineer on my team, I often keep a lot of my code on my local computer. Obviously, this isn’t a great practice. You always want your team to be able to find the latest version of the code, the one that matches what they’re seeing in the data warehouse. Enforcing branches and code reviews also forces you to push your most recent code to GitHub: if you made a change, it needs to be seen by others to be approved. Transparency is always best!

Adding versions is a best practice for your data models as well as your data pipelines. Versions help you identify the changes that were made and pinpoint the last successful run or the reason for a failure.

Transformation with dbt

If you’re an analytics engineer, chances are you’re also a user (and lover) of dbt. dbt makes it easy to version your data models. Every YAML file has an option to include a version, and every time you make a major update to these files, you should update the version as well.
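For example, the project-level version lives in dbt_project.yml. A minimal sketch, with a hypothetical project name and version number:

```
# dbt_project.yml (illustrative)
name: analytics
version: "1.2.0"     # bump this whenever you make a significant change to your models
config-version: 2

models:
  analytics:
    +materialized: view
```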

Image by author

Tracking versions for your data models allows for easier debugging when something goes wrong. You can quickly revert to the last successful version of a model instead of leaving production broken until you discover the issue.

If you use dbt’s documentation feature, versioning is clearly displayed in an easy-to-read UI for you to use to your advantage.

Orchestration

For the same reason it’s important to version your dbt models, it’s also important to choose an orchestration tool that includes versioning. Again, in case something goes wrong, you want to be able to revert to a previous version so production isn’t down for long.

Personally, I use Prefect to orchestrate my data models. It offers automatic versioning for every flow within every project, which makes it easy to track the versions I have deployed to both my development and production environments. I also document the models deployed with each version in the README Prefect provides with every flow.
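As a rough sketch, this is what a Prefect 1.x-style flow looks like; the flow, task, and project names are hypothetical, and each call to flow.register() records a new version of the flow within its project:

```
from prefect import Flow, task

@task
def run_dbt_models():
    # In a real pipeline this step would shell out to `dbt run` or call your warehouse
    print("running dbt models")

with Flow("customer-acquisition-pipeline") as flow:
    run_dbt_models()

if __name__ == "__main__":
    # Registering the flow with a project bumps the version shown in the Prefect UI
    flow.register(project_name="analytics-production")
```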

Image by author

On the far right, you can see the version number listed for each of my pipelines. This helps me keep track of how many versions of each pipeline have been deployed to each environment.

Testing is another practice that is essential in the world of software engineering but often overlooked when it comes to analytics. Testing should be part of every type of development: whenever code changes in a way that could affect an end user, whether that’s a business team or an actual customer, it needs to be tested!

When I worked as a data engineer, we always had at least four different testing environments that code had to move through before finally reaching production: multiple development environments, staging, and pre-prod. While I don’t think that’s necessary for analytics, we do need at least a development environment in which to test code changes.

Development and production pipelines

Our data moves from point A to point B via our data pipelines, and those pipelines run all of our modeling code. Because of this, it is essential that they are tested. Whenever you need to deploy a new model to production, deploy it through a testing pipeline first. When you change any data model, or create a new one, there is always the chance of upstream and downstream issues, and you want to catch these in your development environment rather than while deploying changes to production.

I’ve had a few instances where I’ve needed to change column names within my base models. Whenever you make changes to your base models, chances are there are downstream models that reference those columns, and it can be hard to make all the needed changes downstream if you’re not tracking column lineage. More than once, a model has failed to run because I forgot to update a column name it referenced. Deploying to development first catches these minor mistakes before they break production.
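With dbt, a simple way to get that separation is to define both a development and a production target in profiles.yml and always run against dev first. This is a minimal sketch assuming a Snowflake warehouse; the account, database, and schema names are hypothetical:

```
# profiles.yml (illustrative)
analytics:
  target: dev                     # default to the development environment
  outputs:
    dev:
      type: snowflake
      account: my_account
      user: my_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      warehouse: transforming
      database: analytics_dev     # development database
      schema: dbt_dev
      threads: 4
    prod:
      type: snowflake
      account: my_account
      user: my_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      warehouse: transforming
      database: analytics         # production database
      schema: core
      threads: 4
```

With that in place, dbt run --target dev builds your changes into the development database, and only once the models and tests pass there do you run the same command with --target prod.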

Data quality checks

Another key part of testing is baking data quality checks into your data pipeline. These tests should run within the pipeline itself, validating key aspects of your data along the way.

dbt tests

dbt tests are a great way to do this if you’re already using dbt to write your data models. You can add them directly to your dbt project and have them check for null, unique, and accepted values as well as relationships between columns.

The best part about dbt tests is that you can run them against your source data as well as your models. I recommend setting up both! This way you catch upstream data quality issues as well as issues introduced by your models’ code.
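Here is a minimal sketch of what that looks like in a schema YAML file; the source, model, and column names are hypothetical:

```
version: 2

sources:
  - name: stripe
    tables:
      - name: payments
        columns:
          - name: payment_id
            tests:
              - not_null
              - unique

models:
  - name: customer_acquisition
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: acquisition_channel
        tests:
          - accepted_values:
              values: ['organic', 'paid', 'referral']
```

Running dbt test then executes both the source-level and the model-level checks.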

re_data

I also like to use a dbt package called re_data, which can detect anomalies in your data. I specifically use it to test for freshness and row count. The package runs like a dbt model, and it can send alerts directly to a Slack channel of your choice.
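re_data installs like any other dbt package. The version range below is only illustrative, and the monitoring configuration keys vary between releases, so treat this as a sketch and check the package documentation for your version:

```
# packages.yml
packages:
  - package: re-data/re_data
    version: [">=0.4.0", "<0.5.0"]
```

After adding the package, dbt deps pulls it into your project.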

It works by calculating the mean and standard deviation of the metrics you are monitoring and flagging values that deviate too far from the norm. It does a great job of identifying odd issues that would go undetected by other types of tests. To learn how to set up this dbt package, check out this article.
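In other words, the check is essentially a z-score: a metric value that lands too many standard deviations away from its recent mean gets flagged. re_data’s actual implementation is more involved, but here is a minimal sketch of the idea, with an arbitrary threshold and made-up row counts:

```
import statistics

def is_anomaly(history, latest, z_threshold=3.0):
    """Flag `latest` if it is more than `z_threshold` standard deviations from the mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

# Example: daily row counts for a table, with a sudden drop on the latest run
row_counts = [10_230, 10_410, 10_150, 10_380, 10_290]
print(is_anomaly(row_counts, 4_500))  # True, worth an alert
```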

I can’t stress this enough: best practices should be set from the very beginning. The longer you go without them, the messier your data and code will get, and you will only create more work for yourself and your team in the long run. It may not sound fun or glamorous, but documenting your team’s best practices from the start will take your team far.

We have a lot to learn from software engineering teams as analytics becomes more and more complex. As engineering and analytics intertwine, so do the standards we set in place. We can’t keep doing things the way we’ve always done them. We need to embrace GitHub, versioning, and testing. There is a reason they are commonplace in engineering: they mitigate risk!

For more best practices in the world of analytics engineering, subscribe to my newsletter.

