
ETL testing – Testing your data pipelines



It is 2023! New data paradigms (or buzzwords) like ELT, reverse ETL, EtLT, data mesh, data contracts, FinOps and the modern data stack have found their way into mainstream data conversations. Our data teams are still figuring out what is hype and what is not.

There may be 10 new paradigms tomorrow, but some of the fundamental challenges in data engineering — like data quality — are still relevant and not solved completely (I don’t think we ever will solve this problem completely). The first step in improving data quality is to test changes to our data pipelines rigorously.

In this article, let us review the challenges involved in testing data pipelines effectively and how to build a well-rounded testing strategy for your organization.

In the software application development world, improving the quality of software means rigorous testing. Similarly, in data engineering we need a comprehensive testing strategy to achieve high-quality data in production.

Most data teams are racing against hard deadlines. As a result, data engineering culture is such that we end up building pipelines that serve data by the end of the week instead of incorporating all the best practices that are valuable in the long run. On top of that, ETL testing comes with its own challenges:

  • In ETL testing, we compare huge volumes of data (say, millions of records), often from different source systems. We are comparing transformed data that is the result of complex SQL queries or Spark jobs.
  • Not all data engineers (and data engineering leaders) come from a software engineering background, so not everyone is strong in SWE development principles and best practices.
  • Running an automated suite of tests and automated deployment/release of data products is still not mainstream.

ETL testing is a data-centric testing process. To effectively test our pipelines, we need production-like data (in terms of volume, variety and velocity).


Getting access to production-like data is hard. Here is how data teams in different companies tackle the problem of getting the right data to test their data pipelines.

Option 1: Mock or synthetic data.
Pros: This is the most prevalent approach among data engineers because mock data is easy to create and synthetic data generation tools (such as Faker) are readily available.

Cons: Mock data does not reflect the production data in terms of volume, variety or velocity.
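For example, here is a minimal sketch of generating mock order records with Faker; the column names and value ranges are assumptions for illustration, not a prescribed schema:

```python
import random
from faker import Faker

fake = Faker()

def generate_mock_orders(n: int = 1000) -> list[dict]:
    """Generate n fake customer-order records for pipeline tests."""
    return [
        {
            "order_id": fake.uuid4(),
            "customer_name": fake.name(),
            "order_ts": fake.date_time_between(start_date="-30d", end_date="now"),
            "amount": round(random.uniform(5.0, 500.0), 2),
            "country": fake.country_code(),
        }
        for _ in range(n)
    ]

mock_orders = generate_mock_orders(1000)
```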

Option 2: A sample of production data.
Pros: It is easier to copy a fraction of production data than to copy huge swathes of prod data.

Cons: You have to use the right sampling strategy to ensure the sample reflects real-world prod data. Tests that run successfully on sampled prod data might still fail on actual prod data because volume and variety are not guaranteed.
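As a sketch, a stratified sample preserves variety better than a naive random sample; the stratification column used here (country) is an assumption for illustration:

```python
import pandas as pd

def stratified_sample(prod_df: pd.DataFrame, strata_col: str, frac: float = 0.01) -> pd.DataFrame:
    """Take the same fraction from every stratum so rare categories survive the sampling."""
    return (
        prod_df.groupby(strata_col, group_keys=False)
        .apply(lambda g: g.sample(frac=frac, random_state=42))
    )

# e.g. test_df = stratified_sample(orders_df, strata_col="country", frac=0.01)
```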

Option 3: A full copy of production data.
Pros: Availability of real-world production data for testing.

Cons: If prod contains PII data, it might lead to data privacy violations. If the prod data is constantly changing, the copy in the test/dev environment becomes stale and needs to be constantly refreshed. Volume and variety are guaranteed, but not velocity.

Option 4: An anonymized copy of production data.
Pros: Availability of real-world production data for testing. Compliance with data privacy regulations.

Cons: Again, constantly changing prod data means the data in the test environment becomes stale and needs to be refreshed often. PII anonymization needs to run every time you copy data out of prod. Manually running anonymization steps every time and maintaining a long-running test data environment is error-prone and resource-intensive.
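To illustrate the kind of step that has to run on every copy, here is a minimal pseudonymization sketch; the PII column names are assumptions, and real deployments typically rely on dedicated masking tooling rather than a hand-rolled script:

```python
import hashlib
import pandas as pd

PII_COLUMNS = ["customer_name", "email", "phone"]  # assumed PII columns

def pseudonymize(df: pd.DataFrame, salt: str) -> pd.DataFrame:
    """Replace PII values with salted SHA-256 hashes before the data leaves prod."""
    out = df.copy()
    for col in PII_COLUMNS:
        if col in out.columns:
            out[col] = out[col].astype(str).map(
                lambda v: hashlib.sha256((salt + v).encode("utf-8")).hexdigest()
            )
    return out
```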

Option 5: Short-lived test environments created with a data versioning tool.
Pros: Availability of real-world production data. Automated, short-lived test environments that are available through a git-like API.

Cons: Adds a new tool (such as lakeFS) to your existing data stack.

Note: If you’re interested in exploring data versioning tools for ETL testing, here is a quick guide to using lakeFS for your reference.

Video tutorial on how to use lakeFS to create different data environments for ETL testing. Source: Created by Author.
  • A comprehensive list of tests needs to be run to ensure the quality and reliability of data in the lakehouse or warehouse.
  • You could run them using a data quality testing tool (such as Great Expectations, soda.io, etc.) or build an in-house test suite (a minimal skeleton is sketched after this list). It doesn’t matter which!
  • Depending on the project, data set and business, different types of tests need to be executed. However, here is a checklist of basic tests on different dimensions of data quality.
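If you go the in-house route, the core of such a suite can be as simple as a list of check functions run against a DataFrame. This is a toy sketch, not a full framework; the checks and column names are assumptions:

```python
import pandas as pd

def check_row_count_min(df: pd.DataFrame, min_rows: int):
    return ("row_count_min", len(df) >= min_rows, f"rows={len(df)}")

def check_no_nulls(df: pd.DataFrame, column: str):
    nulls = int(df[column].isna().sum())
    return (f"no_nulls:{column}", nulls == 0, f"nulls={nulls}")

def run_checks(df: pd.DataFrame, checks) -> bool:
    """Run every check, print a PASS/FAIL line, and return overall success."""
    results = [check(df) for check in checks]
    for name, passed, details in results:
        print(f"{'PASS' if passed else 'FAIL'}  {name} ({details})")
    return all(passed for _, passed, _ in results)

# e.g. run_checks(orders_df, [lambda d: check_row_count_min(d, 1),
#                             lambda d: check_no_nulls(d, "order_id")])
```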

Expanding on the above cheatsheet, consistency checks should include comparing the source and target systems. In some industries, inconsistent data is permissible (YES!) as long as the error variance stays below a specified threshold, because there may be inconsistencies between different data systems when they are updated asynchronously (thanks to microservices and message queues).

For example, when you compare the number of customer orders in the last hour between source and target, the results may vary. But when you run aggregate tests (e.g., the number of orders over a month), the data will converge.
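A minimal sketch of such a threshold-based consistency check (the 1% tolerance is an arbitrary example; the acceptable variance depends on your business):

```python
def counts_consistent(source_count: int, target_count: int, tolerance: float = 0.01) -> bool:
    """Pass if source and target row counts differ by at most `tolerance` (e.g. 1%)."""
    if source_count == 0:
        return target_count == 0
    variance = abs(source_count - target_count) / source_count
    return variance <= tolerance

# counts_consistent(source_count=100_000, target_count=99_450)  # True at a 1% tolerance
```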

Your checks should include data validation and domain value checks. For example, a DOB column cannot contain a date more than 200 years in the past.

Depending on your domain, data should fall within a specific range, and some business KPIs cannot take certain values. For example, a column named click-through ratio cannot have values greater than 1.0, and so on.
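A sketch of both example rules as simple pandas checks; the column names (dob, click_through_ratio) are assumptions:

```python
import pandas as pd

def domain_rule_violations(df: pd.DataFrame) -> dict:
    """Count the rows violating each (assumed) domain rule."""
    today = pd.Timestamp.today()
    return {
        "dob_older_than_200_years": int(
            (pd.to_datetime(df["dob"], errors="coerce") < today - pd.DateOffset(years=200)).sum()
        ),
        "ctr_above_1": int((df["click_through_ratio"] > 1.0).sum()),
    }
```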

These checks should also cover record count, column count, percentage of missing values per column, percentage of null values, range checks (min_value, max_value) and basic statistics such as mean, median, variance and data distribution (the frequency of values in a specific column), and so on.
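A small profiling helper that collects most of these statistics so they can be compared across runs (a sketch; what you store and what you compare against is up to you):

```python
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Basic completeness and distribution statistics for a data set."""
    return {
        "record_count": len(df),
        "column_count": df.shape[1],
        "pct_null_per_column": (df.isna().mean() * 100).round(2).to_dict(),
        "numeric_summary": df.select_dtypes(include="number").describe().to_dict(),
    }
```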

Again, depending on the business domain, you can allow a threshold for error. In pharma and finance, data teams strive for 100% data validation with zero error variance because of compliance requirements.

Audit checks involve comparing the audit logs of the source and target systems and making sure each transformation or data movement step is captured and matches.

Data lineage tools support data audits.

This involves tests around data and column formats: the data types of columns, the percentage of missing columns, of columns with mismatched names, and of columns with mismatched data formats (a date may be in MMDDYYYY format in one source and in DDMMYYYY format in another).
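A sketch of a schema check against an expected column-to-dtype mapping; the schema below is a made-up example, not a recommendation:

```python
import pandas as pd

EXPECTED_SCHEMA = {  # assumed expected schema for illustration
    "order_id": "object",
    "amount": "float64",
    "order_date": "datetime64[ns]",
}

def schema_mismatches(df: pd.DataFrame) -> dict:
    """Report missing columns and columns whose dtype differs from the expected schema."""
    missing = [c for c in EXPECTED_SCHEMA if c not in df.columns]
    wrong_type = {
        c: str(df[c].dtype)
        for c, expected in EXPECTED_SCHEMA.items()
        if c in df.columns and str(df[c].dtype) != expected
    }
    return {"missing_columns": missing, "mismatched_dtypes": wrong_type}
```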

Uniqueness checks are exactly what the name implies: check which columns should have unique values, the percentage of duplicate entries, and so on.
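For instance, a sketch of a duplicate check over an assumed business key:

```python
import pandas as pd

def duplicate_stats(df: pd.DataFrame, key_columns: list[str]) -> dict:
    """Percentage of rows whose business key appears more than once."""
    dup_rows = int(df.duplicated(subset=key_columns, keep=False).sum())
    return {
        "duplicate_rows": dup_rows,
        "duplicate_pct": round(100 * dup_rows / max(len(df), 1), 2),
    }

# e.g. duplicate_stats(orders_df, key_columns=["order_id"])
```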

Late-arriving data is a common challenge. It happens when a data point generated at the source system arrives at the landing zone after a delay. For example, yesterday’s data might land in today’s date partition and get processed with today’s data points. This leads to unexpected data errors.

So, validating the created_timestamp column in the data against the date_partition is crucial too.
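A minimal sketch of that validation, assuming a created_timestamp column and a daily partition date:

```python
import pandas as pd

def late_arrivals(df: pd.DataFrame, partition_date: str) -> pd.DataFrame:
    """Return the rows whose created_timestamp falls outside the partition they landed in."""
    created = pd.to_datetime(df["created_timestamp"]).dt.date
    return df[created != pd.to_datetime(partition_date).date()]

# e.g. late = late_arrivals(todays_batch, partition_date="2023-01-31")
```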

The above seven types of tests fall under white-box testing. Another type is unit testing of the complex data transformations, to assert that KPI definitions and other transformations behave as expected.
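For example, a hypothetical click-through-ratio transformation and a pytest-style unit test for it (the function and its edge-case handling are assumptions for illustration, not a definition from this article):

```python
import numpy as np
import pandas as pd

def click_through_ratio(clicks: pd.Series, impressions: pd.Series) -> pd.Series:
    """Hypothetical KPI transformation: CTR = clicks / impressions, 0.0 when there are no impressions."""
    ratio = clicks / impressions
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0.0)

def test_click_through_ratio():
    clicks = pd.Series([5, 0, 10])
    impressions = pd.Series([100, 0, 20])
    assert click_through_ratio(clicks, impressions).tolist() == [0.05, 0.0, 0.5]
```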

A data pipeline has many moving parts — starting with multiple data sources, complex data transformations and concurrent data consumers downstream. So it is not sufficient to test only the transformations. End-to-end testing of ETL pipelines is needed to make sure the pipelines work as expected.

If there is a change in one part of the pipeline, it is essential to ensure there are no regression errors in other parts of the pipeline due to that change.

This is where automated ETL testing is important. That is, after every change, a suite of tests (sometimes also called continuous integration tests) needs to be run. Only if the test suite passes can the change be pushed to production.

You can use a tool like lakeFS to run automated tests and achieve CI/CD for your data lake.

In addition to ensuring the quality of data, load testing of ETL pipelines is necessary to improve the reliability of data product releases.

So, analyze ETL task run times and the order in which tasks execute to identify bottlenecks. Oftentimes, an increase in data volume slows the pipeline down. Monitoring run times, optimizing the distributed compute jobs and right-sizing the hardware can resolve this.
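A small sketch of that kind of run-time analysis, assuming you already collect task runs with task_name, started_at and finished_at fields (the table layout is an assumption):

```python
import pandas as pd

def slowest_tasks(runs: pd.DataFrame, top_n: int = 5) -> pd.DataFrame:
    """Rank tasks by average duration to spot pipeline bottlenecks."""
    durations = (
        pd.to_datetime(runs["finished_at"]) - pd.to_datetime(runs["started_at"])
    ).dt.total_seconds()
    return (
        runs.assign(duration_s=durations)
        .groupby("task_name")["duration_s"]
        .agg(["mean", "max", "count"])
        .sort_values("mean", ascending=False)
        .head(top_n)
    )
```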

This is a comprehensive checklist of the tests one needs to run. However, most of us run only a few of them at work. If you are a data engineer, what is your team’s ETL testing strategy like?

If you like my work and want to support me…

  1. The BEST way to support me is by following me on Medium.
  2. I share tips, tricks and insights on #dataengineering #ml #mlops on LinkedIn.
  3. Feel free to give claps so I know how helpful this post was for you.

