Apply Data Quality Checks at These 5 Points in Your Data Journey


Improve Quality of Your Data by Applying These Checks in Your Journey

Photo by Martin Adams on Unsplash

Imagine you have just delivered a fancy new Data Lake with some cool Data Pipelines that bring data from all over your organisation. Now imagine your dismay when your business teams realise the data in the Lake is all garbage.

Remember the adage: garbage in, garbage out.

This isn’t necessarily true; if garbage goes in, you can absolutely clean it on its journey and make sure only clean data is published. So let’s look at five places where you should clean your data.

If you are unfamiliar with Data Quality dimensions, I suggest you read point №7 in this article first:

1. Data Capture

This is your first line of defence: a team of people working in your stores, your call centres or perhaps as online support agents. It could be your online sign-up forms or the physical documents that your agents must manually input into your systems. Whatever method you use to collect data from your clients, it is imperative that, at this point, the data is complete, unique and valid.

Getting data captured correctly saves 3–4 times the effort of fixing it downstream in your other layers. So focus on these data quality dimensions:

Completeness: The data being collected and captured is complete (no NULLs), i.e. all mandatory fields are populated and no key data points are missing.

Uniqueness: The data is kept as unique as possible, i.e. if a client already has an account, another account is not being set up. If the mobile number already exists in the system, the current order is linked to the old order etc.

Validity: The data being captured conforms to corporate standards, e.g. the rule that an account number is eight digits long and starts with a 9 is enforced at the time of capture.
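As a minimal sketch of what these capture-time checks could look like in Python: the field names, the eight-digit-starting-with-9 rule from above, and the existing_mobile_numbers lookup are illustrative assumptions, not a reference to any specific system.

```python
import re

# Hypothetical mandatory fields for a sign-up record (illustrative only).
MANDATORY_FIELDS = ["customer_name", "date_of_birth", "mobile_number", "account_number"]


def validate_capture(record: dict, existing_mobile_numbers: set) -> list:
    """Return a list of data quality errors found at the point of capture."""
    errors = []

    # Completeness: every mandatory field must be present and non-empty.
    for field in MANDATORY_FIELDS:
        if not record.get(field):
            errors.append(f"Missing mandatory field: {field}")

    # Uniqueness: if the mobile number is already known, link to the existing
    # account rather than creating a new one.
    if record.get("mobile_number") in existing_mobile_numbers:
        errors.append("Mobile number already exists - link to the existing account")

    # Validity: account number must be eight digits long and start with a 9.
    account = record.get("account_number") or ""
    if not re.fullmatch(r"9\d{7}", account):
        errors.append("Account number must be eight digits long and start with 9")

    return errors


if __name__ == "__main__":
    record = {"customer_name": "Jane Doe", "date_of_birth": "1990-04-12",
              "mobile_number": "07700900123", "account_number": "91234567"}
    print(validate_capture(record, existing_mobile_numbers={"07700900456"}))  # []
```

Rejecting (or flagging) the record at this point is far cheaper than repairing it once it has propagated into downstream layers.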

2. Data Transfer

Whenever a data transfer takes place, engineers should consider both the source and the target. It doesn't matter whether data is being transferred as part of an ETL process or as a general file transfer at the end of the business day.

When data is transferred, the mechanism may not be able to check whether the data is complete or valid. However, the tool must check for data consistency.

Consistency: The data is consistent across all the tables that hold the same values. This could translate to well-reconciled data between source and target, i.e. 100 records sent, 100 records received. Or that a table containing specific values, like date of birth, is consistent with other tables that hold the same or similar information. Orphaned records (those that exist in A but not in B) should be highlighted, monitored and remediated.
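A hedged sketch of such a reconciliation in pandas; the table contents and the order_id key are made up for illustration.

```python
import pandas as pd

# Illustrative source (A) and target (B) extracts of the same feed.
source = pd.DataFrame({"order_id": [1, 2, 3, 4], "amount": [10, 20, 30, 40]})
target = pd.DataFrame({"order_id": [1, 2, 4], "amount": [10, 20, 40]})

# Consistency check 1: record counts should reconcile (100 sent, 100 received).
if len(source) != len(target):
    print(f"Count mismatch: {len(source)} records sent, {len(target)} received")

# Consistency check 2: orphaned records - keys present in A but missing from B
# should be highlighted, monitored and remediated.
orphans = source.loc[~source["order_id"].isin(target["order_id"]), "order_id"]
print("Orphaned order_ids:", orphans.tolist())
```

In practice these checks would run automatically after every transfer, with the mismatches written to a reconciliation report rather than printed.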

3. Data Storage

Regardless of whether data gets transformed or consumed, it will spend most of its life in this layer. Once data has landed here, it tends to be forgotten until a downstream use case needs it. It's better to use this time to improve the quality of the data in this layer and avoid the project panic when it is suddenly required.

You can focus on these critical data quality dimensions:

Completeness: Null reporting — how many columns are Null, and why are they Null? Can we change the data capture process to avoid these Nulls coming through?

Uniqueness: Are non-mandatory attributes unique? Are duplications going to impact downstream reporting?
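A minimal pandas sketch of this kind of Null reporting and duplicate detection over a stored table; the column names and sample rows are illustrative assumptions.

```python
import pandas as pd

# A small illustrative slice of a stored customer table.
df = pd.DataFrame({
    "account_number": ["91234567", "91234567", "99887766", None],
    "email": ["a@x.com", "a@x.com", None, None],
    "date_of_birth": ["1990-04-12", "1990-04-12", None, "1985-01-30"],
})

# Completeness: how many values in each column are Null, and what share of the table?
null_report = df.isna().sum().to_frame("null_count")
null_report["null_pct"] = (null_report["null_count"] / len(df) * 100).round(1)
print(null_report)

# Uniqueness: duplicated attribute combinations that could distort downstream reporting.
dupes = df[df.duplicated(subset=["account_number", "email"], keep=False)]
print(f"{len(dupes)} rows share an account_number/email combination")
```

Feeding these counts back to the data capture team answers the question above: why are the Nulls coming through, and can the capture process be changed to stop them?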

4. Data Transformation

Now we get into data pipelines and ETL processes. This layer changes the data, adding aggregations and counts or increasing granularity by normalising. It is challenging to maintain appropriate data lineage through this phase.

Ideally, the data quality should be acceptable before pipeline creation begins. However, this is rarely the case due to the sequence of events. Most of the time, the data quality itself is handled in the pipeline.

So what should we focus on?

Timeliness: We cannot execute a pipeline if we don’t get the latest data. So ensuring data is available promptly to meet agreed SLAs is crucial.

Consistency: Although challenging to execute, keeping the data consistent is essential. This should include relevant reconciliation checks from source to target, including intelligent data analysis. For example, tolerance checks on the tables processed: if we generally receive 100 records and have received just two today, how do we alert the user to this discrepancy?

Validity: Before running data through an expensive pipeline, it is wise to check it for any validity issues. Non-conformance under the validity dimension could render the transformation and subsequent consumption useless. This is especially helpful when data capture doesn't have robust controls.
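One way to express these three checks is a small pre-flight gate that runs before the pipeline is triggered. This is a sketch only: the thresholds, the freshness SLA and the feed metadata passed in are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Illustrative expectations for the daily feed (assumptions, not prescriptions).
EXPECTED_DAILY_VOLUME = 100          # we generally receive ~100 records
VOLUME_TOLERANCE = 0.5               # alert if we get less than half of that
FRESHNESS_SLA = timedelta(hours=6)   # agreed SLA for how fresh the feed must be


def preflight_checks(record_count: int, last_loaded_at: datetime, invalid_rows: int) -> list:
    """Return a list of alerts; an empty list means the pipeline can run."""
    alerts = []

    # Timeliness: the latest feed must have arrived within the agreed SLA.
    if datetime.now(timezone.utc) - last_loaded_at > FRESHNESS_SLA:
        alerts.append("Feed is stale - freshness SLA breached")

    # Consistency: tolerance check on volume (e.g. 2 records instead of ~100).
    if record_count < EXPECTED_DAILY_VOLUME * VOLUME_TOLERANCE:
        alerts.append(f"Volume anomaly: received {record_count} records, expected ~{EXPECTED_DAILY_VOLUME}")

    # Validity: don't transform data that fails basic validity rules.
    if invalid_rows > 0:
        alerts.append(f"{invalid_rows} rows failed validity checks - fix before transforming")

    return alerts


if __name__ == "__main__":
    print(preflight_checks(record_count=2,
                           last_loaded_at=datetime.now(timezone.utc) - timedelta(hours=8),
                           invalid_rows=0))
```

If any alert is raised, the pipeline run is skipped and the owner is notified, which is far cheaper than transforming and publishing bad data.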

5. Data Consumption

This layer is where you start to see the actual business value; everything before this was about making sure the data is readily available here. At this layer, to ensure the business problem is solved, check two critical data quality dimensions:

Accuracy: The data is accurate enough for reporting, such as board metrics. Account numbers are associated with the correct customer segments, and the date of birth is not a default value like 01/01/1901.

Timeliness: The data is available at the time of reporting: not so early that it excludes recent records, and not so late that it misses the reporting deadline. All agreed SLAs must be met to ensure the consumption layer has the data available when required and stays fit for purpose.
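A brief sketch of consumption-layer checks, reusing the 01/01/1901 default date mentioned above (written in ISO format here); the column names, sample rows and reporting cut-off time are assumptions for illustration.

```python
import pandas as pd

# An illustrative slice of the dataset feeding a board report.
report = pd.DataFrame({
    "account_number": ["91234567", "99887766"],
    "customer_segment": ["Retail", None],
    "date_of_birth": ["1988-06-02", "1901-01-01"],   # 01/01/1901 is the default placeholder
    "loaded_at": pd.to_datetime(["2022-05-01 06:55", "2022-05-01 07:20"]),
})

# Accuracy: flag placeholder dates of birth and accounts missing a customer segment.
bad_dob = report["date_of_birth"].eq("1901-01-01")
missing_segment = report["customer_segment"].isna()
print(f"{bad_dob.sum()} default DOBs, {missing_segment.sum()} accounts without a segment")

# Timeliness: everything feeding the report must land before the agreed cut-off.
cutoff = pd.Timestamp("2022-05-01 07:00")
late = report["loaded_at"].gt(cutoff)
print(f"{late.sum()} records arrived after the {cutoff:%H:%M} reporting cut-off")
```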

Conclusion

Data quality has other dimensions like relevance, conformity, and integrity; however, I recommend applying the checks mentioned earlier at various points of your data journey to achieve the best bang for your buck. As you start to derive benefits, you may adjust your strategy accordingly.

If you found the article helpful, feel free to let me know by leaving a comment below. Check out my other post here on Medium:

If you are not subscribed to Medium, consider subscribing using my referral link. It’s cheaper than Netflix and objectively a much better use of your time. If you use my link, I earn a small commission, and you get access to unlimited stories on Medium.

I also write regularly on Twitter; follow me here.



