
Fantastic Data Quality Issues and Where to Find Them



Data Quality Chronicles

Navigating the complexity of imperfect data

This is a column series focused on data quality for data science. This first piece covers Imbalanced Data, Underrepresented Data, and Overlapped Data.

Photo by Sergey Sokolov on Unsplash

Garbage in, garbage out. That is the curse of learning from data. In this piece, I’ll go over the importance of feeding high-quality data to your machine learning models and introduce you to killer data quality issues that, if left unchecked, may utterly compromise your data science projects.

From social to medical applications, machine learning has become deeply entangled in our daily lives.

Machine Learning in the wild: with great power, comes great responsibility. A reference to the “Coded Gaze” of Facial Recognition Technology and the work developed by Joy Buolamwini on the Algorithmic Justice League. Photo by engin akyurt on Unsplash

Perhaps you woke up today at 7:45am because an algorithm has been analyzing your sleep patterns and determined that that was the best time for you to start your day without feeling drowsy. Then, you may have driven to work along a route recommended by another algorithm, so that you could avoid traffic.

When you opened your laptop, your email was already sorted into these so-called “smart” folders and the spam was automatically filtered (by yet another algorithm!) so that you may focus only on the messages that matter.

And at the end of this very long day, maybe you have a blind date with someone whose profile was hand-picked (well, script-picked?) for you from among thousands of possibilities. Again, by another algorithm.

When technology becomes as pervasive as machine learning currently is, we would be wise to zero in on these models and the way they learn, because although AI has great potential to serve society, it also has great power for destruction and inequality.

And why is that?

The reason is that algorithms learn from what we teach them.

They learn from the data we feed them, and they expect that data to be “well-behaved” with respect to several of its properties.

Ideally, that would be the case. But our world is imperfect, we are imperfect, and the data we generate naturally carries those imperfections.

Data (or Big Data, a term we’ve been hearing more and more often over the past years) is not the same as Quality Data, and mistaking one for the other may lead to the development of biased and unfair models rather than accurate and reliable ones.

Data versus Quality Data. Image by Author.

Traditionally, machine learning algorithms rely on a few assumptions regarding the training data, such as:

  • Existing classes are equally represented;
  • Existing sub-concepts in data are also equally represented;
  • Instances from different classes occupy different regions of the input space;
  • There is a sufficiently large number of training instances to learn the underlying concepts in data;
  • Feature values are consistent and instances are correctly labeled;
  • Features are informative and relevant for the end task;
  • Training and test data follow the same distribution;
  • All feature values are available for all instances.
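A few of these assumptions can be checked before any modeling begins. As a minimal, hypothetical illustration (the toy dataset and column names below are invented for the example, and pandas is assumed), here is a quick audit of two of them: equal class representation and the availability of all feature values:

```python
import pandas as pd

# Toy dataset: six rows, one missing feature value, a 5:1 class skew
df = pd.DataFrame({
    "feature": [0.1, 0.4, None, 0.9, 0.2, 0.7],
    "label":   ["a", "a", "a", "a", "a", "b"],
})

# How represented is each class? (assumption 1 above)
class_ratio = df["label"].value_counts(normalize=True).to_dict()

# Are all feature values available? (last assumption above)
missing_per_column = df.isna().sum().to_dict()
```

A real audit would of course cover more of the list (label consistency, train/test distribution checks), but even this two-line version catches surprisingly many problems early.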

Naturally, in real-world domains, imperfection is always lurking, and these assumptions are more often broken than not.

When these assumptions are broken, the corresponding data imperfections arise:

  • Imbalanced Data;
  • Underrepresented Data or Small Disjuncts;
  • Class Overlap;
  • Small Data, Lack of Density, or Lack of Data;
  • Inconsistent Data;
  • Irrelevant Data;
  • Redundant Data;
  • Noisy Data;
  • Dataset Shift;
  • Missing Data.

If left untreated, these imperfections may jeopardize the performance of standard machine learning models with nefarious consequences for business applications and people’s lives.

An erroneous alert of credit card fraud that leads to the loss of a critical investment. A failed tumor detection that turns into the hard choice between a painful course of treatment or an end-of-life decision. A misjudgment between individuals with similar facial structures that mistakenly sentences one to face the law and sets the other free.

Imperfection may cost us our money, freedom, and lives.

Before I go into detail regarding these data imperfections, I’d like to clarify the concept of Imperfect Data.

In my own research, I have used this term as an umbrella for any data properties, idiosyncrasies, or issues that are prone to bias the behavior and performance of classifiers (other authors describe them as data intrinsic characteristics, data difficulty factors, or data irregularities).

This means that certain “imperfections” are not to be taken in the literal sense of the word (which could translate to defective data to some extent).

Certainly, some imperfections may arise due to errors in the data acquisition, transmission, and collection processes, but others are a natural product of the intrinsic nature of the domains. They arise naturally, irrespective of how flawless the process of data acquisition, transmission, or collection may be.

The 3 data imperfections covered here — Imbalanced Data, Underrepresented Data, and Overlapped Data — are a fantastic example of this. They most often result from the nature of the domain itself rather than from any mistakes made during data collection or storage.

Imbalanced Data generally refers to a disproportion of the number of examples of each class in a dataset

In other words, classes are not equally represented in the domain, which biases the learning process of classifiers towards well-represented concepts, causing them to potentially overlook or disregard the remaining ones. This is problematic since in most applications the minority class is usually the class of interest.

And where do we find it?

Well, some examples are disease diagnosis, credit card fraud, sentiment analysis, and churn prediction.
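To see concretely why this bias matters, consider how plain accuracy can hide a useless model. The sketch below (scikit-learn assumed; the 9:1 ratio and the dummy classifier are illustrative) scores a "model" that always predicts the majority class:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# ~9:1 class imbalance, mimicking e.g. a fraud-detection setting
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# A "model" that always predicts the majority class, ignoring the features
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

acc = accuracy_score(y, y_pred)               # high, yet the model is useless
bal_acc = balanced_accuracy_score(y, y_pred)  # exposes the failure: 0.5
```

Accuracy comes out near 0.9 simply because the majority class dominates, while balanced accuracy (the average of per-class recalls) drops to 0.5, revealing that the minority class, the class of interest, is never detected.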

An interesting twist: Class imbalance per se may not be the issue!

Indeed, even in a highly imbalanced domain, a standard classifier might obtain satisfying results if the classification problem is of low complexity (e.g., a linearly separable domain).

Imbalanced Data in isolation versus combined with class overlap. Both domains contain the same number of points (500) and imbalance ratio (8:1). Image by Author.
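This twist is easy to reproduce on toy data. The sketch below (scikit-learn assumed; the Gaussian blobs are illustrative) fits a linear model to an 8:1 imbalanced but clearly separable domain and still recovers every minority instance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
# 8:1 imbalance, but the two classes sit in clearly distinct regions
majority = rng.normal(loc=0.0, scale=0.5, size=(400, 2))
minority = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([majority, minority])
y = np.array([0] * 400 + [1] * 50)

clf = LogisticRegression().fit(X, y)
# Imbalance alone did not stop the linear model from capturing the minority
minority_recall = recall_score(y, clf.predict(X), pos_label=1)
```

Push the two blobs together, however, so that imbalance combines with class overlap, and the minority recall degrades quickly, which is exactly the point of the figure above.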

Nevertheless, although class imbalance may be easy to overcome in isolation, it should always be taken into account when training machine learning models, especially regarding the design of appropriate cross-validation approaches and the choice of unbiased classification performance measures.
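One concrete piece of that design is stratification. The sketch below (scikit-learn assumed; the 9:1 toy labels are illustrative) shows how stratified cross-validation preserves the class ratio in every fold, so no fold ends up starved of minority examples:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)   # 9:1 imbalance
X = np.arange(100).reshape(-1, 1)   # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Every test fold receives the same share of minority instances (2 out of 20)
minority_per_fold = [int((y[test] == 1).sum()) for _, test in skf.split(X, y)]
```

With a plain (non-stratified) split, some folds could easily contain zero minority instances, making both training and evaluation on those folds meaningless for the class of interest.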

Underrepresented data is another form of imbalanced data

Whereas in the previous case we were referring to between-class imbalance, underrepresented data is associated with a within-class imbalance phenomenon and arises in the form of small disjuncts.

Small disjuncts are small, underrepresented sub-concepts in data, understood as small clusters within a class concept.

Underrepresented Data is characterised by the appearance of small sub-clusters in data. Image by Author.

Similarly to between-class imbalance, small disjuncts are problematic because classifiers often learn by generating rules for well-represented concepts, i.e., larger disjuncts. Thus, they become prone to overfitting these sub-concepts, which leads to poor classification performance on new examples.

And where to find them?

The appearance of small disjuncts is very common in healthcare data, due to the heterogeneity of some diseases (such as cancer) and the biological diversity among patients. Other examples are facial and emotional recognition.
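One illustrative (not prescriptive) way to surface small disjuncts is to cluster each class separately and inspect the cluster sizes; a tiny cluster sitting far from the class core is a candidate sub-concept. A minimal sketch on synthetic data, assuming scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# One class made of a large core concept and a small, distant sub-concept
core = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(190, 2))
disjunct = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(10, 2))
X_class = np.vstack([core, disjunct])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_class)
# The small disjunct surfaces as a tiny cluster next to the core concept
sizes = sorted(np.bincount(labels).tolist())
```

In practice the number of clusters is unknown and, as the next paragraph notes, a tiny cluster could just as well be noise, which is precisely what makes the problem hard.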

An open challenge in current research is distinguishing between core concepts (even if appearing as clusters in the data space), underrepresented sub-concepts or small disjuncts, and noisy instances. This is not a trivial issue per se, and it becomes more complicated when other problems are present in the data (and they usually are).

Class overlap occurs when instances from different classes coexist in the same region of the data space

As representatives of different concepts populate the same regions, machine learning classifiers have a hard time discriminating them, which leads to poor classification performance (especially affecting the less represented concepts in those regions).

Typical examples of class overlap: domains have an increasing amount of overlapping examples. Image by Author.
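The degree of overlap can be estimated with simple proxies. The sketch below (one possible heuristic among many, assuming scikit-learn; the Gaussian domains are illustrative) flags instances whose nearest neighbours mostly belong to another class and reports their fraction:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def overlap_fraction(X, y, k=5):
    """Fraction of instances whose k nearest neighbours are mostly other-class."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)     # +1: each point is its
    _, idx = nn.kneighbors(X)                           # own nearest neighbour
    other = (y[idx[:, 1:]] != y[:, None]).mean(axis=1)  # other-class share
    return float((other > 0.5).mean())

rng = np.random.default_rng(0)
y = np.array([0] * 100 + [1] * 100)
separated = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
overlapped = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(1, 1, (100, 2))])

low = overlap_fraction(separated, y)    # near 0: classes keep to their regions
high = overlap_fraction(overlapped, y)  # clearly larger: regions coexist
```

Instance-level flags like this also hint at which examples live in overlapped regions, which is useful for the region-based strategies discussed next.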

Over the years, researchers have been approaching this issue either by learning solely from non-overlapped regions (somewhat neglecting the problem), treating the overlapped data as a new class, or building separate classifiers for overlapped and non-overlapped regions.

Other authors try to distinguish between examples scattered throughout the entire input space and those that concentrate on the decision boundaries between concepts, applying tailored strategies to handle each type differently.

Current research is now shifting towards the idea that class overlap is a heterogeneous concept, comprising multiple sources of complexity. In preliminary work, I have distinguished 4 main overlap representations — Feature Overlap, Instance Overlap, Structural Overlap, and Multiresolution Overlap — each associated with distinct complexity concepts.

And where do we find it?

From character recognition to software defect prediction and protein and drug discovery, class overlap is a common data characteristic found in real-world domains.

Whereas the past decades of AI research have been dedicated to producing better models — a paradigm we have been calling Model-Centric AI — the current focus has been shifting from model optimization and hyperparameter tuning to the systematic identification and mitigation of data quality issues — a paradigm recently coined Data-Centric AI.

In the “AI Tower of Babel” we currently live in, truly understanding data and pointing to what matters will prove more transformative than having huge amounts of “information”. This pointing to is the basis of the new Data-Centric AI paradigm. Photo by Killian Cartignies on Unsplash

This new approach comprises a systematic and continuous cycle of iterations over the data, moving from imperfect to smart and actionable data. That naturally requires a deep understanding of data imperfections, their identification and characterization, as well as their combined effects and efficient mitigation strategies.

The Data Quality Chronicles Series introduces the topic of data quality for data science, starting with 3 common data quality issues found in real-world domains: Imbalanced Data, Underrepresented Data, and Overlapped Data. The following parts of the series will be dedicated to characterizing other data quality issues, deep-diving into each one, and introducing the reader to efficient tools and strategies to effectively identify and measure them when handling real-world datasets.

About me

Ph.D., Machine Learning Researcher, Educator, Data Advocate, and overall “jack-of-all-trades”. Here on Medium, I write about Data-Centric AI and Data Quality, educating the Data Science & Machine Learning communities on how to move from imperfect to intelligent data.

Google Scholar | LinkedIn | Data-Centric AI Community

  1. B. Krawczyk, Learning from Imbalanced Data: Open Challenges and Future Directions (2016), Progress in Artificial Intelligence, 5(4), 221–232.
  2. S. Das, S. Datta, B. Chaudhuri, Handling data irregularities in classification: Foundations, trends, and future challenges (2018), Pattern Recognition 81, 674–693.
  3. A. Fernández, S. García, M. Galar, R. Prati, B. Krawczyk, F. Herrera, Data Intrinsic Characteristics (2018), Springer International Publishing, pp. 253–277.
  4. I. Triguero, D. García-Gil, J. Maillo, J. Luengo, S. García, F. Herrera, Transforming big data into smart data: An insight on the use of the k-nearest neighbors algorithm to obtain quality data (2019), Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9, e1289.
  5. M. Santos, P. Abreu, N. Japkowicz, A. Fernández, J. Santos, A Unifying View of Class Overlap and Imbalance: Key concepts, multi-view panorama, and open avenues for research (2023), Information Fusion 89, 228–253.

