
Common Issues that Will Make or Break Your Data Science Project
by Jason Chong, October 2022



A useful guide on spotting data problems, why they can be detrimental, and how to properly address them

Photo by Daniel K Cheung on Unsplash

I believe most people would be familiar with the survey that indicates data scientists spend about 80% of their time preparing and managing data. That’s 4 out of 5 days in the workweek!

Though this may sound insane (or boring), you quickly realize why this trend exists, and I think it goes to show the importance of data cleaning and data validation.

Rubbish in, rubbish out.

Getting your data right is more than half the battle won in any analytics project. In fact, no fancy or complicated model will ever be sufficient to compensate for low-quality data.

For beginners who are just starting out in this field (certainly the case for me), I understand it can be difficult to know what exactly to look out for when dealing with a new dataset.

It is with this in mind that I want to present a guide to common data issues you will stumble upon at some point in your journey, along with a framework for how to properly deal with these issues and their respective trade-offs.

This is by no means an exhaustive list, but it covers the most crucial and frequent issues you will find when preprocessing and interpreting your data.

1. Duplicates

Duplicates are quite simply repeated instances of the same data in the same table and, in most cases, duplicates should be completely removed.

There are built-in functions in most programming languages these days that can help detect duplicate data, for example, the duplicated function in R.

On the note of handling duplicates, it is also important to understand the concept of primary keys.

A primary key is a unique identifier for each row in a table. Every row has its own primary key value, and it should not repeat at all. For example, in a customer table, this could be the customer ID field or in a transaction dataset, this could be the transaction ID.

Identifying the primary key in a table is a great way to check for duplicates. Specifically, the number of distinct values in the primary key needs to be equal to the number of rows in the table.

If they are equal then great. If not, you need to investigate further.

A primary key doesn’t necessarily have to be just a single column. Multiple columns can form the primary key for a table.
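
As a quick sketch of both checks, assuming pandas and a hypothetical transactions table with transaction_id as its primary key (all names here are purely illustrative):

```python
import pandas as pd

# Toy transaction table; the column names are illustrative only.
transactions = pd.DataFrame({
    "transaction_id": [101, 102, 102, 103],
    "customer_id": [1, 2, 2, 3],
    "amount": [9.99, 25.00, 25.00, 14.50],
})

# Flag fully duplicated rows (the pandas analogue of R's duplicated()).
print(transactions[transactions.duplicated(keep=False)])

# Primary-key check: the number of distinct key values should equal the number of rows.
n_rows, n_keys = len(transactions), transactions["transaction_id"].nunique()
if n_rows != n_keys:
    print(f"{n_rows} rows but only {n_keys} distinct transaction_id values -> investigate further")

# Drop exact duplicates, keeping the first occurrence.
transactions = transactions.drop_duplicates()
```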

2. Missing values

Generally, there are two types of missing values: NA (short for not available) and NaN (short for not a number).

NA means the data is missing for an unknown reason, whereas NaN means an operation produced a result that cannot be represented as a number, for example, the square root of a negative number or the result of dividing zero by zero.
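
A minimal sketch of the distinction, using numpy and pandas (numpy scalars are used because plain Python raises an error for 0/0 instead of returning NaN):

```python
import numpy as np
import pandas as pd

# NaN arises when a numeric operation has no representable result.
with np.errstate(divide="ignore", invalid="ignore"):
    print(np.float64(0.0) / np.float64(0.0))  # nan (0 divided by 0)
    print(np.sqrt(-1.0))                      # nan (no real square root)

# NA is simply "value not available"; pandas flags both as missing.
s = pd.Series([1.0, np.nan, 3.0])
print(s.isna())  # True for the missing entry
```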

Missing values can cause our models to fail or lead to the wrong interpretations; therefore, we need to find ways to address them. There are two main approaches: omitting observations with missing values, or imputation.

In the event of a massive dataset, we can simply drop the rows with missing data; however, we run the risk of losing information, and this approach is not suitable for small datasets.

Value imputation, on the other hand, can be categorized into univariate and multivariate imputation. I have written a blog post in the past that talks about imputation, feel free to check it out if you are interested in diving deeper into the topic.

Effectively, univariate imputation substitutes values based on a single column, using the mean, median, or mode. Multivariate imputation, on the other hand, considers multiple columns and involves fitting a model: for example, a linear regression to impute continuous variables, or a simple classifier such as k-nearest neighbours for categorical variables. Multivariate imputation is usually preferred over univariate imputation as it produces more accurate estimates of the missing data.

Judgment will be required on the best way to deal with missing values. There could also be situations where a claim of a zero payment is denoted NA rather than 0 in a dataset. In this particular scenario, it makes sense to simply replace NA with 0. In other cases, more consideration will need to be involved.
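
A hedged sketch of the options using scikit-learn, with KNN imputation standing in for the multivariate approach and a made-up two-column dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, np.nan],
    "income": [40_000, 52_000, np.nan, 88_000, 61_000],
})

# Option 1: drop rows with any missing value (fine for huge datasets, wasteful for small ones).
dropped = df.dropna()

# Option 2: univariate imputation -- fill each column with its own median.
median_imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns
)

# Option 3: multivariate imputation -- KNN uses the other columns to estimate missing values.
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)
```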

3. Outliers

Outliers are data points that differ substantially from the rest. They can skew an analysis or model.

One way of identifying outliers is to apply the interquartile range (IQR) criterion: if an observation lies more than 1.5*IQR above the upper quartile or more than 1.5*IQR below the lower quartile, it is considered an outlier.

Box plots usually plot these points as dots past the whisker points, and this is considered a univariate approach for detecting outliers. Histograms are equally great at visualizing distributions and spotting potential outliers. For two variables, consider using scatter plots.

How to deal with outliers? Well, you can either keep, drop, cap, or impute them using mean, median, or a random number.
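
A minimal sketch of the IQR criterion and one possible treatment (capping), using pandas on a small made-up series:

```python
import pandas as pd

x = pd.Series([10, 12, 11, 13, 12, 14, 11, 95])  # 95 is an obvious outlier

# IQR fences: 1.5 * IQR below the lower quartile and above the upper quartile.
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]
print(outliers)

# One possible treatment: cap (winsorise) values at the IQR fences.
capped = x.clip(lower=lower, upper=upper)
```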

4. Correlation vs causation

This is not so much a data issue, but more a reminder of how to interpret variables that are correlated with each other.

Machine learning models are great at learning relationships between input data and output predictions. However, they cannot reason about cause and effect. Hence, one must be careful when drawing conclusions and not over-interpret the associations between variables.

There is a famous quote in statistics, “correlation does not imply causation”.

There are several reasons why one variable can be correlated with another without having any direct effect on it. These include spurious correlation, outliers, and confounders.

We will discuss each of them in more detail below.

Correlation does not imply causation.

4.1 Spurious correlation

Spurious correlation is when two variables appear correlated even though, in reality, there is no genuine relationship between them.

Image by Tyler Vigen — Spurious Correlations (CC BY 4.0)

As seen from the chart above, a comical example is the correlation between cheese consumption and deaths by becoming tangled in bedsheets. Obviously, neither variable has any logical causal effect on the other, and this is nothing more than mere coincidence.

This is one of many other examples that you can find here.

4.2 Correlation caused by outliers

Correlations can also sometimes be driven by outliers.

We can test this by removing the outliers; if the correlation is driven by them, it should decrease significantly as a result. This reinforces the importance of identifying outliers when exploring a dataset, as mentioned in the previous section.

Alternatively, we can compute Spearman correlations instead of Pearson correlations, as Spearman correlation is based on the rank order of the values and is therefore far less susceptible to the influence of outliers.
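
A small illustration, assuming pandas and numpy: two unrelated variables plus a single extreme point typically show an inflated Pearson correlation, while the Spearman correlation stays near zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)  # unrelated to x by construction

# Append a single extreme point that sits far away in both dimensions.
df = pd.DataFrame({"x": np.append(x, 10), "y": np.append(y, 10)})

print("Pearson :", round(df["x"].corr(df["y"], method="pearson"), 2))   # inflated by the outlier
print("Spearman:", round(df["x"].corr(df["y"], method="spearman"), 2))  # close to zero
```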

4.3 Correlation caused by confounding variables

Confounders are probably the most common reason for correlations being misinterpreted. If variables X and Y are correlated, we call Z a confounder if changes in Z cause changes in both X and Y.

For example, suppose you want to investigate the mortality rates between two groups, one group that consists of heavy alcohol drinkers and another consisting of those who never drink alcohol. The mortality rate would be the response variable and alcohol consumption would be your independent variable.

If you find heavy drinkers are more likely to die, it might seem intuitive to conclude that alcohol use increases the risk of death. However, alcohol use is likely to not be the only mortality-affecting factor that differs between the two groups. For example, those who never drink alcohol may be more likely to have a healthier diet or less likely to smoke, both of which also have an effect on mortality. These other influencing factors (diet and smoking habits) are called confounding variables.

So, how do we address this? For a small number of confounders, we can use a method called stratification: we split the data into groups (strata) within which the confounding variables do not vary much, and then examine the relationship between the independent and dependent variables within each group.

Looping back to our previous example, we could divide the sample into groups of smokers and non-smokers and then examine the relationship between alcohol consumption and mortality within each.
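
A toy sketch of stratification on simulated data, where smoking drives both drinking and mortality (the numbers and column names are made up purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1_000
smoker = rng.integers(0, 2, size=n)
# In this toy data, smokers both drink more and die more, confounding the picture.
drinks_per_week = rng.poisson(3 + 4 * smoker)
died = rng.binomial(1, 0.05 + 0.10 * smoker)

df = pd.DataFrame({"smoker": smoker, "drinks_per_week": drinks_per_week, "died": died})

# Crude association, ignoring the confounder.
print("Overall:", round(df["drinks_per_week"].corr(df["died"]), 3))

# Stratified: examine the association separately within smokers and non-smokers.
for level, group in df.groupby("smoker"):
    print(f"smoker={level}:", round(group["drinks_per_week"].corr(group["died"]), 3))
```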

Theoretically, this suggests that we should include all explanatory variables that have a relationship with the response variable. Unfortunately, it is not always possible to collect or accurately measure every confounder. Furthermore, adding too many explanatory variables is likely to introduce multicollinearity and increase the variance around the regression estimates.

This is the trade-off between bias and precision: as we include more variables, we reduce the bias in our predictions, but multicollinearity increases and, as a result, so does the variance.

5. Feature engineering

Feature engineering is the process of selecting and transforming raw data into features to be used to train a model.

When preparing our dataset, we need to know what type of variables are in each column, so that they can be used appropriately to solve a regression or classification problem.

Some considerations that are important to think about include:

  • What are my features and their properties?
  • How do my features interact with each other to fit a model?
  • How can I adjust my raw features to create better predictors?

By examining summary statistics, we are usually able to determine the properties of our features. However, choosing or constructing the right features is not an easy task and often comes down to experience and expertise in the domain.

Nevertheless, below are 3 examples of feature engineering you can consider doing for your project.

5.1 Categorical variables for algorithms that are unable to handle them

For algorithms that cannot handle categorical variables directly, such as logistic regression and support vector machines, which expect all inputs to be numeric, the popular approach is to convert a categorical variable with n levels into n numerical variables, each taking a value of 1 or 0. This is called one-hot encoding.

Regression problems often use a slight variation of one-hot encoding called dummy encoding. The difference is that dummy encoding generates n-1 numerical variables instead of n.

Image by Author

As you can see, with dummy encoding, if we know the values of two of the variables, we can easily deduce the value of the third. Specifically, if the two dummy variables are both 0, the observation must belong to the remaining (baseline) category. In doing so, we avoid feeding our regression model redundant information that may result in non-identifiability.

There are packages in common programming languages like Python and R that enable one-hot encoding and dummy encoding.
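
For example, in Python, pandas can produce either encoding with get_dummies; the colour column below is purely illustrative:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encoding: n indicator columns, one per category.
one_hot = pd.get_dummies(df, columns=["colour"])

# Dummy encoding: n-1 columns, dropping one baseline level to avoid redundancy.
dummy = pd.get_dummies(df, columns=["colour"], drop_first=True)

print(one_hot)
print(dummy)
```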

5.2 Categorical variables with high cardinality

High cardinality means having too many unique values.

Algorithms such as decision trees and generalized linear models (GLMs) struggle to handle categorical data with high cardinality.

Decision trees split features such that each sub-tree becomes as homogeneous as possible. Therefore, the number of splits grows as cardinality grows, increasing model complexity.

GLMs, on the other hand, create dummy variables for each level of categorical data. Therefore, for each categorical variable with n categories, the model will generate n-1 additional parameters. This is not ideal as it can lead to overfitting and poor out-of-sample predictions.

One popular way to deal with categorical variables with high cardinality is binning, which is the process of combining classes of categorical variables that are similar. This grouping often requires domain knowledge of the business environment or knowledge gained from data exploration, for example, by examining the frequency of the levels and analyzing the relationship between the variable of interest and the response variable. After binning the categories, one-hot encoding can be used to transform the categorical variables to dummy numeric variables with values 1 and 0.

Some examples of binning include grouping countries into continents or life expectancy into ranges of 0–50 years and 50+ years.
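
A small sketch of frequency-based binning (lumping rare levels into an "Other" bucket, as one simple form of grouping) and numeric binning with pandas; the thresholds and column names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "UK", "DE", "FR", "SG", "NZ", "US", "UK", "BR"],
    "life_expectancy": [79, 76, 81, 45, 83, 84, 82, 48, 80, 75],
})

# Frequency-based binning: keep common levels, lump rare ones into "Other".
counts = df["country"].value_counts()
common = counts[counts >= 2].index
df["country_binned"] = df["country"].where(df["country"].isin(common), "Other")

# Numeric binning: life expectancy into 0-50 and 50+ ranges.
df["life_exp_band"] = pd.cut(df["life_expectancy"], bins=[0, 50, 120], labels=["0-50", "50+"])

# The binned categoricals can then be one-hot encoded as usual.
encoded = pd.get_dummies(df[["country_binned", "life_exp_band"]])
```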

5.3 High number of features or variables in the dataset

Similar to high cardinality, when we have too many features in a dataset, we are faced with the challenges of long training times as well as overfitting. Having too many features also makes data visualization difficult.

As a beginner starting out in data science, I remember thinking that it is always better to have more variables than fewer variables when building a model, but this is not true.

Reducing the number of variables is crucial in simplifying a dataset so that we only need to focus on features that are actually meaningful and likely to carry the most signal rather than noise.

There are several ways to reduce the number of features, but here I will share 4 ways you may want to consider:

  1. Domain knowledge: Manually select variables if experience or expertise tells you that certain ones predict the response variable well. For example, debt-to-income ratio is a common metric used to assess a person’s creditworthiness and predict default probability.
  2. Dimension reduction: Reduce the number of variables by projecting points into a lower-dimensional space. The aim is to obtain a new set of features that is smaller than the original but still preserves as much information as possible. A popular dimension reduction technique is Principal Component Analysis (PCA). The new features produced by PCA are combinations of the original features that are uncorrelated with each other, with the first few capturing most of the variation in the data (see the sketch after this list). I have written about the PCA algorithm in the past here if you are interested in learning about it in more detail.
  3. Subset selection: Find a subset of variables that performs well and remove redundant variables through a process called stepwise regression, either via forward selection or backward elimination. Forward selection starts with no variables and at each step adds the variable that most improves model fit. Backward elimination starts with all variables and at each step removes the variable whose removal causes the smallest deterioration in model fit.
  4. Shrinkage: Techniques such as LASSO and ridge regression reduce the risk of overfitting by adding a penalty term to the residual sum of squares (RSS). I won’t go into too much detail, but feel free to read up on these techniques here.
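
A minimal sketch of dimension reduction and shrinkage with scikit-learn, using the bundled diabetes dataset as a stand-in; the variance threshold and penalty strength are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Dimension reduction: keep enough principal components to explain 90% of the variance.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_scaled)
print("original features:", X.shape[1], "-> components kept:", X_reduced.shape[1])

# Shrinkage: LASSO drives the coefficients of weak features to exactly zero.
lasso = Lasso(alpha=0.5).fit(X_scaled, y)
print("features kept by LASSO:", int(np.sum(lasso.coef_ != 0)))
```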

6. Imbalanced data

An imbalanced dataset occurs when we have clear minority classes that are sparse and majority classes that are in abundance.

This is an issue in classification problems because our model does not get enough information about the minority class in order to make an accurate prediction. Specifically, because of the imbalance, models are more likely to show a bias for the majority class which can potentially lead to misleading conclusions.

Common examples of imbalanced datasets can be found in fraud detection, customer churn, and loan default.

Let’s take fraud detection as an example. Fraudulent transactions typically make up only a tiny proportion of a large dataset (you would hope so, otherwise everyone would avoid using the bank). Suppose there is only 1 case of fraud in every 1,000 transactions, representing 0.1% of the full dataset. If a machine learning algorithm simply predicted that 100% of transactions are not fraudulent, it would achieve an accuracy rate of 99.9%, which may seem extremely high on the surface.

However, if the bank were to implement this model, it would be unable to flag future fraudulent transactions, and this could prove very costly.
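
A tiny sketch of this accuracy paradox, mirroring the 1-in-1,000 example above with scikit-learn metrics:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 transactions, exactly one of which is fraudulent (label 1).
y_true = np.zeros(1_000, dtype=int)
y_true[0] = 1

# A "model" that always predicts non-fraud.
y_pred = np.zeros(1_000, dtype=int)

print("accuracy:", accuracy_score(y_true, y_pred))              # 0.999
print("recall on fraud class:", recall_score(y_true, y_pred))   # 0.0 -- no fraud ever caught
```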

There are a few sampling methods for treating imbalanced data:

  • Undersampling: This is where we decrease the number of samples of the majority class. The disadvantage of undersampling is that we will lose a lot of valuable data.
  • Oversampling: This is where we increase the number of samples of the minority class. The disadvantage of oversampling is that we create excessive duplicate data points, which may cause our model to overfit.
  • Synthetic Minority Oversampling Technique (SMOTE): SMOTE generates synthetic minority-class examples by interpolating between existing minority points and their nearest neighbours, aiming to strike a balance between undersampling and oversampling. The advantage of SMOTE is that we are not creating exact duplicates, but rather new data points that are slightly different from the originals (see the sketch after this list).
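
A sketch of random undersampling and oversampling with scikit-learn's resample utility, plus SMOTE shown as a commented-out option since it lives in the separate imbalanced-learn package; the toy data mirrors the fraud example above:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy dataset: 10 fraud cases out of 1,000 transactions.
df = pd.DataFrame({"feature": np.arange(1_000), "is_fraud": [1] * 10 + [0] * 990})
minority = df[df["is_fraud"] == 1]
majority = df[df["is_fraud"] == 0]

# Undersampling: shrink the majority class down to the size of the minority class.
undersampled = pd.concat([
    resample(majority, replace=False, n_samples=len(minority), random_state=0),
    minority,
])

# Oversampling: duplicate minority rows until they match the majority class size.
oversampled = pd.concat([
    majority,
    resample(minority, replace=True, n_samples=len(majority), random_state=0),
])

# SMOTE (requires the imbalanced-learn package), which synthesises new minority
# points by interpolating between neighbouring minority samples:
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(random_state=0).fit_resample(df[["feature"]], df["is_fraud"])
```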

As I said at the beginning of this blog post, this is by no means an exhaustive list of things to look out for, but I hope this guide helps you not only become more aware of these problems but also feel better equipped to deal with them.

If you found any value in this article and are not yet a Medium member, it would mean a lot to me as well as the other writers on this platform if you sign up for membership using the link below. It encourages us to continue putting out high-quality and informative content just like this one — thank you in advance!

Don’t know what to read next? Here are some suggestions.


