Interpreting EDA: Chapter I. Ever wondered how data tell stories… | by Dhruv Gangwani | Jun, 2022

By Jessie Hobb On Jun 2, 2022

Ever wondered how data tell stories? Visualisations narrate them.

In 2017, data surpassed oil to become the most valuable asset on this planet. Data is generated in abundance in every sector. According to CloudTweaks, at least 2.5 quintillion bytes of data are produced every day (that’s 2.5 × 10^18, or 25 followed by a staggering 17 zeros). That’s amazing. But the human brain cannot digest even 1% of such a huge volume. So, to bring flat files of data to life and turn them into information, we need exploratory data analysis. John Tukey, often known as the “Father of exploratory data analysis,” said,

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

Photo by Anastasiia Malai on Unsplash

The term “Interpreting” in the title exists for a reason. Together, we will understand the conclusions that can be drawn from some common visualizations. Visualizations are not just beautiful, eye-catching pictures; each type of plot has its own story to narrate. In this first part of the blog series, I will cover three important types of plots: the histogram, the scatter plot, and the count plot. All the visualizations are built on the famous “Iris Dataset,” which is about classifying three species of flowers based on petal and sepal features.

Histograms are used to analyze and understand the distribution of a sample. It might be normal, left-skewed, right-skewed, uniform, or something else. One reason to understand the distribution is that it helps in dealing with missing values. For example,

If the distribution is left-skewed or right-skewed, replace null values with the median, as the median is not affected by extreme values.
If it is approximately normal, replace null values with the mean.
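
The rule of thumb above can be sketched in a few lines. This is a minimal illustration rather than a definitive recipe: the helper name `impute_by_shape`, the skewness cut-off of 0.5, and the sample values are all assumptions made for the example.

```python
import numpy as np

def impute_by_shape(values, skew_threshold=0.5):
    """Fill NaNs with the median if the sample looks skewed, else with the mean.

    skew_threshold is an assumed cut-off, not a standard constant.
    """
    x = np.asarray(values, dtype=float)
    observed = x[~np.isnan(x)]
    centered = observed - observed.mean()
    # Sample skewness: third central moment over the cubed standard deviation.
    skew = (centered ** 3).mean() / (centered ** 2).mean() ** 1.5
    fill = np.median(observed) if abs(skew) > skew_threshold else observed.mean()
    return np.where(np.isnan(x), fill, x), skew

# Made-up right-skewed sample: one extreme value drags the mean upwards.
skewed = [1.0, 1.2, 1.1, 1.3, np.nan, 1.2, 9.0]
filled, skew = impute_by_shape(skewed)
print(round(skew, 2), filled[4])
```

Here the single large value pulls the mean well above most of the observations, so the median is the safer fill value.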

To compare two distributions, make sure both histograms use the same scales on the X and Y axes and the same bin size, because the bin size can change the apparent shape of a distribution. A histogram shows whether the distribution is unimodal, bimodal, or multimodal, and how widely it is spread. It also helps to spot potential outliers and high-leverage points, which show up as isolated bars far from the bulk of the data.
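
One way to guarantee identical bins when comparing two samples is to compute the bin edges once, from the combined data, and reuse them. The sketch below assumes NumPy and uses synthetic samples (not the Iris data); the variable names are illustrative.

```python
import numpy as np

# Synthetic samples standing in for two features to compare (not Iris data).
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=5.0, scale=0.5, size=200)
sample_b = rng.normal(loc=6.0, scale=0.8, size=200)

# Derive one set of bin edges from the combined range so both
# histograms share the same X axis and the same bin width.
edges = np.histogram_bin_edges(np.concatenate([sample_a, sample_b]), bins=10)
counts_a, _ = np.histogram(sample_a, bins=edges)
counts_b, _ = np.histogram(sample_b, bins=edges)

for left, right, a, b in zip(edges[:-1], edges[1:], counts_a, counts_b):
    print(f"{left:5.2f} to {right:5.2f}  a={a:3d}  b={b:3d}")
```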

There are many ways to choose the number of bins (and with it the bin size, which is the data range divided by the number of bins). One of them is Sturges’ rule:

K = 1 + 3.322 log10(N)

where,

K is the number of bins

N is the number of observations
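
As a quick worked example, Sturges’ rule can be evaluated in its base-2 form, K = 1 + log2(N), which is the same as 1 + 3.322 log10(N). The function name `sturges_bins` and the choice to round up to a whole bin are assumptions of this sketch.

```python
import math

def sturges_bins(n_observations):
    """Number of bins suggested by Sturges' rule, rounded up to a whole bin."""
    return math.ceil(1 + math.log2(n_observations))

n = 150  # the Iris dataset has 150 observations
k = sturges_bins(n)
print(k)  # 1 + log2(150) is about 8.23, so this rounds up to 9
```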

Looking carefully at the histograms, sepal width is approximately normally distributed, while the other three features are not, for the following reasons:

Petal length and petal width are not normally distributed: they are not symmetric around the mean for bin sizes of 1 and 0.5, respectively.
Sepal length is bimodal at a bin size of 0.5.

If you want to understand the normal distribution in detail, you can check out my last blog, “Demystifying Estimation: The Basics”. The issue with a histogram is that its shape changes with the bin size. Another way to check whether a distribution is normal is a normal q-q plot, which we’ll go through in the next part of the series.

A scatter plot helps to understand the nature of the relationship between two continuous variables. It can be:

1. Linear relationship

Positive: Both variables increase together. Uphill from left to right
Negative: One variable increases while the other decreases. Downhill from left to right

2. Non-linear relationship: Curvy pattern

3. No pattern: No correlation

Scatter plots with a linear pattern have points that generally fall along a line, while nonlinear patterns follow some curve. Whatever the pattern is, we use it to describe the association between the variables. If there is no clear pattern, there is no clear association between the variables we are studying. In a perfectly linear plot, the relationship can be stated exactly: every unit increase in X changes Y by a fixed amount z. The closer the points lie to a line, the stronger the correlation between the variables.
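
The strength and slope of a linear pattern can also be read off numerically. The sketch below assumes NumPy and uses made-up measurements; `np.corrcoef` gives the Pearson correlation coefficient and `np.polyfit` with degree 1 fits a least-squares straight line.

```python
import numpy as np

# Made-up measurements: y is roughly 2 * x plus a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.1])

r = np.corrcoef(x, y)[0, 1]             # Pearson correlation coefficient
slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line

print(f"r = {r:.3f}, y = {slope:.2f} * x + {intercept:.2f}")
```

A value of r close to +1 or -1 corresponds to points hugging a line, while a value near 0 corresponds to no linear pattern.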

Correlation does not imply causation. If a relationship exists, it does not necessarily mean that an increase/decrease in X causes an increase/decrease in Y. If a feature is plotted against the target variable and shows a high correlation (positive or negative), it explains much of the variation in the target and is likely a good feature for predicting it.

A scatter plot also helps to identify:

Outliers: A point that lies away from the regression line or the overall pattern of points, i.e. its Y value is unusual for its X value. It does not fit the regression line and may pull the line toward itself.
High-leverage observations: A point whose X value is distant from the other points. If it lies in the path of the regression line, it changes the line only slightly; if it is also away from the line, it can change the line by a large margin.
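
For a straight-line fit, both ideas can be put into numbers: the residual measures how far a point sits from the fitted line, and the leverage measures how far its X value sits from the mean of X. This is a minimal sketch with made-up data, using the textbook leverage formula for simple linear regression. The 2p/n cut-off is a common rule of thumb, assumed here rather than taken from the article.

```python
import numpy as np

# Made-up data: the last point has an extreme x but follows the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9, 24.1])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)  # vertical distance from the line

# Leverage for simple linear regression:
# 1/n + (x_i - mean(x))^2 / sum((x - mean(x))^2)
deviation_sq = (x - x.mean()) ** 2
leverage = 1 / len(x) + deviation_sq / deviation_sq.sum()

# Rule of thumb (an assumption here): flag leverage above 2 * p / n,
# with p = 2 fitted coefficients (slope and intercept).
high_leverage = np.flatnonzero(leverage > 2 * 2 / len(x))
print(high_leverage, residuals.round(2))
```

In this example the last point is flagged as high-leverage, yet its residual is small because it lies in the path of the line.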

The issue with the scatter plot is overplotting: the human eye cannot tell how many points overlap at the same location.

The above plot is the pair plot. The pair plot includes the scatter plots of all possible pairs of features, along with the histogram of each feature.
The relationship between petal length and petal width is the most linear, which means they hold a strong relationship.
As the data points run uphill from left to right, they have a positive relationship: petal length increases as petal width increases.
Petal length and sepal length also form one of the most linear plots, with a positive relationship.
Roughly, there are no major outliers in the plot, but there may be some high-leverage points.
In all the scatter plots, a cluster of points is separated from most data points. To dig deeper, I have plotted the pair plot with the target variable “Species” as the third variable.

As seen in the above scatter plots, ‘Iris-Setosa’ is the group that is separated from the other two, especially in terms of petal length and petal width.
The histograms of petal length and petal width show that the petals of ‘Iris-Setosa’ are shorter and narrower than those of the other two classes.
Therefore, petal length and petal width are important features for separating the ‘Iris-Setosa’ class from the other two classes.
The other two classes are harder to tell apart. Still, in terms of petal length and petal width, they are mostly separable, with a thin gap in between.

As this is a classification problem, it is necessary to check whether the target variable is skewed, i.e., imbalanced. If it is imbalanced, we can opt for any of the following:

Random oversampling: duplicating observations of the least frequent class to match the frequency of the most frequent class.
Random undersampling: removing observations of the most frequent class to match the frequency of the least frequent class.
Creating synthetic data using techniques such as SMOTE.

It is important to understand that the aforementioned techniques are applied only to the training dataset. Otherwise, resampled observations would leak into the evaluation data and bias the results.
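
The two random resampling strategies can be sketched with the standard library alone. The function names and the toy training split below are hypothetical. Oversampling here duplicates rows of the smaller classes, while undersampling drops rows from the larger classes, and both are applied to the training rows only.

```python
import random
from collections import Counter, defaultdict

def _group_by_label(rows, labels):
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups

def random_oversample(rows, labels, seed=0):
    """Duplicate random rows of smaller classes until all match the largest."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = max(len(members) for members in groups.values())
    out = []
    for label, members in groups.items():
        extras = [rng.choice(members) for _ in range(target - len(members))]
        out.extend((row, label) for row in members + extras)
    return out

def random_undersample(rows, labels, seed=0):
    """Keep a random subset of larger classes so all match the smallest."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = min(len(members) for members in groups.values())
    out = []
    for label, members in groups.items():
        out.extend((row, label) for row in rng.sample(members, target))
    return out

# Made-up TRAINING split only: 6 rows of class "a", 2 rows of class "b".
train_rows = list(range(8))
train_labels = ["a"] * 6 + ["b"] * 2

over = Counter(label for _, label in random_oversample(train_rows, train_labels))
under = Counter(label for _, label in random_undersample(train_rows, train_labels))
print(over, under)
```

In practice a library such as imbalanced-learn offers ready-made versions of these samplers as well as SMOTE. The hand-written version is only meant to make the mechanics visible.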

As depicted in the above count plot, the target variable has no imbalance/skewness. So, we don’t need any techniques to balance data.
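
A count plot is a picture of per-class frequencies, so the same check can be done numerically. The sketch below rebuilds the label column from the known class sizes of the Iris dataset (50 observations per species); `imbalance_ratio` is a name made up for this example.

```python
from collections import Counter

# The Iris dataset has 50 observations for each of its three species.
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50

counts = Counter(labels)
# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

for species, count in counts.items():
    print(f"{species:16s} {count:3d}  {count / len(labels):.1%}")
print("imbalance ratio:", imbalance_ratio)
```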

In this first blog of the series, we learned how to interpret the histogram, scatter plot, and count plot. We learned two types of interpretation:

General interpretation: It is about what kind of conclusions can be drawn from visualization. For example, a scatter plot shows the nature of the relationship between two continuous variables.
Project-specific interpretation: It is about what we learned from the plot. For example, the petal length and petal width scatter plot shows a positive linear relationship. Therefore, they can be considered good features for classifying Iris-Setosa from the other two classes.

In the next chapters of this series, we will learn about other types of visualizations such as violin plots, normal q-q plots, swarm plots, heat maps, and box plots. I have implemented the exploratory data analysis on this dataset using the seaborn library in Python. You can visit my Kaggle notebook “Iris: Exploratory Data Analysis” for a code walkthrough.

That’s all, folks. Happy learning.