Fundamentals of Statistics All Data Scientists & Analysts Should Know — With Code — Part 1 | by Zoumana Keita | Jan, 2023


Image by Clay Banks on Unsplash

Building Machine Learning models is great for making predictions. However, models alone are not enough when it comes to gaining a deeper understanding of your business problem, which is where most of the time in statistical modeling is spent.

This article will first build your understanding of the fundamentals of statistics that can benefit Data Scientists' and Data Analysts' day-to-day activities and help the business make actionable decisions. It will then guide you through hands-on practice of those statistical concepts using Python.

Before starting to work with data, let’s first understand the concept of population and sample.

→ A population is the set of all items you are interested in (events, people, objects, etc.). In the image below the population is made of seven people.

→ A sample on the other hand is just a subset of a population. The sample from the image contains two people.

Illustration of sample and population (Image by Author)

In real life, it is hard to find and observe entire populations. Gathering a sample, on the other hand, is less time-consuming and cheaper. These are the main reasons why we prefer working with samples, and most statistical tests are designed to work with incomplete data, which corresponds to samples.

A sample needs to satisfy the following two criteria in order to be valid: (1) random and (2) representative.

→ A random sample means that each element within the sample is chosen strictly at random from the population.

→ A sample is representative when it accurately reflects the population. For instance, a sample should not contain only men when the population is made of men and women.
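As a quick illustration of random sampling, Python's built-in random module can draw a sample without replacement; the population list below is hypothetical:

```python
import random

# Hypothetical population of seven people
population = ["Ana", "Bob", "Chen", "Dana", "Eli", "Fatou", "Gus"]

# Fix the seed so the draw is reproducible
random.seed(42)

# Draw a random sample of two people without replacement
sample = random.sample(population, k=2)

print(sample)
```

Every element of the population has the same chance of being selected, which is what makes the sample random.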

Data in real life is made of different types. Knowing them is important because different types of data have different characteristics and are collected and analyzed in different ways.

Different types of Data (Image by Author)

There are three main measures of central tendency: mean, median, and mode. When exploring your data, all three measures should be applied together in order to reach a sound conclusion; relying on only one of them can give a misleading picture of your data.

This section focuses on defining each of them including their pros and cons.

Mean

Also known as the average (µ for a population, x̄ for a sample), the mean corresponds to the center of a finite set of numbers. It is computed by summing all the numbers and dividing the sum by the total number of elements. Considering a set of numbers x1, x2, …, xn, the mean is defined as follows:

The formula of the mean from Wikipedia: x̄ = (x1 + x2 + … + xn) / n
  • x̄ (x with an overhead bar) denotes the sample mean.
  • n denotes the total number of observations in the sample set.

Below is an implementation in Python.

# Import the mean function from statistics module
from statistics import mean

# Define the set of numbers
data = [5, 53, 4, 8, 6, 9, 1]

# Compute the mean
mean_value = mean(data)

print(f"The mean of {data} is {mean_value}")

The previous code should generate the following result:

The mean of [5, 53, 4, 8, 6, 9, 1] is 12.285714285714286

Even though the mean is the most commonly used of the three, it is easily affected by outliers and may therefore not be the best option for drawing relevant conclusions.
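To see this sensitivity concretely, here is a small sketch comparing the mean of the same data with and without the outlier 53:

```python
from statistics import mean

data_with_outlier = [5, 53, 4, 8, 6, 9, 1]
data_without_outlier = [5, 4, 8, 6, 9, 1]

# The single outlier pulls the mean far above most of the values
print(mean(data_with_outlier))     # about 12.29
print(mean(data_without_outlier))  # 5.5
```

One extreme value more than doubles the mean, even though the rest of the data is unchanged.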

Median

The median represents the middle value of the data after it has been sorted in ascending or descending order: the middle element when the number of observations is odd, and the average of the two middle elements when it is even. The formula is given below.

The formula of the median from Wikipedia

As opposed to the mean, the median is not affected by the presence of outliers and can for that reason be a better measure of central tendency. However, both the median and the mean only work for numerical data.

Using the same data above we can compute the median as follows:

# Import the median function from statistics module
from statistics import median

# Compute the median
median_value = median(data)

print(f"The median of {data} is {median_value}")

The execution generates the result below:


The median of [5, 53, 4, 8, 6, 9, 1] is 6

Let’s break down how the median of this data is computed:

  • Step 1: arrange the data in increasing order: [1, 4, 5, 6, 8, 9, 53]
  • Step 2: in our case n = 7, which is odd.
  • Step 3: the middle value is the (n + 1)/2 th term, which is the (7 + 1)/2 = 4th term, hence 6.
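For completeness, when n is even the median is the average of the two middle terms, which statistics.median handles automatically:

```python
from statistics import median

# Even number of observations: the two middle values are 5 and 6
even_data = [1, 4, 5, 6, 8, 9]

# median returns the average of the two middle values: (5 + 6) / 2
print(median(even_data))  # 5.5
```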

Mode

It corresponds to the most occurring value in the data and can be applied to both numerical and categorical variables.

Similarly to the median, the mode is not sensitive to outliers. However, the mode is undefined when all the values in the data occur the same number of times, and a dataset can have more than one mode (bimodal, trimodal, and so on) when several values tie for the highest number of occurrences.

Let’s use a different dataset to illustrate the use of the mode.

# Import the mode function from statistics module
from statistics import mode

# Define the data
data = [5, 9, 4, 9, 7, 9, 1]

# Compute the mode
mode_value = mode(data)

print(f"The mode of {data} is {mode_value}")

All the values in the data occur once, except 9, which occurs three times. Since the mode corresponds to the most occurring value, the result of the above code is:

The mode of [5, 9, 4, 9, 7, 9, 1] is 9
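When several values tie for the highest count, statistics.mode returns only the first mode encountered; since Python 3.8, statistics.multimode returns all of them. A short sketch with a hypothetical bimodal dataset:

```python
from statistics import multimode

# Hypothetical data where both 5 and 9 occur twice
bimodal_data = [5, 9, 4, 9, 7, 5, 1]

# multimode returns every most-frequent value, in order of first appearance
print(multimode(bimodal_data))  # [5, 9]
```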

Skewness and kurtosis are the two main measures that describe the shape of a given dataset's distribution. This section covers each one in detail, including illustrations using Python.

Before diving into the explanation of each concept, let’s import the necessary Python libraries.

  • NumPy is used to work with arrays.
  • The scipy module is used for statistical analysis.
  • For visualization purposes, we use the matplotlib library.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from scipy.stats import beta, kurtosis

Skewness

A distribution is said to be skewed when it is not symmetric around its mean. Three main scenarios can occur depending on the value of the skewness.

The following helper function is used to illustrate and plot each case.

# Use a seed to get reproducible random results
np.random.seed(2023)

def plot_skewness(data, label):
    plt.hist(data, density=True, color='orange', alpha=0.7, label=label)
    plt.axvline(data.mean(), color='green', linestyle='dashed', linewidth=2, label='Mean')
    plt.axvline(np.median(data), color='blue', linestyle='dashed', linewidth=2, label='Median')
    plt.legend()
    plt.show()

  • The distribution is symmetric (zero skewness) when the data follows a normal distribution. In this case Mean = Median = Mode.
# Normal distribution
normal_data = np.random.normal(0, 1, 1000)
label = 'Normal: Symmetric Skewness'
plot_skewness(normal_data, label)
A normal distribution or symmetric skewness (Image by Author)
  • There is positive skewness (right skewness) when the value is greater than zero. The tail of the distribution extends to the right, and the mean lies to the right of the median. In this case, Mean > Median > Mode.
# Exponential distribution
exp_data = np.random.exponential(1, 1000)
label = 'Exponential: Positive Skewness'
plot_skewness(exp_data, label)
Illustration of the Positive skewness (Image by Author)
  • When the value is less than zero, the distribution has negative skewness (it is skewed to the left). The tail extends to the left, and the mean generally lies to the left of the median. In this scenario, Mean < Median < Mode.
# Beta distribution (left-skewed for a=5, b=2)
beta_data = beta.rvs(5, 2, size=10000)
label = 'Beta: Negative Skewness'
plot_skewness(beta_data, label)
Illustration of the Negative skewness (Image by Author)
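The skewness value itself can also be computed numerically (scipy.stats.skew does this). As a minimal pure-Python sketch, the population Fisher-Pearson skewness coefficient is the third standardized moment; the small dataset below is hypothetical:

```python
def skewness(data):
    """Population (Fisher-Pearson) skewness: m3 / m2**1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # variance (second central moment)
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data with a long right tail
right_skewed = [1, 2, 2, 3, 10]

print(skewness(right_skewed) > 0)  # True: positively skewed
```

A positive result confirms the long right tail; a perfectly symmetric dataset would score zero.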

Kurtosis

Kurtosis is used to characterize the flatness or peakedness of the distribution of the data: it tells us whether the data is spread out or concentrated around the mean.

A distribution with a higher concentration around the mean and heavier tails is said to have high kurtosis. A low kurtosis corresponds to a flatter distribution with fewer data points concentrated around the mean.

Furthermore, kurtosis is used to check whether the data follows a normal distribution, and also for detecting the presence of outliers in the data.
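As a concrete sketch of the computation (assuming Fisher's convention, where 3 is subtracted so that a normal distribution scores 0, matching the fisher=True argument used in the scipy example later in this section), excess kurtosis is the fourth standardized moment minus 3; the dataset is hypothetical:

```python
def excess_kurtosis(data):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # variance (second central moment)
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

# A flat, evenly spread dataset has negative excess kurtosis (platykurtic)
flat_data = [1, 2, 3, 4, 5]

print(excess_kurtosis(flat_data))  # approximately -1.3
```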

There are overall three main types of kurtosis that a given dataset can display: (1) Mesokurtic, (2) Leptokurtic, and (3) Platykurtic. In addition to explaining each concept, Python code will show how to compute each one.

(1) Mesokurtic: in this case, kurtosis = 3 (equivalently, excess kurtosis = 0 under Fisher's definition). This kurtosis matches that of a normal distribution, which is why the mesokurtic case is mainly used as a baseline when comparing other distributions.

Illustration of the Mesokurtic distribution (Image by Author)

(2) Leptokurtic, also known as positive kurtosis, has kurtosis > 3. Often referred to as a “peaked” distribution, a leptokurtic distribution has a higher concentration of data around the mean and heavier tails than the normal distribution.

Illustration of the Leptokurtic distribution (Image by Author)

(3) Platykurtic, also known as negative kurtosis, has kurtosis < 3. Often referred to as a “flat” distribution, a platykurtic distribution has a lower concentration of data around the mean than the normal distribution and also has shorter, lighter tails.

Illustration of the Platykurtic distribution (Image by Author)

The following code, adapted from the official scipy documentation, illustrates how to compute the kurtosis.

x = np.linspace(-5, 5, 100)
ax = plt.subplot()
distnames = ['laplace', 'norm', 'uniform']

for distname in distnames:
    if distname == 'uniform':
        dist = getattr(stats, distname)(loc=-2, scale=4)
    else:
        dist = getattr(stats, distname)
    data = dist.rvs(size=1000)
    kur = kurtosis(data, fisher=True)
    y = dist.pdf(x)
    ax.plot(x, y, label="{}, {}".format(distname, round(kur, 3)))
ax.legend()

Illustration of the three main kurtosis and their values (Image from the code)
  • The Laplace distribution carries the properties of a leptokurtic distribution: its tails are more pronounced than those of the normal distribution.
  • The uniform distribution has the least pronounced tails, consistent with its negative kurtosis (platykurtic).

This first part of the series has covered the different types of data, the difference between sample and population, the main measures of central tendency, and finally the measures of shape: skewness and kurtosis.

Stay tuned for the next section which will cover more topics to help you acquire relevant statistics skills.

If you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $5-a-month commitment, you unlock unlimited access to stories on Medium.

Would you like to buy me a coffee ☕️? → Here you go!

Feel free to follow me on Medium, Twitter, and YouTube, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!

Source code available on GitHub.



