QQ Plotting Your Way to Data Enlightenment: A Hitchhiker’s Guide to the Galaxies of Distribution
by Naman Agrawal | Apr, 2023

Image by Pete Linforth from Pixabay

Statistics is a mysterious field. Within its vast corpus of theories, formulations, and frameworks lies knowledge that has profound applications across all areas of study. However, despite its tremendous success, it can be unfriendly too, because a lot of its power comes from the assumptions it makes. Some of the most common assumptions are the i.i.d. assumption (independent and identically distributed) and the normality assumption (that the data in question can be approximated by, or converges to, a normal distribution). As soon as these assumptions start to break, the theories become more and more complex, and even if they do work, they tend to lose a lot of their power. Fortunately, there are theorems and laws that support these assumptions, such as the Central Limit Theorem, which shows that the normality assumption holds for estimators from a wide range of distributions when the sample size is large. However, in order to fully realize the potential of statistics and harness its power, it is essential to verify that the data meet these assumptions before using statistical frameworks or tests to draw conclusions. By carefully assessing whether the data meet the necessary assumptions and adjusting our methods accordingly, we can ensure that statistical analyses provide accurate and reliable insights that help us make informed decisions across a wide range of disciplines.

In this article, we will explore one of the most powerful methods to check whether your data is normally distributed: QQ plots. As the name suggests, it is a graphical method that relies on visual inspection to judge whether your data is normally distributed. There are not many visual methods in statistics, as most hypothesis-testing frameworks rely on estimators, critical regions, and p-values to obtain numeric results. Formal tests of normality exist as well (e.g., the Kolmogorov–Smirnov test), but in practice they are often less informative than the much simpler QQ plot. It is also important to note that QQ plots can be used to check whether your data follows any distribution, not just a normal one. For example, if you believe that income is exponentially distributed, you can use a QQ plot to check that too. It is just that the method is most often used in the context of normal distributions, because of their immense utility in statistics.

A natural question arises: if the QQ plot is also a graphical method, why not simply draw a histogram of your data and compare it with that of a normal distribution? There are, however, some issues to consider.

  1. What bin width would you use? A very large bin width may cause you to miss important features of the distribution, while a very small one may show a lot of noise, making interpretation difficult. The choice of bin width can significantly change your perception and understanding of the underlying distribution.
  2. Additionally, visually inspecting the fit between the histogram and the normal distribution curve can be tedious and subjective. You may overlay a normal distribution curve (as shown in Figure 1) and see how well it fits the heights of the bins, but this process is prone to mistakes.
Figure 1: Difficulty in Using Histograms [Image by Author]

Using a QQ plot is much simpler because drawing the plot does not require you to tune any parameters: you just feed in the data and generate the plot. Visual inspection is also much easier: you only need to check how closely the plotted points follow a straight line, without the added hurdle of comparing bars at different locations along a bell-shaped curve, which can be especially challenging with skewed or multimodal distributions.

A QQ Plot (Quantile-Quantile Plot) is a plot of the sample (or observed) quantiles of the given data against the theoretical (or expected) quantiles. Since we’ll be using these plots to check if the data is normally distributed, the theoretical quantiles must correspond to the quantiles of the normal distribution. But, what are these quantiles?

Quantiles are values that divide a distribution into equal parts. For example, the median is the 50th percentile (or quantile), which divides the distribution into two equal parts.

Sample quantiles are estimated quantiles based on a sample of data. Given a sample of n observations, a sample quantile is a value that divides the sorted sample in a given proportion: roughly k of the n observations (a fraction k/n) lie below it, and the remaining (n − k) lie above it. For example, suppose we are given the following data:

How would you calculate the 50th quantile (i.e., the median)? The first step would be to arrange the data in ascending order, giving us:

Notice that the value 37 splits the data into two equal parts: 5, …, 21 constitute the first part (5 of the remaining 10 values), and 43, …, 88 constitute the second part (the other 5). In general, the kth smallest observation is referred to as the kth order statistic of the data and is denoted X₍ₖ₎. In fact, the kth order statistic estimates the 100*k/(n + 1) quantile of the data.

For example, in the given data, 10 is the 3rd smallest value. Thus, X₍₃₎ = 10. This value represents the 100*3/(11 + 1) = 25th quantile of the data. The proof of the above generalization is slightly involved, as it makes use of properties of the beta distribution and the CDF of order statistics. It is presented below for the sake of completeness. If you don’t feel like going through it, you may skip over it and proceed to the discussion on theoretical quantiles.
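Before the proof, here is a quick illustration in R of how the order statistics and the quantile levels they estimate line up; the numbers below are a small made-up sample, not the dataset above:

# Quick sketch with a hypothetical sample (not the article's dataset)
x  <- c(12, 7, 3, 9, 5)      # made-up data, n = 5
n  <- length(x)
sq <- sort(x)                # order statistics X_(1), ..., X_(n)
p  <- (1:n) / (n + 1)        # X_(k) estimates the 100*k/(n + 1) quantile
print(sq)
> [1]  3  5  7  9 12
print(p)
> [1] 0.1666667 0.3333333 0.5000000 0.6666667 0.8333333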

Theorem: For independent and identically distributed data X₁, X₂, · · · , Xₙ of size n, the kth order statistic X₍ₖ₎ is an unbiased estimator of the 100*p quantile, where p is given by:

p = k/(n + 1)

In other words, if F is the CDF of Xᵢ,

E[F(X₍ₖ₎)] = k/(n + 1)

(Recall that F represents the cumulative distribution function. If the CDF evaluated at a certain value has expected value k/(n + 1), then that value is an unbiased estimator of the k/(n + 1) quantile.)

Proof: First, we find the distribution of the kth order statistic of the uniform distribution. Recall that the probability density function and the cumulative distribution function of the standard uniform distribution are given by:
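A standard way to write them, valid for all real x, is:

f_U(x) = \mathbb{I}\{0 \le x \le 1\}, \qquad F_U(x) = x\,\mathbb{I}\{0 \le x \le 1\} + \mathbb{I}\{x > 1\}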

We can now find the CDF of the kth order statistic of the uniform distribution. Let W be the number of Xᵢ that are at most x. It is evident that W follows a binomial distribution with parameters n and p = Fᵤ(x). Thus, the CDF of the kth order statistic is given as:
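Since the event {X₍ₖ₎ ≤ x} occurs exactly when at least k of the n observations are at most x (that is, when W ≥ k), one standard way to write this CDF is:

F_{X_{(k)}}(x) = P(W \ge k) = \sum_{j=k}^{n} \binom{n}{j}\, x^{j} (1-x)^{n-j}\, \mathbb{I}\{0 \le x \le 1\} \;+\; \mathbb{I}\{x > 1\}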

Note: The I used corresponds to the indicator function, which takes the value 1 when the condition inside the curly brackets is satisfied and 0 otherwise. Now, we can differentiate the above function with respect to x to obtain the probability density function of the kth order statistic of the uniform distribution:
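Differentiating the sum term by term, adjacent terms cancel, leaving:

f_{X_{(k)}}(x) = \frac{n!}{(k-1)!\,(n-k)!}\, x^{k-1} (1-x)^{n-k}, \qquad 0 \le x \le 1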

The above distribution resembles the structure of the beta distribution. Recall that the density function of Y ∼ Beta(a, b) and the corresponding expected value are given by:
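For a, b > 0:

f_Y(y) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, y^{a-1} (1-y)^{b-1}, \quad 0 \le y \le 1, \qquad \mathbb{E}[Y] = \frac{a}{a+b}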

On comparing the PDF of the kth order statistic of the uniform distribution with the PDF of the beta distribution, we conclude that:
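Matching a = k and b = n − k + 1:

X_{(k)} \sim \mathrm{Beta}(k,\, n-k+1), \qquad \mathbb{E}\big[X_{(k)}\big] = \frac{k}{n+1}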

Next, we rely on one of the most significant results in statistics, the probability integral transform: if X is a random variable with CDF Fₓ(x), then Fₓ(X) follows a standard uniform distribution. Let Y = Fₓ(X).
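Assuming Fₓ is continuous and strictly increasing (so that Fₓ⁻¹ exists), for any 0 < y < 1:

F_Y(y) = P\big(F_X(X) \le y\big) = P\big(X \le F_X^{-1}(y)\big) = F_X\big(F_X^{-1}(y)\big) = y, \qquad \text{so } Y \sim \mathrm{Uniform}(0, 1)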

Finally, we compute the distribution of Fₓ(X₍ₖ₎). We use the fact that the CDF of a random variable is always an increasing function.
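Since applying the increasing function Fₓ to the data preserves their ordering, Fₓ(X₍ₖ₎) is the kth order statistic of the uniform sample Fₓ(X₁), …, Fₓ(Xₙ), and therefore:

F_X\big(X_{(k)}\big) \sim \mathrm{Beta}(k,\, n-k+1) \quad\Longrightarrow\quad \mathbb{E}\big[F_X(X_{(k)})\big] = \frac{k}{n+1}

This completes the proof.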

Now, let’s talk about theoretical quantiles. Theoretical quantiles are the quantiles of a theoretical distribution, such as the normal distribution. Unlike sample quantiles, which are estimated from the given data, theoretical quantiles are determined by the distribution’s parameters, such as the mean and standard deviation. Their value does not depend on the sample provided, only on the distribution against which the data is being tested. For example, the 5% theoretical quantile of a standard normal distribution is approximately -1.645, which means that 5% of the area under the standard normal curve lies to the left of -1.645. In general, if we postulate that the given data has a distribution N(µ, σ²), the 100*p (for p between 0 and 1) theoretical quantile (πₚ) is given by:

πₚ = µ + σΦ⁻¹(p)

Where Φ and Φ⁻¹ refer to the CDF and the inverse CDF of the normal distribution respectively. From the above discussion, it is evident that given a dataset of n values, we can do two things:

  1. We can determine the corresponding sample quantiles for each value in the dataset. We do so by arranging the values in ascending order and concluding that the kth smallest sample is the p = k/(n + 1)th sample quantile.
  2. For each given sample quantile, we can compute the associated theoretical quantile. In particular, if we want to test that our data follows the distribution N(µ, σ² ), we can conclude that the pth theoretical quantile is πₚ = µ+σΦ⁻¹(p).

This is exactly the schema for constructing a QQ plot. Let’s try to understand this with an example.

Consider the following dataset of 9 observations:

We shall now construct a QQ plot for the given dataset to test it against the N(1, 2) distribution (mean 1, variance 2). For simplicity, we have taken only a handful of observations. However, using R or any other programming language, the procedure can be extended to any finite number of points, as we will see in the next section. To solve this problem, we follow the two steps discussed above:

A) We can determine the corresponding sample quantiles for each value in the dataset. We do so by arranging the values in ascending order and concluding that the kth smallest sample is the p = k/(n + 1)th sample quantile: First, we arrange the dataset in ascending order and assign the corresponding order statistic values:

Table 1 [Image by Author]

Notice how we have used a tabular format to represent the data. This can be quite helpful while constructing QQ plots by hand as it allows you to neatly encapsulate the necessary information. Next, we note down the quantile that each of these order statistics estimates:

Table 2 [Image by Author]

B) For each given sample quantile, we can compute the associated theoretical quantile. In particular, if we want to test that our data follows the distribution N(µ, σ² ), we can conclude that the pth theoretical quantile is πₚ = µ + σΦ⁻¹(p): Using the derived formula, we can estimate the corresponding theoretical quantiles of the normal distribution as shown below:

Table 3 [Image by Author]

The values of Φ⁻¹ can be found using a z-table or statistical software (e.g., the qnorm function in R). Thus, we have obtained both the sample quantile values (π̂ₚ = X₍ₖ₎) and the theoretical quantile values πₚ = µ + σΦ⁻¹(p). The last step would be to plot them together in a scatter plot. This gives us the following plot:

[Image by Author]

And that’s the QQ plot for the given data! Yes, it’s simply the plot of the theoretical quantiles against the sample quantiles. Notice that I’ve also added a dotted line y = x, which lets us compare the given QQ plot against the case when the theoretical quantiles and the sample quantiles are exactly the same. If the data were normally distributed, the points on the QQ plot would fall along this line; the closer the points are to the line, the more normal the data is. Conversely, if the data is not normally distributed, the points on the QQ plot will deviate from the straight line. For example, in the above plot, the sample quantile values are always lower than the corresponding theoretical quantile values, which supports the hypothesis that the data is not normally distributed. In fact, the dataset used above was actually obtained from a uniform distribution. Thus, we have been able to use the concept of a QQ plot to assess how normal the given data is.
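For reference, the theoretical quantile column in Table 3 can be reproduced in R; a minimal sketch, assuming the nine observations of Table 1 and the N(1, 2) target (so µ = 1 and σ = √2):

# Sketch: theoretical quantiles of N(1, 2) at the levels k/(n + 1)
n  <- 9                        # number of observations in the example
p  <- (1:n) / (n + 1)          # sample-quantile levels
tq <- 1 + sqrt(2) * qnorm(p)   # pi_p = mu + sigma * qnorm(p), with sigma = sqrt(2)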

Now, let’s look at an example where the data is actually normally distributed. Consider a dataset of size 14 this time:

The first step would be to draw a table by arranging the values in ascending order and computing the corresponding order statistics as well as the quantiles they represent. Subsequently, we fill in the theoretical quantile values. The completed table is shown as follows:

Table 4 [Image by Author]

Plotting the above points on a graph as before yields the following QQ plot:

[Image by Author]

As you can see, the points tend to fall well along the line, indicating that the distribution of the data is quite similar to that of the normal distribution with a mean of 1 and a variance of 2. Note that QQ plots may not be very helpful if the dataset has very few data points. However, if the number of samples is large enough, QQ plots serve as a very powerful tool for testing the distribution of your dataset: they’ll be your hitchhiker’s guide to the galaxies of probability distributions!
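As an aside, base R can also draw a normal QQ plot directly with qqnorm() and qqline(); a minimal sketch with a simulated sample is shown below (note that qqnorm() uses standard normal theoretical quantiles, and qqline() by default draws a line through the first and third quartiles rather than y = x). In what follows, however, we build the plot step by step to keep the construction explicit.

# Quick alternative using base R's built-in helpers (simulated data for illustration)
x <- rnorm(100, mean = 1, sd = sqrt(2))  # hypothetical sample
qqnorm(x)                                # sample vs. standard normal quantiles
qqline(x)                                # reference line through the quartiles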

We’ll be using the ToothGrowth dataset available in base R, which contains information on the effect of Vitamin C on tooth growth in guinea pigs. In particular, we will look at the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs and test if the distribution of the tooth length is normal. We can load the dataset as follows:

# Load Packages
library(tidyverse)

# Load Dataset
data("ToothGrowth")
x = ToothGrowth$len
n = length(x)

# Length:
print(n)
> [1] 60
# First few observations
print(x[1:10])
> [1] 4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0

We are interested in testing if the dataset is normally distributed with parameters µ and σ as follows:

# Mean of target distribution
mu <- mean(x)
print(mu)
> [1] 18.81333

# Variance of target distribution
sigma <- sd(x)
print(sigma)
> [1] 7.649315

Step 1: Calculate Sample Quantiles

For this, we just need to arrange the data in ascending order. The corresponding quantile values will range from 1/(n + 1) to n/(n + 1), as calculated below:

# Sample Quantiles: Sort Sample Provided
sq = sort(x)
print(sq[1:10])
> [1] 4.2 5.2 5.8 6.4 7.0 7.3 8.2 9.4 9.7 9.7

# Corresponding Quantile Values: k/(n + 1), k = 1, 2, ..., n
p = seq(1/(n + 1), n/(n + 1), by = 1/(n + 1))
print(p[1:5])
> [1] 0.01639344 0.03278689 0.04918033 0.06557377 0.08196721

Step 2: Calculate Theoretical Quantiles

Using the formula πₚ = µ + σΦ⁻¹(p), we calculate the theoretical quantiles mapped to each of the sample quantiles. Note: The qnorm() function in R allows us to calculate the inverse CDF of the normal distribution:

# Theoretical Quantiles: Use Derived Formula
tq = mu + sigma*qnorm(p)
print(tq[1:5])
> [1] 2.484468 4.728449 6.170135 7.265988 8.165790

Step 3: Plot the Theoretical Quantiles against the Sample Quantiles and Compare against y = x

data.frame(tq, sq) %>%
ggplot(aes(x = tq, y = sq)) +
# Plot original points
geom_point(size = 2.5, color = "blue") +
# To plot the y = x line
geom_abline(slope = 1, intercept = 0, linewidth = 0.8, linetype = 2) +
# Plot styling [Optional]
theme_minimal() +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles", title = "QQ Plot") +
theme(plot.title = element_text(hjust = 0.5))

This yields the following QQ plot:

[Image by Author]

As you can see, the points tend to fall well along the line, indicating that the distribution of the data is quite similar to that of the normal distribution with a mean of 18.81333 and a variance of 7.649315² = 58.51202. Finally, to illustrate the immense usefulness of QQ plots, let’s look at another example, where we try to test whether the data is exponentially distributed. In particular, we’ll sample data from the uniform distribution and use a QQ plot to check whether it is exponentially distributed. We can generate the sample as follows:

# Load Sample
x = runif(10000)
n = length(x)

# First few observations
print(x[1:5])
> [1] 0.07251043 0.20345894 0.20417683 0.48878998 0.50945799

We are interested in testing if the dataset is exponentially distributed with the rate parameter equal to the reciprocal of the mean (the maximum likelihood estimator for the rate parameter).
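For completeness, maximizing the exponential likelihood over the rate parameter gives this estimator:

\hat{\lambda}_{\mathrm{MLE}} = \arg\max_{\lambda > 0} \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}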

Step 1: Calculate Sample Quantiles

For this, we just need to arrange the data in ascending order. The corresponding quantile values will range from 1/(n + 1) to n/(n + 1), as calculated below:

# Sample Quantiles: Sort Sample Provided
sq = sort(x)
print(sq[1:5])
> [1] 6.094808e-05 9.398628e-05 3.439812e-04 3.590921e-04 4.317588e-04

# Corresponding Quantile Values: k/(n + 1), k = 1, 2, ..., n
p = seq(1/(n + 1), n/(n + 1), by = 1/(n + 1))
print(p[1:5])
> [1] 0.00009999 0.00019998 0.00029997 0.00039996 0.00049995

Step 2: Calculate Theoretical Quantiles

This time, we use the formula πₚ = F⁻¹(p; λ = 1/mean(X)) to calculate the theoretical quantiles mapped to each of the sample quantiles, where F⁻¹(p; λ) is the inverse CDF of the exponential distribution with rate λ. Note: The qexp() function in R allows us to calculate the inverse CDF of the exponential distribution:

# Theoretical Quantiles: Use Derived Formula
tq = qexp(p, rate = 1/mean(x))
print(tq[1:5])
> [1] 5.010184e-05 1.002087e-04 1.503205e-04 2.004374e-04 2.505593e-04

Step 3: Plot the Theoretical Quantiles against the Sample Quantiles and Compare against y = x

data.frame(tq, sq) %>%
ggplot(aes(x = tq, y = sq)) +
# Plot original points
geom_point(size = 2.5, color = "blue") +
# To plot the y = x line
geom_abline(slope = 1, intercept = 0, linewidth = 0.8, linetype = 2) +
# Plot styling [Optional]
theme_minimal() +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles", title = "QQ Plot") +
theme(plot.title = element_text(hjust = 0.5))

This yields the following QQ plot:

[Image by Author]

As you can see, the points tend to fall far from the line, indicating that the distribution of the data is not similar to that of the exponential distribution. In fact, we can draw a QQ plot of the sample against the Uniform distribution as follows to show that the sample indeed fits a standard uniform distribution:

# Theoretical Quantiles
tq = qunif(p)
print(tq[1:5])

data.frame(tq, sq) %>%
ggplot(aes(x = tq, y = sq)) +
# Plot original points
geom_point(size = 2.5, color = "blue") +
# To plot the y = x line
geom_abline(slope = 1, intercept = 0, linewidth = 0.8, linetype = 2) +
# Plot styling [Optional]
theme_minimal() +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles", title = "QQ Plot") +
theme(plot.title = element_text(hjust = 0.5))

This yields the following QQ plot, a perfect match indeed!

[Image by Author]

In summary, our examination of QQ plots has revealed their crucial role in statistical analysis. By comparing the distribution of our data to a theoretical distribution, QQ plots enable us to gain deeper insights into the behavior of our data and identify any departures from the expected distribution. Furthermore, the versatility of QQ plots allows us to apply them in various scenarios, such as identifying outliers, comparing datasets, and model diagnostics. These applications demonstrate the broad utility of QQ plots and underscore their importance in the realm of statistical analysis. QQ plots provide a powerful and easy-to-understand approach for exploring the distribution of data, and their significance in statistical literature is a testament to their value. By leveraging QQ plots in our analyses, we can unlock new insights into our data and make informed decisions based on a deeper understanding of its behavior.

In case you have any doubts or suggestions, do reply in the comment box. Please feel free to contact me via mail.

If you liked my article and want to read more of them, visit this link.

Note: All the images containing tables, plots, and equations have been made by the author.

