How to Find the Best Theoretical Distribution for Your Data. | by Erdogan Taskesen | Feb, 2023


Image by Author.

Knowing the underlying (probability) distribution of your data has many modeling advantages. The easiest way to determine the underlying distribution is to visually inspect the random variable(s) using a histogram. With a candidate distribution, a Pareto plot, a probability/cumulative density function plot (PDF/CDF), and a QQ plot can be created for a better comparison. However, to determine the exact distribution parameters (e.g., loc, scale), it is essential to use quantitative methods. In this blog, I will describe why it is important to determine the underlying probability distribution for your data set, how to determine the best fit using visual inspection and quantitative methods, and what the differences are between parametric and non-parametric distributions. Analyses are performed using the distfit library, and an accompanying notebook is provided for easy access and experimentation.

Before you continue reading, have a look at the definitions and terminology below. I will be using them a lot in this blog.

“Random variables are variables whose value is unknown or a function that assigns values to each of an experiment’s outcomes. A random variable can be either discrete (having specific values) or continuous (any value in a continuous range)” [1]. It can be a single column in your data set for a specific feature, such as human height. It can also be your entire data set that is measured with a sensor and contains thousands of features.

Probability density function (PDF) is a statistical expression that defines a probability distribution (the likelihood of an outcome) for a continuous random variable [1, 2]. The normal distribution is a common example of a PDF (the well-known bell-shaped curve). The term PDF is sometimes also described as “distribution function” or “probability function”.

Theoretical distribution is a form of a PDF. Examples of theoretical distributions are the Normal, Binomial, Exponential, and Poisson distributions [3]. The distfit library contains 89 theoretical distributions.

An Empirical distribution (or data distribution) is a frequency-based distribution of the observed random variables (the input data) [4]. A histogram is commonly used to visualize the empirical distribution.

The probability density function is a fundamental concept in statistics. Briefly, for a given random variable X, we aim to specify the function f that gives a natural description of the distribution of X. Although there is a lot of great material that describes these concepts [1], it can remain challenging to understand why it is important to know the underlying data distribution for your data set. Let me try to explain the importance with a small analogy. Suppose you need to go from location A to B; which type of car would you prefer? The answer is straightforward. You would likely start by exploring the terrain. With that information, you can then select the best-suited car (sports car, four-wheel drive, etc.). Logically, a sports car is better suited for smooth, flat terrain, while a four-wheel drive is better suited for rough, hilly terrain. In other words, without an exploratory analysis of the terrain, it is hard to select the best possible car. However, such an exploratory step is easily forgotten or neglected in data modeling.

Having knowledge about the underlying data distribution is an important first step before making modeling decisions.

When it comes to data, it is important to explore its fundamental characteristics too, such as skewness, kurtosis, outliers, and the distribution shape (e.g., unimodal or bimodal). Based on these characteristics, it is easier to decide which models are best to use, because most models have prerequisites for the data. As an example, a well-known and popular technique is Principal Component Analysis (PCA). This method computes the covariance matrix and requires the data to be multivariate normal for the PCA to be valid. In addition, PCA is known to be sensitive to outliers. Thus, before doing a PCA step, you need to know whether your data needs a (log) transformation or normalization, or whether outliers need to be removed. More details about PCA can be found here [5]. The distfit library is developed for the first task: to determine the function f that gives a natural description of the empirical data distribution.
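As a small, hedged illustration of that preprocessing step (not part of distfit; it assumes scikit-learn is installed and uses hypothetical right-skewed data), one could log-transform and standardize the data before running a PCA:

# Minimal sketch (assumption: scikit-learn is available) of preprocessing
# right-skewed data before PCA: log-transform, standardize, then fit the PCA.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical right-skewed data (e.g., lognormally distributed features)
X = np.random.lognormal(mean=0, sigma=1, size=(1000, 5))

# Log-transform to reduce skewness, then standardize to zero mean and unit variance
X_std = StandardScaler().fit_transform(np.log(X))

# Fit the PCA on the preprocessed data
pca = PCA(n_components=2).fit(X_std)
print(pca.explained_variance_ratio_)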

The histogram is a well-known plot in data analysis: a graphical representation of the distribution of the dataset. The histogram summarizes the number of observations that fall within each bin. With libraries such as matplotlib's hist(), it is straightforward to make a visual inspection of the data. Varying the number of bins helps to identify whether the density resembles a common probability distribution by the shape of the histogram. An inspection also gives hints about whether the data is symmetric or skewed, and whether it has multiple peaks or outliers. In most cases, you will observe a distribution shape as depicted in Figure 1.

  • The bell shape of the Normal distribution.
  • The descending or ascending shape of an Exponential or Pareto distribution.
  • The flat shape of the Uniform distribution.
  • The complex shape that does not fit any of the theoretical distributions (e.g., multiple peaks).
Figure 1. The common distribution types. Figures are created using the distfit library and can be further explored in the Colab notebook (link at the bottom). (image by the author).

If you find a distribution with multiple peaks (bimodal or multimodal), the peaks should not disappear for different numbers of bins. Bimodal distributions usually hint at mixed populations. In addition, large spikes in density at a single value or a small range of values may point toward possible outliers. Outliers often present themselves in the tail of a distribution, far away from the rest of the density.
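A quick way to carry out this visual inspection, and to check whether peaks persist across different bin counts, is a plain histogram plot. Below is a minimal sketch (hypothetical data; numpy and matplotlib assumed):

# Minimal sketch: inspect the empirical distribution with different numbers of bins.
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical input data
X = np.random.normal(2, 4, 10000)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, bins in zip(axes, [10, 50, 200]):
    ax.hist(X, bins=bins, density=True)
    ax.set_title('bins=%d' % bins)
plt.show()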

A histogram is a great way to inspect a relatively small number of samples (random variables, or data points). However, when the number of samples increases, or when more than two histograms are plotted, the visuals become cluttered and a visual comparison with a theoretical distribution is difficult to judge. Instead, a Cumulative Distribution Function (CDF) plot or a Quantile-Quantile (QQ) plot can be more insightful. But these plots require one or more candidate theoretical distributions that best match (or fit) the empirical data distribution. So let’s determine the best theoretical distribution in the next section!

A PDF fit for an empirical data distribution can be discovered in four steps:

  1. Compute density and weights from a histogram. The first step is to flatten the data into an array and create the histogram by grouping observations into bins and counting the number of events in each bin. The choice of the number of bins is important as it controls the coarseness of the distribution. Experimenting with different bin sizes can provide multiple perspectives on the same data. In distfit, the bin width can be manually defined or mathematically determined on the observations themselves. The latter option is the default.
  2. Estimate the distribution parameters from the data. In a parametric approach, the next step is to estimate the shape, location, and scale parameters based on the (selected) theoretical distribution(s). This typically involves methods such as maximum likelihood estimation (MLE) to determine the values of the parameters that best fit the data. For example, if the normal distribution is chosen, the MLE method will estimate the mean and standard deviation of the data.
  3. Check the goodness-of-fit. Once the parameters have been estimated, the fit of the theoretical distribution is evaluated. This can be done using a goodness-of-fit test. Popular goodness-of-fit measures are the Residual Sum of Squares (RSS, also named SSE), the Wasserstein distance, the Kolmogorov-Smirnov statistic, and the Energy statistic (all available in distfit).
  4. Selection of best theoretical distribution. At this point, theoretical distributions are tested and scored using the goodness-of-fit test statistic. The scores can now be sorted and the theoretical distribution with the best score can be selected.

As a final step, the model can be validated using methods such as cross-validation, bootstrapping, or a holdout dataset. It is essential to check if the model is generalizing well and also to check if the assumptions such as independence and normality are met. Once the theoretical distribution has been fitted and validated, it can be used in many applications (keep on reading in the section below).
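To make these four steps concrete, the sketch below mimics them directly with scipy (a simplified illustration, not the exact distfit internals): build the empirical density from a histogram, estimate the parameters of a few candidate distributions with MLE, score each fit with the residual sum of squares, and select the best one.

# Simplified sketch of the four fitting steps using scipy (not the exact distfit internals).
import numpy as np
from scipy import stats

# Hypothetical input data
X = np.random.normal(2, 4, 10000)

# Step 1: empirical density from a histogram
density, edges = np.histogram(X, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Steps 2-3: estimate parameters (MLE) and score each candidate with the RSS
candidates = [stats.norm, stats.expon, stats.uniform]
scores = {}
for dist in candidates:
    params = dist.fit(X)                               # loc/scale (and shape) via MLE
    pdf = dist.pdf(centers, *params)
    scores[dist.name] = np.sum((density - pdf) ** 2)   # residual sum of squares

# Step 4: select the distribution with the lowest RSS
best = min(scores, key=scores.get)
print(scores, best)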

Having knowledge about distribution fitting has great benefits when working in the field of data science: not only to better understand, explore, and prepare the data, but also to build fast and lightweight solutions.

Distfit is a Python package for probability density fitting of univariate distributions of random variables. It can find the best fit for parametric, non-parametric, and discrete distributions. In addition, it provides visual insights for better decision-making using various plots. The most important functionalities are:

  • Finds the best fit for parametric, non-parametric, and discrete distributions.
  • Prediction of outliers/novelties for (new) unseen samples.
  • Generates synthetic data based on the fitted distribution.
  • Plots: histograms, Probability Density Function (PDF) plots, Cumulative Distribution Function (CDF) plots, Pareto plots, Quantile-Quantile (QQ) plots, probability plots, and summary plots.
  • Saving and loading models.

Installation is straightforward and can be done from PyPi.

pip install distfit

With the distfit library it is easy to find the best theoretical distribution with only a few lines of code.

With parametric fitting, we make assumptions about the parameters of the population distribution based on the input data. In other words, the shape of the histogram should match one of the known theoretical distributions. The advantage of parametric fitting is that it is computationally efficient and the results are easy to interpret. The disadvantage is that it can be sensitive to outliers when the number of samples is low. The distfit library can determine the best fit across 89 theoretical distributions, which are taken from the scipy library. To score the fit, there are four goodness-of-fit statistics: Residual Sum of Squares (RSS or SSE), Wasserstein, Kolmogorov-Smirnov (KS), and Energy. For each fitted theoretical distribution, the loc, scale, and arg parameters are returned, such as the mean and standard deviation for the normal distribution.

Finding the best matching theoretical distribution for your data set requires a goodness-of-fit statistical test.

In the following example, we will generate data from the normal distribution with mean=2 and standard deviation=4. We will use distfit to estimate these two parameters from the data itself. If you already know the family of distributions (e.g., bell-shape), you can specify a subset of distributions. The default is a subset of common distributions (as depicted in Figure 1). Note that due to the stochastic component, results can differ slightly from what I am showing when repeating the experiment.

# Import libraries
import numpy as np
from distfit import distfit

# Create random normal data with mean=2 and std=4
X = np.random.normal(2, 4, 10000)

# Initialize using the parametric approach.
dfit = distfit(method='parametric', todf=True)

# Alternatively limit the search for only a few theoretical distributions.
dfit = distfit(method='parametric', todf=True, distr=['norm', 'expon'])
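
# Note: the default goodness-of-fit statistic is RSS. Other statistics can
# presumably be selected at initialization (an assumption, based on the 'stats'
# field in the model output below), e.g.:
# dfit = distfit(method='parametric', stats='wasserstein')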

# Fit model on input data X.
dfit.fit_transform(X)

# Print the best model results.
dfit.model
# 'stats': 'RSS',
# 'name': 'loggamma',
# 'params': (761.2276, -725.194369, 109.61),
# 'score': 0.0004758991728293508,
# 'loc': -725.1943699246272,
# 'scale': 109.61710960155318,
# 'arg': (761.227612981012,),
# 'CII_min_alpha': -4.542099829801259,
# 'CII_max_alpha': 8.531658573964933
# 'distr': <scipy.stats._continuous_distns.loggamma_gen>,
# 'model': <scipy.stats._distn_infrastructure.rv_continuous_frozen>,

The best fit that is detected (i.e., with the lowest RSS score) is the loggamma distribution. The results of the best fit are stored in dfit.model, but we can also inspect the fit of all other PDFs with dfit.summary (see the code section below) and create a plot (Figure 2).

# Print the scores of the distributions:
dfit.summary[['distr', 'score', 'loc', 'scale']]
# distr score loc scale
#0 loggamma 0.000476 -725.19437 109.61711
#1 t 0.00048 2.036554 3.970414
#2 norm 0.00048 2.036483 3.970444
#3 beta 0.000481 -72.505842 133.797587
#4 gamma 0.000498 -304.071325 0.051542
#5 lognorm 0.000507 -325.188197 327.201051
#6 genextreme 0.001368 0.508856 3.947172
#7 dweibull 0.005371 2.102396 3.386271
#8 uniform 0.079545 -12.783659 30.766669
#9 expon 0.108689 -12.783659 14.820142
#10 pareto 0.108689 -1073741836.783659 1073741824.0

# Plot the top fitted distributions.
dfit.plot_summary()

Figure 2. Fitted distributions, sorted by the goodness-of-fit test score. (image by the author)

But why does the normal distribution not have the lowest Residual Sum of Squares, even though we generated random normal data?

Well, first of all, our input data set is always a finite sample that is bound within a (narrow) range, whereas the theoretical (normal) distribution goes to infinity in both directions. Secondly, all statistical analyses are based on models, and all models are merely simplifications of the real world. In other words, to approximate the theoretical distribution we rely on statistical tests, each with its own (dis)advantages. Finally, some distributions have a very flexible character, for which the (log)gamma is a clear example: for a large shape parameter k, the gamma distribution converges to a normal distribution [7].
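To illustrate that flexibility, the small sketch below (using scipy; a quick check rather than a formal proof) compares the PDF of a gamma distribution with a large shape parameter k to the matching normal distribution:

# Quick illustration: a gamma distribution with a large shape parameter k
# is nearly indistinguishable from a normal distribution with mean=k, std=sqrt(k).
import numpy as np
from scipy import stats

k = 100
x = np.linspace(stats.gamma.ppf(0.001, k), stats.gamma.ppf(0.999, k), 200)

gamma_pdf = stats.gamma.pdf(x, k)                        # gamma with shape k, scale=1
normal_pdf = stats.norm.pdf(x, loc=k, scale=np.sqrt(k))  # matching normal

print(np.max(np.abs(gamma_pdf - normal_pdf)))            # very small maximum difference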

The result is that the top 7 distributions have similar, low RSS scores, among them the normal distribution. We can see in the summary statistics that the estimated parameters for the normal distribution are loc=2.036 and scale=3.97, which is very close to our generated sample population (mean=2, std=4). All things considered, a very good result.

Choosing the best model is not a statistical question; it is a modeling decision.

It is good to realize that the statistical tests only help you to look in the right direction, and that choosing the best model is not a statistical question; it is a modeling decision [8]. Think about this: the loggamma distribution is heavily right-tailed, while the normal distribution is symmetric (both tails are similar). This can make a huge difference when using confidence intervals and predicting outliers in the tails. Choose your distribution wisely so that it matches the application.

A best practice is to use both the statistics and a visual inspection to decide on the best distribution fit. A Pareto plot, the PDF/CDF, and QQ plots are some of the best tools to guide those decisions. As an example, Figure 2 illustrates the goodness-of-fit test statistics, for which the first 7 PDFs have very similar and low RSS scores. The dweibull distribution is ranked number 8, also with a low RSS score. However, a visual inspection will show that, despite the relatively low RSS score, it is not a good fit after all.

Let’s start by plotting the empirical data using a histogram and the PDF. Note that such a plot is also called a Pareto plot. These plots help to visually judge whether a distribution is a good fit. Figure 3 shows the PDF with the confidence intervals (left) and the CDF plot (right). The confidence intervals are set to 95% (CII) by default but can be changed with the alpha parameter during initialization. The plot functionality automatically shows the histogram as bars and as a line, the PDF/CDF, and the confidence intervals. All these properties can be manually customized (see the code section below).

# Import matplotlib and create the subplots
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(25, 10))
# Pareto plot (PDF with histogram)
dfit.plot(chart='PDF', ax=ax[0])
# Plot the CDF
dfit.plot(chart='CDF', ax=ax[1])

# Change or remove properties of the chart.
dfit.plot(chart='PDF',
          emp_properties=None,
          bar_properties=None,
          pdf_properties={'color': 'r'},
          cii_properties={'color': 'g'})

Figure 3. Both panels show the empirical data distribution with the confidence intervals. Left panel: Pareto plot containing the PDF and histogram. Right panel: CDF. (image by the author)

We can also plot all other estimated theoretical distributions with the n_top parameter. A visual inspection confirms that many distributions fit the empirical data closely, with a few exceptions. The distributions are ranked from best fit (highest) to worst fit (lowest). Here we can see that the dweibull distribution has a very poor fit, with two peaks in the middle. Using only the RSS score, it would have been difficult to judge whether or not to use this distribution. The poor RSS scores of the uniform, expon, and pareto distributions can now also be confirmed with the plot.

# Create subplot
fig, ax = plt.subplots(1,2, figsize=(25, 10))
# Pareto plot (PDF with histogram)
dfit.plot(chart='PDF', n_top=11, ax=ax[0])
# Plot the CDF
dfit.plot(chart='CDF', n_top=11, ax=ax[1])
Figure 4. left: PDF, and right: CDF. All fitted theoretical distributions are shown in different colors. (image by the author)

Quantile-Quantile plot.

There is one more plot that we can inspect: the QQ plot. The QQ plot compares the empirical probability distribution with the theoretical probability distribution by plotting their quantiles against each other. If the two distributions are equal, the points on the QQ plot lie perfectly on the straight line y = x. We can create the QQ plot using the qqplot function (Figure 5). The left panel shows the best fit, and the right panel includes all fitted theoretical distributions. More details on how to interpret a QQ plot can be found in this blog [6].

# Create subplot
fig, ax = plt.subplots(1,2, figsize=(25, 10))
# Plot left panel with best fitting distribution.
dfit.qqplot(X, ax=ax[0])
# plot right panel with all fitted theoretical distributions
dfit.qqplot(X, n_top=11, ax=ax[1])
Figure 5. Left panel: Comparison between empirical distribution vs. best theoretical distribution. Right panel: Comparison between empirical distribution vs. all other theoretical distributions. (image by the author)

Non-parametric density estimation is used when the population sample is “distribution-free”, meaning that the data does not resemble a common theoretical distribution. In distfit, two methods are implemented for non-parametric density fitting: the quantile and percentile methods. Both methods assume that the data does not follow a specific probability distribution. With the quantile method, the quantiles of the data are modeled, which can be useful for data with skewed distributions. With the percentile method, the percentiles are modeled, which can be useful when the data contains multiple peaks. Both methods have the advantage of being robust to outliers and making no assumptions about the underlying distribution. In the code section below, we initialize with method='quantile' or method='percentile'. All functionalities, such as predicting and plotting, can be used in the same manner as shown in the previous code sections.

# Load libraries
import numpy as np
from distfit import distfit

# Create random normal data with mean=2 and std=4
X = np.random.normal(2, 4, 10000)

# Initialize using the quantile (or, alternatively, the percentile) approach.
dfit = distfit(method='quantile')
# dfit = distfit(method='percentile')

# Fit model on input data X and detect the best theoretical distribution.
dfit.fit_transform(X)
# Plot the results
fig, ax = dfit.plot()
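
As with the parametric approach, the fitted non-parametric model should also be usable for predictions on new, unseen samples; a small sketch, assuming predict behaves as in the anomaly-detection example further below:

# Sketch (assumption): predictions on new samples with the non-parametric model,
# analogous to the parametric model shown in the anomaly-detection example below.
y = [-8, -2, 1, 3, 5, 15]
results = dfit.predict(y)
print(results['y_pred'])   # expected labels such as 'down', 'none', or 'up'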

If the random variables are discrete, the distfit library contains the option for discrete fitting. The best fit is derived using the binomial distribution. The question can be summarized as follows: given a list of nonnegative integers, can we fit a probability distribution for a discrete distribution and compare the quality of the fit? For discrete quantities, the correct term is Probability Mass Function (PMF). The PMF for one list of integers is of the form P(k) and is fitted to the binomial distribution with suitable values for n and p; this method is implemented in distfit. See the code section below, where a discrete dataset is created with n=8 and p=0.5. The random variables are given as input to distfit, which detects the parameters n=8 and p=0.501366, indicating a very good fit.

# Load library
from scipy.stats import binom
from distfit import distfit

# Parameters for the test-case:
n = 8
p = 0.5
# Generate 10000 random discrete data points from the binomial distribution with parameters (n, p)
X = binom(n, p).rvs(10000)
# Initialize using the discrete approach.
dfit = distfit(method='discrete')
# Find the best fit.
dfit.fit_transform(X)

# print results
print(dfit.model)
# 'name': 'binom',
# 'score': 0.00010939074999009602,
# 'chi2r': 1.5627249998585145e-05,
# 'n': 8,
# 'p': 0.501366,
# 'CII_min_alpha': 2.0,
# 'CII_max_alpha': 6.0}
# 'distr': <scipy.stats._discrete_distns.binom_gen at 0x14350be2230>,
# 'model': <scipy.stats._distn_infrastructure.rv_discrete_frozen at 0x14397a2b640>,

# Make predictions
results = dfit.predict([0, 2, 8])

Plot the results with the plot functionality.

# Plot the results
dfit.plot()

# Change colors or remove parts of the figure.
# Remove the empirical distribution
dfit.plot(emp_properties=None)
# Remove the PDF
dfit.plot(pdf_properties=None)
# Remove the histogram bars
dfit.plot(bar_properties=None)
# Remove the confidence intervals
dfit.plot(cii_properties=None)

Figure 7. The top figure shows the input data (black dots) and the fitted distribution (blue line). The detected parameters (n=8 and p=0.501366) closely match those used to generate the random variables (n=8, p=0.5). The red vertical bars are the confidence intervals, which are set to alpha=0.05 by default. The bottom figure shows the RSS score over the scanned values of n. The best fit is detected at the lowest RSS. (image by the author)

Knowing the underlying distribution in your data set is key in many applications. I will summarize a few.

  • Anomaly or novelty detection is a clear application of density estimation. This can be achieved by calculating the confidence intervals given the fitted distribution and its parameters. Anomaly detection is applicable in a wide range of engineering situations where a clear, early warning of an abnormal condition is required. An anomaly or novelty is an observation that likely lies in a low-density region. The distfit library computes the confidence intervals, together with the probability of a sample being an outlier/novelty given the fitted distribution. Be aware that significance is corrected for multiple testing; outliers can thus be located outside the confidence interval but not marked as significant.
# Import libraries
import numpy as np
from distfit import distfit

# Create random normal data with mean=2 and std=4
X = np.random.normal(2, 4, 10000)

# Initialize using the parametric approach (default).
dfit = distfit(multtest='fdr_bh', alpha=0.05)

# Fit model on input data X.
dfit.fit_transform(X)
# With the fitted model we can make predictions on new unseen data.
y = [-8, -2, 1, 3, 5, 15]
dfit.predict(y)

# Print results
print(dfit.results['df'])
# y y_proba y_pred P
# 0 -8.0 0.017455 down 0.005818
# 1 -2.0 0.312256 none 0.156128
# 2 1.0 0.402486 none 0.399081
# 3 3.0 0.402486 none 0.402486
# 4 5.0 0.340335 none 0.226890
# 5 15.0 0.003417 up 0.000569

# Plot the results
dfit.plot()

  • Synthetic data generation: Probability distribution fitting can be used to generate synthetic data that is similar to real-world data. By fitting a probability distribution to real-world data, it is possible to generate synthetic data that can be used to test hypotheses and evaluate the performance of algorithms. In the code section below, we first generate random variables from a normal distribution, estimate the distribution parameters, and then create synthetic data using the fitted distribution.
# Import libraries
import numpy as np
from distfit import distfit

# Create random normal data with mean=2 and std=4
X = np.random.normal(2, 4, 10000)

# Initialize using the parametric approach (default).
dfit = distfit()

# Fit model on input data X.
dfit.fit_transform(X)

# The fitted distribution can now be used to generate new samples.
X_synthetic = dfit.generate(n=1000)

  • Optimization and compression: Probability distribution fitting can be used to optimize various parameters of a probability distribution, such as the mean and variance, to best fit the data. Finding the best parameters can help to better understand the data. In addition, if hundreds of thousands of observations can be described with only the loc, scale, and arg parameters, it is a very strong compression of the data.
  • An informal investigation of the properties of the input dataset is a very natural use of density estimates. Density estimates can give valuable indications of skewness and multimodality in the data. In some cases, they will yield conclusions that may then be regarded as self-evidently true, while in others, they will point the way to further analysis and data collection.
  • Testing hypotheses: Probability distribution fitting can be used to test hypotheses about the underlying probability distribution of a data set. For example, one can use a goodness-of-fit test to compare the data to a normal distribution, or a chi-squared test to compare the data to a Poisson distribution (see the sketch after this list).
  • Modeling: Probability distribution fitting can be used to model complex systems such as weather patterns, stock market trends, biology, population dynamics, and predictive maintenance. By fitting a probability distribution to historical data, it is possible to extract valuable insights and create a model that can be used to make predictions about future behavior.
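
As a small illustration of such a hypothesis test (a sketch using scipy directly rather than distfit), a Kolmogorov-Smirnov test can compare the data against a fitted normal distribution; note that, strictly speaking, estimating the parameters from the same data makes the test somewhat optimistic.

# Sketch: Kolmogorov-Smirnov test of the data against a fitted normal distribution (scipy).
import numpy as np
from scipy import stats

# Hypothetical input data
X = np.random.normal(2, 4, 10000)

loc, scale = stats.norm.fit(X)                   # MLE estimates of mean and std
statistic, pvalue = stats.kstest(X, 'norm', args=(loc, scale))
print(statistic, pvalue)                         # a large p-value gives no evidence against normality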

I touched on the concepts of probability density fitting for parametric, non-parametric, and discrete random variables. With the distfit library, it is straightforward to detect the best theoretical distribution among 89 theoretical distributions. It pipelines the process of estimating the density from a histogram, estimating the distribution parameters, testing the goodness of fit, and returning the parameters of the best-fitted distribution. The best fit can be explored with various plot functionalities, such as histograms, Pareto plots, CDF/PDF plots, and QQ plots. All plots can be customized and easily combined. In addition, predictions can be made on new, unseen samples. Another functionality is the creation of synthetic data using the fitted model parameters.

Knowledge of distribution fitting has great benefits when working in the field of data science: not only to better understand, explore, and prepare the data, but also to build fast and lightweight solutions. It is good to realize that the statistical tests only help you to look in the right direction, and that choosing the best model is not a statistical question; it is a modeling decision. Choose your model wisely.

Be Safe. Stay Frosty.

Cheers E.

If you found this article helpful, help support my content by signing up for a Medium membership using my referral link or follow me to access similar blogs.

Let’s connect!

  1. W. Kenton, The Basics of Probability Density Function (PDF), With an Example, 2022, Investopedia.
  2. Probability Density Function, Wikipedia.
  3. List of Probability Distributions, Wikipedia.
  4. Empirical Distribution Function, Wikipedia.
  5. E. Taskesen, What are PCA loadings and how to effectively use Biplots?, 2022, Medium.
  6. P. Varshney, Q-Q Plots Explained, 2020, Medium.
  7. Gamma Distribution, Wikipedia.
  8. A. Downey, Are your data normal? Hint: no, 2018, Blog.

