
Outlier Detection Using Distribution Fitting in Univariate Datasets

By Erdogan Taskesen, February 2023



Photo by Randy Fath on Unsplash

Anomaly or novelty detection is applicable in a wide range of situations where a clear, early warning of an abnormal condition is required, such as sensor data, security operations, and fraud detection, among others. Due to the nature of the problem, outliers present themselves infrequently, and due to the lack of labels, it can be difficult to create supervised models. Outliers are also called anomalies or novelties, but there are some fundamental differences in the underlying assumptions and the modeling process. Here I will discuss the fundamental differences between anomalies and novelties and the concepts of outlier detection. With a hands-on example, I will demonstrate how to create an unsupervised model for the detection of anomalies and novelties using probability density fitting for univariate data sets. The distfit library is used across all examples.

Anomalies and novelties are both observations that deviate from what is standard, normal, or expected. The collective name for such observations is outliers. In general, outliers present themselves in the (relative) tail of a distribution, far away from the rest of the density. In addition, large spikes in density for a given value or a small range of values may point toward possible outliers. Although the aim of anomaly and novelty detection is the same, there are some conceptual modeling differences [1], briefly summarized as follows:

Anomalies are outliers that are known to be present in the training data and deviate from what is normal or expected. In such cases, we should aim to fit a model on the observations that have the expected/normal behavior (also named inliers) and ignore the deviant observations. The observations that fall outside the expected/normal behavior are the outliers.

Novelties are outliers that are not known to be present in the training data. The data does not contain observations that deviate from what is normal/expected. Novelty detection can be more challenging as there is no reference of an outlier. Domain knowledge is more important in such cases to prevent model overfitting on the inliers.

I just pointed out that the difference between anomalies and novelties lies in the modeling process. But there is more to it. Before we can start modeling, we need to set some expectations about what an outlier should look like. There are roughly three types of outliers (Figure 1), summarized as follows:

  • Global outliers (also named point outliers) are single, independent observations that deviate from all other observations [1, 2]. When someone speaks about “outliers”, they usually mean global outliers.
  • Contextual outliers occur when a particular observation doesn’t fit in a specific context. A context can present itself in a bimodal or multimodal distribution, and an outlier deviates within the context. For instance, temperatures below 0 are normal in winter but are unusual in the summer and are then called outliers. Besides time series and seasonal data, other known applications are in sensor data [3] and security operations [4].
  • Collective outliers (or group outliers) are a group of similar/related instances with unusual behavior compared to the rest of the data set [5]. The group of outliers can form a bimodal or multimodal distribution because they often indicate a different type of problem than individual outliers, such as a batch processing error or a systemic problem in the data generation process. Note that the detection of collective outliers typically requires a different approach than detecting individual outliers. A small data sketch of all three types is shown after Figure 1.
Figure 1. From left to right an example of global, contextual, and collective outliers. Image by the author.
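
To make the three types more concrete, the following minimal numpy sketch generates toy data for each type. It is illustrative only; the values and variable names are hypothetical.

# Illustrative toy data for the three outlier types
import numpy as np

rng = np.random.default_rng(42)

# Global outlier: a single point far away from an otherwise normal sample.
x_global = np.append(rng.normal(0, 1, 500), 25)

# Contextual outliers: a bimodal sample of winter and summer temperatures.
# A value of -5 is normal within the winter mode but would be an outlier
# within the summer mode (its context).
winter = rng.normal(0, 3, 500)
summer = rng.normal(25, 3, 500)
x_contextual = np.concatenate([winter, summer])

# Collective outliers: a small group of related values that together
# deviate from the bulk of the data, e.g., due to a batch processing error.
x_collective = np.concatenate([rng.normal(0, 1, 500), rng.normal(12, 0.5, 30)])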

One more aspect needs to be discussed before we can start modeling outliers: the data set itself. From a data set perspective, outliers can be detected based on a single feature (univariate) or based on multiple features per observation (multivariate). Keep on reading because the next section is about univariate and multivariate analysis.

A modeling approach for the detection of any type of outlier has two main flavors: univariate and multivariate analysis (Figure 2). I will focus on the detection of outliers for univariate random variables, but not before briefly describing the differences:

  • The univariate approach is when the sample/observation is marked as an outlier using one variable at a time, e.g., a person’s age, weight, or a single variable in time series data. Analyzing the data distribution in such cases is well-suited for outlier detection.
  • The multivariate approach is when the sample/observations contain multiple features that can be jointly analyzed, such as age, weight, and height together. It is well suited to detect outliers with features that have (non-)linear relationships or where the distribution of values in each variable is (highly) skewed. In these cases, the univariate approach may not be as effective, as it does not take into account the relationships between variables.
Figure 2. Overview of univariate vs. multivariate analysis for the detection of outliers. Image by the author.

There are various (non-)parametric methods for the detection of outliers in univariate data sets, such as Z-scores, Tukey’s fences, and density-based approaches, among others. The common theme across these methods is that the underlying distribution is modeled. The distfit library [6] is therefore well suited for outlier detection, as it can determine the Probability Density Function (PDF) for univariate random variables but can also model univariate data sets in a non-parametric manner using percentiles or quantiles. Moreover, it can be used to model anomalies or novelties in any of the three categories: global, contextual, or collective outliers. See this blog for more detailed information about distribution fitting using the distfit library [6]. The modeling approach can be summarized as follows:

  1. Compute the fit for your random variable across various PDFs, then rank PDFs using the goodness of fit test, and evaluate with a bootstrap approach. Note that non-parametric approaches with quantiles or percentiles can also be used.
  2. Visually inspect the histogram, PDFs, CDFs, and Quantile-Quantile (QQ) plot.
  3. Choose the best model based on steps 1 and 2, but also make sure the properties of the (non-)parametric model (e.g., the PDF) match the use case. Choosing the best model is not just a statistical question; it is also a modeling decision.
  4. Make predictions on new unseen samples using the (non-)parametric model such as the PDF.

Let’s start with a simple and intuitive example to demonstrate how novelty detection for univariate variables works using distribution fitting and hypothesis testing. In this example, our aim is to pursue a novelty approach for the detection of global outliers, i.e., the data does not contain observations that deviate from what is normal/expected. This means that, at some point, we should carefully include domain knowledge to set the boundaries of what an outlier looks like.

Suppose we have measurements of 10,000 human heights. Let’s generate random normal data with mean=163 and std=10 to represent our human height measurements. We expect a bell-shaped curve with two tails: those with smaller and those with larger heights than average. Note that due to the stochastic component, results can differ slightly when repeating the experiment.

# Import library
import numpy as np

# Generate 10000 samples from a normal distribution
X = np.random.normal(163, 10, 10000)

Step 1: Determine the PDFs that best fit human height.

Before we can detect any outliers, we need to fit a distribution (PDF) on what is normal/expected behavior for human height. The distfit library can fit up to 89 theoretical distributions. I will limit the search to common/popular probability density functions, as we already expect a bell-shaped curve (see the following code section).

# Install the distfit library (from the command line):
# pip install distfit

# Import libraries
from distfit import distfit
import matplotlib.pyplot as plt

# Initialize for common/popular distributions with bootstrapping.
dfit = distfit(distr='popular', n_boots=100)

# Estimate the best fit
results = dfit.fit_transform(X)

# Plot the RSS and bootstrap results for the top scoring PDFs
dfit.plot_summary(n_top=10)

# Show the plot
plt.show()

Figure 3. The RSS scores for the fit of human height with the most common distributions.

The loggamma PDF is detected as the best fit for human height according to the goodness of fit test statistic (RSS) and the bootstrapping approach. Note that the bootstrap approach evaluates whether the PDFs are overfitted. The bootstrap score ranges between [0, 1] and depicts the fit-success ratio across the number of bootstraps (n_boots=100) for the PDF. It can also be seen from Figure 3 that, besides the loggamma PDF, multiple other PDFs achieve a low Residual Sum of Squares, i.e., the Beta, Gamma, Normal, Student’s t, loggamma, generalized extreme value, and Weibull distributions. However, only five PDFs passed the bootstrap approach.
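
Besides the summary plot, the ranked results can also be inspected as a table. This is a small sketch, assuming the fitted results are stored in the summary attribute (a pandas DataFrame), as in recent distfit versions; the exact column names may differ between versions.

# Inspect the ranked goodness-of-fit and bootstrap results as a table.
print(dfit.summary)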

Step 2: Visual inspection of the best-fitting PDFs.

A best practice is to visually inspect the distribution fit. The distfit library contains built-in plotting functionality, such as the histogram combined with the PDF/CDF, and QQ-plots. The plot can be created as follows:

# Make figure
fig, ax = plt.subplots(1, 2, figsize=(20, 8))

# PDF for only the best fit
dfit.plot(chart='PDF', n_top=1, ax=ax[0])

# CDF for the top 10 fits
dfit.plot(chart='CDF', n_top=10, ax=ax[1])

# Show the plot
plt.show()

Figure 4. Pareto plot with the histogram for the empirical data and the estimated PDF. Left panel: PDF with the best fit (Beta). Right panel: CDF with the top 10 best fits. The confidence intervals are based on alpha=0.05.

A visual inspection confirms the goodness of fit scores for the top-ranked PDFs. However, there is one exception: the Weibull distribution (yellow line in Figure 4) appears to have two peaks. In other words, although the RSS is low, a visual inspection does not show a good fit for our random variable. Note that the bootstrap approach had already excluded the Weibull distribution, and now we know why.
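
The QQ-plot mentioned in step 2 can be created in the same manner as the other charts. A short sketch using the qqplot function (also used later in this article for the gas price data):

# QQ-plot: empirical quantiles vs. theoretical quantiles of the top fits.
# Strong deviations from the diagonal indicate a poor fit.
dfit.qqplot(X, n_top=10)
plt.show()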

Step 3: Decide by also using the PDF properties.

The last step may be the most challenging because there are still five candidate distributions that scored very well on the goodness of fit test, the bootstrap approach, and the visual inspection. We should now decide which PDF, based on its fundamental properties, best matches our use case of modeling human height. I will elaborate on the properties of the top candidate distributions one by one.

The Normal distribution is a typical choice, but it is important to note that the assumption of normality for human height may not hold in all populations. It has no heavy tails and may therefore not capture outliers very well.

The Student’s t-distribution is often used as an alternative to the normal distribution when the sample size is small or the population variance is unknown. It has heavier tails than the normal distribution, which can better capture the presence of outliers or skewness in the data. With low sample sizes, this distribution could have been an option, but as the sample size increases, the t-distribution approaches the normal distribution.

The Gamma distribution is a continuous distribution that is often used to model positively skewed data, meaning that there is a long tail of high values. Human height may be positively skewed due to the presence of outliers, such as very tall individuals. However, the bootstrap approach showed a poor fit.

The Log-gamma distribution has a skewed shape, similar to the gamma distribution, but with heavier tails. It models the log of the values, which makes it more appropriate when the data contains a large number of high values.

The Beta distribution is typically used to model proportions or rates [9], rather than continuous variables such as height in our use case. It would have been an appropriate choice if height were divided by a reference value, such as the median height. So although it scores best on the goodness of fit test and a visual inspection confirms a good fit, it would not be my first choice.

The Generalized Extreme Value (GEV) distribution can be used to model the distribution of extreme values in a population, such as the maximum or minimum values. It also allows heavy tails which can capture the presence of outliers or skewness in the data. However, it is typically used to model the distribution of extreme values [10], rather than the overall distribution of a continuous variable such as human height.

The Dweibull distribution may not be the best match for this research question as it is typically used to model data that has a monotonic increasing or decreasing trend, such as time-to-failure or time-to-event data [11]. Human height data may not have a clear monotonic trend. The visual inspection of the PDF/CDF/QQ-plot also showed no good match.

To summarize, the loggamma distribution may be the best choice in this particular use case after considering the goodness of fit test, the bootstrap approach, the visual inspection, and now also the PDF properties related to the research question. Note that we can easily specify the loggamma distribution and re-fit it on the input data if required (see the following code section).

# Initialize with the loggamma distribution and confidence intervals on both tails.
dfit = distfit(distr='loggamma', alpha=0.01, bound='both')

# Estimate the best fit
results = dfit.fit_transform(X)

# Print model parameters
print(dfit.model)

# {'name': 'loggamma',
# 'score': 6.676334203908028e-05,
# 'loc': -1895.1115726427015,
# 'scale': 301.2529482991781,
# 'arg': (927.596119872062,),
# 'params': (927.596119872062, -1895.1115726427015, 301.2529482991781),
# 'color': '#e41a1c',
# 'CII_min_alpha': 139.80923469906566,
# 'CII_max_alpha': 185.8446340627711}

# Save model
dfit.save('./human_height_model.pkl')
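
The saved model can be loaded again in a later session and reused for predictions without re-fitting. A minimal sketch, assuming the load function mirrors save:

# Load the previously saved model and reuse it for predictions.
from distfit import distfit

dfit = distfit()
dfit.load('./human_height_model.pkl')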

Step 4: Predictions for new unseen samples.

With the fitted model, we can assess the significance of new (unseen) samples and detect whether they deviate from what is normal/expected (the inliers). Predictions are made on the theoretical probability density function, making them lightweight, fast, and explainable. The confidence intervals for the PDF are set using the alpha parameter. This is the part where domain knowledge is required, because our data set contains no known outliers. In this case, I set the confidence interval (CII) to alpha=0.01, which results in a minimum boundary of 139.8 cm and a maximum boundary of 185.8 cm. The default is that both tails are analyzed, but this can be changed using the bound parameter (see code section above).
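
As an example of such a domain-driven choice: if only unusually tall persons are of interest, the analysis can be restricted to the upper tail. A short sketch using the bound parameter; I assume here that 'up' restricts the analysis to the upper tail, analogous to 'both' above.

# Analyze the upper tail only: flag only unusually tall samples.
dfit = distfit(distr='loggamma', alpha=0.01, bound='up')
results = dfit.fit_transform(X)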

We can use the predict function to make predictions on new unseen samples and create a plot with the prediction results (Figure 5). Be aware that significance is corrected for multiple testing: multtest='fdr_bh'. An observation can thus be located outside the confidence interval but not be marked as significant.

# New human heights
y = [130, 160, 200]

# Make predictions
results = dfit.predict(y, alpha=0.01, multtest='fdr_bh', todf=True)

# The prediction results
results['df']

# y y_proba y_pred P
# 0 130.0 0.000642 down 0.000428
# 1 160.0 0.391737 none 0.391737
# 2 200.0 0.000321 up 0.000107

# Make figure
fig, ax = plt.subplots(1, 2, figsize=(20, 8))
# PDF with the best fit and the new predictions
dfit.plot(chart='PDF', ax=ax[0])
# CDF with the best fit and the new predictions
dfit.plot(chart='CDF', ax=ax[1])
# Show plot
plt.show()

Figure 5. Left panel: Histogram of the empirical data and the log-gamma PDF. The black line is the empirical data distribution. The red line is the fitted theoretical distribution. The red vertical lines are the confidence intervals (alpha=0.01). The green dashed lines are detected as outliers, and the red crosses are not significant. (image by the author)

The prediction results are stored in results and contain multiple columns: y, y_proba, y_pred, and P. P contains the raw p-values, and y_proba the probabilities after multiple-test correction (default: fdr_bh). Note that a data frame is returned when using the todf=True parameter. Two observations have a probability below alpha=0.01 and are marked as significant, either up or down.
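
With the returned data frame, the significant outliers can be filtered directly. A small sketch based on the columns shown above:

# Keep only the samples that are marked as significant outliers.
df_results = results['df']
outliers = df_results[df_results['y_pred'] != 'none']
print(outliers)
#        y   y_proba y_pred         P
# 0  130.0  0.000642   down  0.000428
# 2  200.0  0.000321     up  0.000107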

So far we have seen how to fit a model and detect global outliers for novelty detection. Here we will use real-world data for the detection of anomalies. Real-world data is usually much more challenging to work with. To demonstrate this, I will download the natural gas spot price data set from Thomson Reuters [7], which is an open and freely available data set [8]. After downloading, importing, and removing NaN values, there are 6555 data points across 27 years.

# Initialize distfit
dfit = distfit()

# Import dataset
df = dfit.import_example(data='gas_spot_price')

print(df)
# price
# date
# 2023-02-07 2.35
# 2023-02-06 2.17
# 2023-02-03 2.40
# 2023-02-02 2.67
# 2023-02-01 2.65
# ...
# 1997-01-13 4.00
# 1997-01-10 3.92
# 1997-01-09 3.61
# 1997-01-08 3.80
# 1997-01-07 3.82

# [6555 rows x 1 columns]

Visual inspection of the data set.

To visually inspect the data, we can create a line plot of the natural gas spot price to see whether there are any obvious trends or other relevant matters (Figure 6). It can be seen that 2003 and 2021 contain two major peaks (which hint at global outliers). Furthermore, the price actions seem to have a natural movement with local highs and lows. Based on this line plot, we can build an intuition of the expected distribution. The price moves mainly in the range [2, 5], with some exceptional years from 2003 to 2009, where the range was closer to [6, 9].

# Create a line plot of the natural gas spot price
dfit.lineplot(df, xlabel='Years', ylabel='Natural gas spot price', grid=True)

# Show the plot
plt.show()

Figure 6. Open data set of the natural gas spot price from Thomson Reuters [7, 8].

Let’s use distfit to investigate the data distribution more deeply and determine the accompanying PDF. The search space is set to all available PDFs, and 100 bootstraps are used to evaluate the PDFs for overfitting.

# Import library
from distfit import distfit

# Initialize with the full search space and bootstrapping
dfit = distfit(distr='full', n_boots=100)

# Search for best theoretical fit.
results = dfit.fit_transform(df['price'].values)

# Plot PDF/CDF
fig, ax = plt.subplots(1,2, figsize=(25, 10))
dfit.plot(chart='PDF', n_top=10, ax=ax[0])
dfit.plot(chart='CDF', n_top=10, ax=ax[1])

# Show plot
plt.show()

Figure 7. left: PDF, and right: CDF. All fitted theoretical distributions are shown in different colors. (image by the author)

The best-fitting PDF is Johnsonsb (Figure 7), but when we plot it against the empirical data distribution, the PDF (red line) does not precisely follow the empirical data. In general, we can confirm that the majority of data points are located in the range [2, 5] (this is where the peak of the distribution is) and that there is a second, smaller peak in the distribution with price actions around a value of 6. This is also the point where the PDF does not smoothly fit the empirical data, causing some undershoots and overshoots. With the summary plot and QQ plot, we can investigate the fit further. Let’s create these two plots with the following lines of code:

# Plot Summary and QQ-plot
fig, ax = plt.subplots(1,2, figsize=(25, 10))

# Summary plot
dfit.plot_summary(ax=ax[0])

# QQplot
dfit.qqplot(df['price'].values, n_top=10, ax=ax[1])

# Show the plot
plt.show()

It is interesting to see in the summary plot that the goodness of fit test shows good results (low scores) for all the top distributions. However, the results of the bootstrap approach show that all but one distribution are overfitted (Figure 8A, orange line). This is not entirely unexpected, because we already noticed the overshooting and undershooting. The QQ plot confirms that the fitted distributions deviate strongly from the empirical data (Figure 8B). Only the Johnsonsb distribution shows a (borderline) good fit.

Figure 8. A. left panel: PDFs are sorted on the bootstrap score and the goodness of fit test. B. right panel: QQ-plot containing the comparison between empirical distribution vs. all other theoretical distributions. (image by the author)

Detection of Global and Contextual Outliers.

We will continue using the Johnsonsb distribution and the predict functionality for the detection of outliers. We already know that our data set contains outliers, as we followed the anomaly approach, i.e., the distribution is fitted on the inliers, and observations that fall outside the confidence intervals can now be marked as potential outliers. With the predict function and the lineplot, we can detect and plot the outliers. It can be seen from Figure 9 that the global outliers are detected, but also some contextual outliers, even though we did not explicitly model for them. Red bars are the underrepresented outliers and green bars are the overrepresented outliers. The alpha parameter can be set to tune the confidence intervals (see the sketch after Figure 9).

# Make prediction
dfit.predict(df['price'].values, alpha=0.05, multtest=None)

# Line plot with data points outside the confidence interval.
dfit.lineplot(df['price'], labels=df.index)

Figure 9. Plotting outliers after fitting distribution and making predictions. Green bars are outliers outside the upper bound of the 95% CII. Red bars are outliers outside the lower bound of the 95% CII.
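
As mentioned, the alpha parameter tunes the confidence intervals. A minimal sketch: a smaller alpha widens the interval boundaries, so fewer observations are flagged as outliers.

# Stricter detection: with alpha=0.01 the confidence interval covers 99%
# of the density, so fewer observations fall outside the boundaries.
dfit.predict(df['price'].values, alpha=0.01, multtest=None)
dfit.lineplot(df['price'], labels=df.index)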

