How can we determine the equivalent DBSCAN ε parameter for HDBSCAN-generated clusters?
HDBSCAN is a hierarchical density-based clustering algorithm that works under simple assumptions. At a minimum, it requires only two inputs: the data points to cluster and the minimum number of observations per cluster. The algorithm also accepts a precomputed distance matrix when the data lacks an obvious distance metric.
Like its predecessor, DBSCAN, it automatically detects the number of clusters and the surrounding noise. Unlike DBSCAN, the generated clustering may contain outliers that require special handling during post-processing. Also, HDBSCAN does not need the ε parameter, which, for DBSCAN, is the maximum density-reachable distance between points.
With the absence of such a parameter, HDBSCAN is free to scale the cluster’s density-reachable distance up or down, hence its hierarchical nature. This feature is convenient for users by freeing them from the seemingly arbitrary choice of that parameter’s value. On the other hand, knowing what distance to use is sometimes beneficial for post-processing purposes.
One such instance is the generation of geofences around the detected clusters for geospatial applications. After calculating which points belong to a cluster, we may need to devise a way to either graphically represent the cluster border or infer if newly acquired locations belong to this cluster. Generally, we will require them both, and the HDBSCAN algorithm only provides an approximation to the second. Rerunning the algorithm for each incoming location is usually not feasible.
There are many options to convert the identified cluster points into a geospatial shape, so let’s describe a few.
The Concave Hull
The concave hull algorithm allows us to pick a set of points and draw a fitting line around them in such a way as to respect their perceived shape.
Once the algorithm runs, we get a polygonal envelope that intersects the outermost cluster points, tightly fitting the shape. If we want to use this polygon as a geofence, we run into a conceptual problem: should we add a buffer zone around it? I illustrated this issue in the above article. Adding a buffer around the polygon makes sense to reflect the uncertainty around the location measurement and the intrinsic cluster dimensions. But what are these dimensions? When working with DBSCAN, we can use ε, but what can we use with HDBSCAN?
Bubbling Points
I used the bubbling points technique long ago to generate a geofence shape. The idea is elementary as it involves replacing each point with a circle and then merging them all while retaining the outer polygon. The picture below illustrates the process.
When using DBSCAN, we can use ε to derive the circle radius, but what should we do when using HDBSCAN?
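As an illustration, the bubbling technique can be sketched with Shapely; the library choice, function name, and radius value below are my assumptions, not the original implementation:

```python
# A minimal sketch of the bubbling-points technique using Shapely
# (an assumption of mine; the article does not name a library here).
from shapely.geometry import Point
from shapely.ops import unary_union

def bubble_geofence(points, radius):
    """Buffer each (x, y) point into a circle of the given radius
    and merge all circles into a single outer polygon."""
    circles = [Point(x, y).buffer(radius) for x, y in points]
    return unary_union(circles)

# Example: three overlapping circles merge into one polygon.
fence = bubble_geofence([(0.0, 0.0), (1.0, 0.0), (0.5, 0.8)], radius=0.6)
```

In practice, geographic coordinates should first be projected to a metric coordinate system so the radius can be expressed in meters.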
H3 Hexagons
As shown in other articles, we can use Uber’s H3 hexagons to generate a geofence from a set of cluster points. The selection of hexagon size is a requirement as we need to determine a reasonable detail level to cover all locations without gaps. The image below displays one such situation where there is a disconnected hexagon.
Again, with DBSCAN we can use ε to derive an appropriate resolution; with HDBSCAN, we cannot.
What metric can we extract from the HDBSCAN-generated clusters to assist us in such endeavors? We want to calculate a measure based on the cluster shape appropriate for the buffer radius on a hull-based geofence, a reasonable radius for the bubble-based circles, or a proper side length for the individual H3 hexagons.
Unfortunately, this problem formulation is a bit vague. There is no clear clue where we should look for the appropriate measure. Nevertheless, we can say what this metric is not. We are not looking at the largest distance between cluster points, nor do we want the shortest. If we want to create a buffer around the cluster points, these are not the metrics to use. The first measure will be too big for the purpose, while the second may be vanishingly small.
The solution I propose here involves calculating the shortest distance from each cluster location to all the others and deriving a metric from those values. One might reasonably argue that the equivalent ε is simply the maximum of these minimum distances; instead, the solution I present here uses statistics to derive the metric. I will explain why in what follows.
We then explore how these distances behave to determine an appropriate metric. More concretely, we will determine the statistical distribution of these minimum distances as a basis for our decision.
For this article, I use the Vehicle Energy Dataset data I have been exploring for quite some time. You can follow the article’s code on the GitHub repository, focusing on notebooks 11 and 12.
Minimum Distance Distribution
We start by studying the distribution of minimum distances for each node. This calculation implies looping over all locations and computing the distances to all the others in the cluster. Because the distances are symmetric, we can avoid redundant calculations and improve performance. Note that if we were using road distances, we would need a more sophisticated approach because they are not symmetric: the road distance from point A to point B is not necessarily the same as from B to A due to possible road restrictions.
There are two slightly different approaches to calculating the minimum distances. The first uses only the square pointwise distance matrix, while the second uses network theory. These approaches coincide in the majority of cases but occasionally produce different results. The matrix approach is more straightforward and faster, but it may count repeated distances, which the network theory approach does not. When the results differ, the former will be smaller than the latter.
The Distance Matrix Approach
We can easily calculate the minimum distance vector using a symmetric distance matrix. After calculating the matrix, we only need to find the minimum column values, ignoring the diagonal, which is zero.
The function below calculates the symmetric square matrix of distances, taking two arrays as input: one of latitudes and one of longitudes. Note how it avoids redundant calculations by computing only the upper triangle, one row at a time, and mirroring it to the lower triangle. Each step requires a smaller distance vector, improving performance.
The function above uses a vectorized version of the haversine distance calculation, as shown below.
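Since the repository's exact code lives in the notebooks, here is a minimal sketch of both functions; the names and signatures are my assumptions:

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters

def vec_haversine(lat1, lon1, lat2, lon2):
    """Vectorized haversine distance (meters) between one point and
    arrays of points; all inputs are in degrees."""
    lat1, lon1 = np.radians(lat1), np.radians(lon1)
    lat2, lon2 = np.radians(lat2), np.radians(lon2)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def distance_matrix(lats, lons):
    """Symmetric square matrix of haversine distances. Only the upper
    triangle is computed; the lower triangle is mirrored."""
    n = len(lats)
    dist = np.zeros((n, n))
    for i in range(n - 1):
        # Each step computes a shorter vector: point i against
        # points i+1 .. n-1.
        d = vec_haversine(lats[i], lons[i], lats[i + 1:], lons[i + 1:])
        dist[i, i + 1:] = d
        dist[i + 1:, i] = d
    return dist
```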
Once we have the square matrix, getting to the list of minimum values is very easy. The two functions below illustrate the process.
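A sketch of the two functions, assuming the distance matrix is a NumPy array (the names are mine, not necessarily the repository's):

```python
import numpy as np

def min_distances(dist):
    """Per-point minimum distance, ignoring the zero diagonal."""
    # Adding an infinite diagonal excludes each point's zero
    # self-distance from the row minimum.
    masked = dist + np.diag(np.full(dist.shape[0], np.inf))
    return masked.min(axis=1)

def cluster_min_distances(dist):
    """Sorted array of each point's nearest-neighbor distance."""
    return np.sort(min_distances(dist))
```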
This approach has the advantage of being fast, but it neither helps us build visual intuition nor excludes duplicate distances. An alternative solution from network theory lets us better understand what happens behind the scenes and yields unique distances.
The Network Theory Approach
We can think of the square distance matrix calculation as the network representation of distances between all the cluster locations. Each node is a location, while the edges between them carry the symmetric distance as their weight.
Now, we select the shortest edge for each node and remove all others (simplistically speaking). The resulting network is a tree as it connects all the nodes and has no cycles. If the edges carry the shortest possible weight (the geographic distance between locations), we call it a minimum spanning tree.
We can calculate this tree by using the NetworkX package. The code below illustrates the calculation process.
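A minimal sketch of that calculation with NetworkX, building a complete weighted graph from the distance matrix and extracting the minimum spanning tree edge weights (function and variable names are my assumptions):

```python
import numpy as np
import networkx as nx

def minimum_spanning_tree_distances(dist):
    """Build a complete graph from a symmetric distance matrix and
    return the edge weights of its minimum spanning tree."""
    n = dist.shape[0]
    graph = nx.Graph()
    for i in range(n - 1):
        for j in range(i + 1, n):
            graph.add_edge(i, j, weight=dist[i, j])
    tree = nx.minimum_spanning_tree(graph)
    return [data["weight"] for _, _, data in tree.edges(data=True)]
```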
Unfortunately, this method is an order of magnitude slower than the matrix-based one, but it excludes duplicate distances and provides a graphical representation of the minimum spanning tree. We can overlay the tree on a map and connect the dots, as seen in this article’s main picture.
What Distribution?
We now have two ways to calculate the list of minimum distances. The ensuing discussion uses the minimum spanning tree approach.
To infer the typical distribution of distances, we use the Fitter package, as illustrated by notebook 12. In this notebook, we iterate through all clusters, calculate their minimum spanning tree distances, and determine the most likely distribution. The table below displays the results of our dataset.
As the table above shows, most clusters follow a Log-Normal distribution. The second most frequent distribution is the Gamma, which has a similar shape to the Log-Normal.
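Fitter essentially automates fitting a battery of scipy.stats distributions and ranking them by goodness of fit. The sketch below hand-rolls a simplified version of that idea directly with SciPy; the function name and the Kolmogorov–Smirnov ranking are my assumptions, not the notebook's exact procedure:

```python
import numpy as np
from scipy import stats

def best_distribution(samples, candidates=("lognorm", "gamma", "norm")):
    """Fit each candidate scipy.stats distribution to the samples and
    return the one with the lowest Kolmogorov-Smirnov statistic."""
    results = {}
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(samples)  # maximum-likelihood fit
        ks = stats.kstest(samples, name, args=params)
        results[name] = ks.statistic
    return min(results, key=results.get)

# Example: samples drawn from a Log-Normal should be identified as such.
rng = np.random.default_rng(42)
samples = rng.lognormal(mean=3.0, sigma=0.5, size=2000)
```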
The Proposed Metric
Using these findings, I decided to base the metric on the Log-Normal parameters: m and s (calculated following Brilliant’s published formulas).
We calculate the values of 𝝁 and 𝜎 as the average and standard deviation of the sampled data. As the metric, I decided to use the following quantity.
The function that calculates the cluster metric from the list of minimum distances is quite simple.
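A sketch of such a function, using the standard moment-matching formulas to recover the Log-Normal parameters m and s from the sample mean and standard deviation. Note that the final combination, exp(m + 2s), is my illustrative assumption; the exact quantity the article uses is not reproduced here:

```python
import math

def cluster_metric(min_distances):
    """Estimate Log-Normal parameters m and s from the sample mean and
    variance, then combine them into a single distance metric.
    NOTE: exp(m + 2s) is an assumed combination, for illustration only."""
    n = len(min_distances)
    mu = sum(min_distances) / n
    var = sum((d - mu) ** 2 for d in min_distances) / n
    # Moment-matching estimates of the Log-Normal parameters:
    #   m = ln(mu^2 / sqrt(mu^2 + var)),  s = sqrt(ln(1 + var / mu^2))
    m = math.log(mu * mu / math.sqrt(mu * mu + var))
    s = math.sqrt(math.log(1.0 + var / (mu * mu)))
    return math.exp(m + 2.0 * s)
```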
We can now test this approach by drawing the geofences created with this metric and cluster locations on a map.
It’s time to look at how this metric behaves when drawing HDBSCAN-generated cluster borders. We start with the inflated concave hull in the picture below.
The bubble-shaped geofence uses the same cluster metric as the radius for the merged circles, centered around each cluster location. The resulting shape, pictured below, exaggerates the buffer area around the points. Note that this derives from the method and not from the metric.
Finally, we test the H3-based approach, using the cluster metric to derive the hexagon detail level. Due to the fixed geospatial grid of H3 hexagons, the buffer zone may not behave as expected, so you might have to tweak the calculations with a multiplicative factor. The geofence below uses a multiplicative factor of two.
The image below shows the same cluster with a larger geofence, calculated using a larger multiplier (three, in this case).
In this article, I tried to derive a metric that mimics the effect of DBSCAN’s ε parameter on HDBSCAN-generated clusters. My proposal is to infer the typical statistical distribution of the minimum spanning tree distances across the clusters and then use a simple function of the mean and standard deviation. I wanted a robust metric to feed the geofence generation algorithms that I had successfully used with DBSCAN in the past. Because HDBSCAN lacks the ε parameter that DBSCAN requires up front, there is no obvious value on which to base such a metric. The solution presented here is neither unique nor necessarily the best, but it seems appropriate for the task at hand, especially when using the inflated concave hull and H3 approaches.
joaofig/ved-explore: Exploration of the Vehicle Energy Dataset (github.com)
The hdbscan Clustering Library — hdbscan 0.8.1 documentation
H3: Uber’s Hexagonal Hierarchical Spatial Index | Uber Blog