How can we determine the equivalent DBSCAN ε parameter for HDBSCAN-generated clusters?
HDBSCAN is a hierarchical density-based clustering algorithm that works under simple assumptions. At a minimum, it requires only two inputs: the data points to cluster and the minimum number of observations per cluster. The algorithm also accepts a precomputed distance matrix when the data lacks an obvious distance metric.
Like its predecessor, DBSCAN, it automatically detects the number of clusters and the surrounding noise. Unlike DBSCAN, the generated clustering may contain outliers that require special handling during post-processing. Also, HDBSCAN does not need the ε parameter, which, for DBSCAN, is the maximum density-reachable distance between points.
With the absence of such a parameter, HDBSCAN is free to scale the cluster’s density-reachable distance up or down, hence its hierarchical nature. This feature is convenient for users by freeing them from the seemingly arbitrary choice of that parameter’s value. On the other hand, knowing what distance to use is sometimes beneficial for post-processing purposes.
One such instance is the generation of geofences around the detected clusters for geospatial applications. After calculating which points belong to a cluster, we may need to devise a way to either graphically represent the cluster border or infer if newly acquired locations belong to this cluster. Generally, we will require them both, and the HDBSCAN algorithm only provides an approximation to the second. Rerunning the algorithm for each incoming location is usually not feasible.
There are many options to convert the identified cluster points into a geospatial shape, so let’s describe a few.
The Concave Hull
The concave hull algorithm allows us to pick a set of points and draw a fitting line around them in such a way as to respect their perceived shape.
Once the algorithm runs, we get a polygonal envelope that intersects the outermost cluster points, tightly fitting the shape. If we want to use this polygon as a geofence, we run into a conceptual problem: should we add a buffer zone around it? I illustrated this issue in the above article. Adding a buffer around the polygon makes sense to reflect the uncertainty around the location measurement and the intrinsic cluster dimensions. But what are these dimensions? When working with DBSCAN, we can use ε, but what can we use with HDBSCAN?
Bubbling Points
I used the bubbling points technique long ago to generate a geofence shape. The idea is elementary as it involves replacing each point with a circle and then merging them all while retaining the outer polygon. The picture below illustrates the process.
When using DBSCAN, we can use ε to derive the circle radius, but what should we do when using HDBSCAN?
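As an illustration, the bubbling technique can be sketched with Shapely; the library choice, function name, and radius value below are my assumptions, not the original implementation:

```python
# A minimal sketch of the bubbling-points technique using Shapely
# (an assumption of mine; the article does not name a library here).
from shapely.geometry import Point
from shapely.ops import unary_union

def bubble_geofence(points, radius):
    """Buffer each (x, y) point into a circle of the given radius
    and merge all circles into a single outer polygon."""
    circles = [Point(x, y).buffer(radius) for x, y in points]
    return unary_union(circles)

# Example: three overlapping circles merge into one polygon.
fence = bubble_geofence([(0.0, 0.0), (1.0, 0.0), (0.5, 0.8)], radius=0.6)
```

In practice, geographic coordinates should first be projected to a metric coordinate system so the radius can be expressed in meters.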
H3 Hexagons
As shown in other articles, we can use Uber’s H3 hexagons to generate a geofence from a set of cluster points. The selection of hexagon size is a requirement as we need to determine a reasonable detail level to cover all locations without gaps. The image below displays one such situation where there is a disconnected hexagon.
Again, with DBSCAN we can use ε to derive an appropriate resolution; with HDBSCAN, we cannot.
What metric can we extract from the HDBSCAN-generated clusters to assist us in such endeavors? We want to calculate a measure based on the cluster shape appropriate for the buffer radius on a hull-based geofence, a reasonable radius for the bubble-based circles, or a proper side length for the individual H3 hexagons.
Unfortunately, this problem formulation is a bit vague. There is no clear clue where we should look for the appropriate measure. Nevertheless, we can say what this metric is not. We are not looking at the largest distance between cluster points, nor do we want the shortest. If we want to create a buffer around the cluster points, these are not the metrics to use. The first measure will be too big for the purpose, while the second may be vanishingly small.
The solution I propose here involves calculating the shortest distance from each cluster location to all the others and deriving a metric from those values. One might reasonably argue that the equivalent ε is simply the maximum of these minimum distances; instead, the solution I present here uses statistics to derive the metric. I will explain why in what follows.
We then explore how these distances behave to determine an appropriate metric. More concretely, we will determine the statistical distribution of these minimum distances as a basis for our decision.
For this article, I use the Vehicle Energy Dataset data I have been exploring for quite some time. You can follow the article’s code on the GitHub repository, focusing on notebooks 11 and 12.
Minimum Distance Distribution
We start by studying the distribution of minimum distances for each node. This calculation implies looping over all locations and computing the distances to all the others in the cluster. Because the distances are symmetric, we can avoid redundant calculations and improve performance. Note that if we were using road distances, we would need a more sophisticated approach because they are not symmetric: the road distance from point A to point B is not necessarily the same as from B to A due to possible road restrictions.
There are two slightly different approaches to calculating the minimum distances. The first uses only the square pointwise distance matrix, while the second uses network theory. These approaches coincide in the majority of cases but occasionally produce different results. The matrix approach is more straightforward and faster, but it may count repeated distances, which the network theory approach does not. When the results differ, the former will be smaller than the latter.
The Distance Matrix Approach
We can easily calculate the minimum distance vector using a symmetric distance matrix. After calculating the matrix, we only need to find the minimum column values, ignoring the diagonal, which is zero.
The function below calculates the symmetric square matrix of distances, taking two arrays as input: one of latitudes and one of longitudes. Note how it avoids redundant calculations by computing only the upper triangle, one row at a time, and mirroring it to the lower triangle. Each step requires a smaller distance vector, improving performance.
The function above uses a vectorized version of the haversine distance calculation, as shown below.
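Since the repository's exact code lives in the notebooks, here is a minimal sketch of both functions; the names and signatures are my assumptions:

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters

def vec_haversine(lat1, lon1, lat2, lon2):
    """Vectorized haversine distance (meters) between one point and
    arrays of points; all inputs are in degrees."""
    lat1, lon1 = np.radians(lat1), np.radians(lon1)
    lat2, lon2 = np.radians(lat2), np.radians(lon2)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def distance_matrix(lats, lons):
    """Symmetric square matrix of haversine distances. Only the upper
    triangle is computed; the lower triangle is mirrored."""
    n = len(lats)
    dist = np.zeros((n, n))
    for i in range(n - 1):
        # Each step computes a shorter vector: point i against
        # points i+1 .. n-1.
        d = vec_haversine(lats[i], lons[i], lats[i + 1:], lons[i + 1:])
        dist[i, i + 1:] = d
        dist[i + 1:, i] = d
    return dist
```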
Once we have the square matrix, getting to the list of minimum values is very easy. The two functions below illustrate the process.
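A sketch of the two functions, assuming the distance matrix is a NumPy array (the names are mine, not necessarily the repository's):

```python
import numpy as np

def min_distances(dist):
    """Per-point minimum distance, ignoring the zero diagonal."""
    # Adding an infinite diagonal excludes each point's zero
    # self-distance from the row minimum.
    masked = dist + np.diag(np.full(dist.shape[0], np.inf))
    return masked.min(axis=1)

def cluster_min_distances(dist):
    """Sorted array of each point's nearest-neighbor distance."""
    return np.sort(min_distances(dist))
```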
This approach has the advantage of being fast, but it neither helps us build visual intuition nor excludes duplicate distances. An alternative solution from network theory lets us better understand what happens behind the scenes and yields unique distances.
The Network Theory Approach
We can think of the square distance matrix calculation as the network representation of distances between all the cluster locations. Each node is a location, while the edges between them carry the symmetric distance as their weight.
Now, we select the shortest edge for each node and remove all others (simplistically speaking). The resulting network is a tree as it connects all the nodes and has no cycles. If the edges carry the shortest possible weight (the geographic distance between locations), we call it a minimum spanning tree.
We can calculate this tree by using the NetworkX package. The code below illustrates the calculation process.
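A minimal sketch of that calculation with NetworkX, building a complete weighted graph from the distance matrix and extracting the minimum spanning tree edge weights (function and variable names are my assumptions):

```python
import numpy as np
import networkx as nx

def minimum_spanning_tree_distances(dist):
    """Build a complete graph from a symmetric distance matrix and
    return the edge weights of its minimum spanning tree."""
    n = dist.shape[0]
    graph = nx.Graph()
    for i in range(n - 1):
        for j in range(i + 1, n):
            graph.add_edge(i, j, weight=dist[i, j])
    tree = nx.minimum_spanning_tree(graph)
    return [data["weight"] for _, _, data in tree.edges(data=True)]
```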
Unfortunately, this method is an order of magnitude slower than the matrix-based one, but it excludes duplicate distances and provides a graphical representation of the minimum spanning tree. We can overlay the tree on a map and connect the dots, as seen in this article’s main picture.
What Distribution?
We now have two ways to calculate the list of minimum distances. The ensuing discussion uses the minimum spanning tree approach.
To infer the typical distribution of distances, we use the Fitter package, as illustrated by notebook 12. In this notebook, we iterate through all clusters, calculate their minimum spanning tree distances, and determine the most likely distribution. The table below displays the results of our dataset.
As the table above shows, most clusters follow a Log-Normal distribution. The second most frequent distribution is the Gamma, which has a similar shape to the Log-Normal.
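Fitter essentially automates fitting a battery of scipy.stats distributions and ranking them by goodness of fit. The sketch below hand-rolls a simplified version of that idea directly with SciPy; the function name and the Kolmogorov–Smirnov ranking are my assumptions, not the notebook's exact procedure:

```python
import numpy as np
from scipy import stats

def best_distribution(samples, candidates=("lognorm", "gamma", "norm")):
    """Fit each candidate scipy.stats distribution to the samples and
    return the one with the lowest Kolmogorov-Smirnov statistic."""
    results = {}
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(samples)  # maximum-likelihood fit
        ks = stats.kstest(samples, name, args=params)
        results[name] = ks.statistic
    return min(results, key=results.get)

# Example: samples drawn from a Log-Normal should be identified as such.
rng = np.random.default_rng(42)
samples = rng.lognormal(mean=3.0, sigma=0.5, size=2000)
```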
The Proposed Metric
Using these findings, I decided to base the metric on the Log-Normal parameters: m and s (calculated following Brilliant’s published formulas).
We calculate the values of 𝝁 and 𝜎 as the average and standard deviation of the sampled data. As the metric, I decided to use the following quantity.
The function that calculates the cluster metric from the list of minimum distances is quite simple.
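A sketch of such a function, using the standard moment-matching formulas to recover the Log-Normal parameters m and s from the sample mean and standard deviation. Note that the final combination, exp(m + 2s), is my illustrative assumption; the exact quantity the article uses is not reproduced here:

```python
import math

def cluster_metric(min_distances):
    """Estimate Log-Normal parameters m and s from the sample mean and
    variance, then combine them into a single distance metric.
    NOTE: exp(m + 2s) is an assumed combination, for illustration only."""
    n = len(min_distances)
    mu = sum(min_distances) / n
    var = sum((d - mu) ** 2 for d in min_distances) / n
    # Moment-matching estimates of the Log-Normal parameters:
    #   m = ln(mu^2 / sqrt(mu^2 + var)),  s = sqrt(ln(1 + var / mu^2))
    m = math.log(mu * mu / math.sqrt(mu * mu + var))
    s = math.sqrt(math.log(1.0 + var / (mu * mu)))
    return math.exp(m + 2.0 * s)
```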
We can now test this approach by drawing the geofences created with this metric and cluster locations on a map.
It’s time to look at how this metric behaves when drawing HDBSCAN-generated cluster borders. We start with the inflated concave hull in the picture below.
The bubble-shaped geofence uses the same cluster metric as the radius for the merged circles, centered around each cluster location. The resulting shape, pictured below, exaggerates the buffer area around the points. Note that this derives from the method and not from the metric.
Finally, we test the H3-based approach, using the cluster metric to derive the hexagon detail level. Due to the fixed geospatial grid of H3 hexagons, the buffer zone may not behave as expected, so you might have to tweak the calculations with a multiplicative factor. The geofence below uses a multiplicative factor of two.
The image below shows the same cluster with a larger geofence, calculated using a larger multiplier (three, in this case).
In this article, I tried to derive a metric that mimics the effect of DBSCAN’s ε parameter on HDBSCAN-generated clusters. My proposal is to infer the typical statistical distribution of the minimum spanning tree distances across the clusters and then use a simple function of the mean and standard deviation. I wanted a robust metric to feed the geofence generation algorithms that I had successfully used with DBSCAN in the past. Because HDBSCAN lacks the ε parameter that DBSCAN requires up front, there is no obvious value on which to base such a metric. The solution presented here is neither unique nor necessarily the best, but it seems appropriate for the task at hand, especially when using the inflated concave hull and H3 approaches.
joaofig/ved-explore: Exploration of the Vehicle Energy Dataset (github.com)
The hdbscan Clustering Library — hdbscan 0.8.1 documentation
H3: Uber’s Hexagonal Hierarchical Spatial Index | Uber Blog