
From Data to Clusters: When is Your Clustering Good Enough? | by Erdogan Taskesen | Apr, 2023

To determine data-driven clusters that may hold new or unknown information, we need to carefully perform four consecutive steps to go from the input data set to sensible clusters. Libraries such as distfit and clusteval can help in this process.

Coming up with a sensible grouping of samples requires more than blindly running a clustering algorithm.

Step 1: Investigate the underlying distribution of the data.

Investigating the underlying distribution of the data is an important step because clustering algorithms rely on the statistical properties of the data to identify patterns and group similar data points together. By understanding the distribution of the data, such as its mean, variance, skewness, and kurtosis, we can make informed decisions about which clustering algorithm to use and how to set its parameters to achieve optimal results. Furthermore, investigating the data distribution provides insights into the appropriate normalization or scaling techniques to apply before clustering. While supervised approaches, such as tree models, can handle mixed data sets, clustering algorithms are designed to work with homogeneous data; all variables should have similar types or units of measurement. Normalization or scaling is thus an important step because clustering algorithms group data points based on their similarity under a distance metric. More details about investigating the underlying data distribution can be found in [1].
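As a minimal sketch (with illustrative data and parameters, not the article's own), the distfit library can be used to estimate the best-fitting theoretical distribution of a variable, and a standard scaler can bring variables onto comparable units before distance-based clustering:

```python
import numpy as np
from distfit import distfit
from sklearn.preprocessing import StandardScaler

# Illustrative univariate sample; in practice, use a column of your own data set.
x = np.random.normal(loc=5, scale=2, size=1000)

# Fit a set of candidate theoretical distributions and keep the best-scoring one.
dfit = distfit()
dfit.fit_transform(x)
print(dfit.model['name'])  # best-fitting distribution, e.g. 'norm'

# Scale all features to zero mean and unit variance before distance-based clustering.
X = np.random.rand(1000, 5)  # placeholder feature matrix
X_scaled = StandardScaler().fit_transform(X)
```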

Step 2: Make an educated guess of the cluster density and the expected cluster sizes.

Setting expectations about cluster densities, shapes, and the number of clusters helps to select the appropriate clustering algorithm and parameter settings to achieve the desired outcome. In addition, setting expectations provides more confidence in the validity of the results when interpreting and communicating them to stakeholders in a meaningful way. For example, if the aim is to identify rare anomalies in a dataset and the clustering results produce a small cluster with very low density, it could indicate the presence of such anomalies. However, it is not always possible to set expectations about the number of clusters or the density. In that case, we need to select the clustering method(s) based solely on the mathematical properties that match the statistical properties of the data and the research aim.
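As a hypothetical illustration (synthetic data, not from the article), the example below creates two large clusters and one deliberately small, tight group, so the expectation of a rare, low-volume cluster can be checked against what a clustering algorithm actually returns:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Two large clusters plus one deliberately small group as a stand-in for rare anomalies.
X_large, _ = make_blobs(n_samples=[500, 500], centers=[[0, 0], [8, 8]],
                        cluster_std=1.0, random_state=1)
X_small = np.random.normal(loc=[4, -4], scale=0.2, size=(15, 2))
X = np.vstack([X_large, X_small])

# Expecting ~3 groups of very different sizes already rules out methods that assume
# equally sized, equally dense clusters, and argues for a density-based approach.
print(X.shape)  # (1015, 2)
```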

Step 3: Select the Clustering Method.

The selection of a clustering method depends on the previous steps, but we should also consider factors such as scalability, robustness, and ease of use. For example, in a production setting, we may need different properties than for experimental use cases. There are several popular clustering methods, such as K-means, hierarchical, and density-based clustering algorithms, each with its own assumptions, advantages, and limitations (see the summary below). After selecting the clustering method, we can start clustering the data and evaluating the performance.

K-means assumes that clusters are spherical, equally sized, and have similar densities. It requires specifying the number of clusters (k) in advance. Note that the optimal number of clusters is detected automatically in the clusteval library.

Hierarchical clustering builds a tree-like structure of clusters by recursively merging clusters based on distance or similarity metrics. It is agglomerative (bottom-up) and does not require specifying the number of clusters in advance.

Density-based clustering algorithms, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), group together data points that are densely packed, with regions of lower density separating the clusters. They do not assume any specific shape or size of clusters and can identify clusters of arbitrary shapes and sizes. Density-based clustering is particularly useful for identifying clusters of varying densities and detecting outliers as noise points. However, it requires tuning of hyperparameters, such as the density threshold, and is sensitive to the choice of parameters.

Different clustering methods can result in different partitionings of the samples, and thus different groupings, since each method implicitly imposes a structure on the data.
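To make this concrete, here is a small sketch on synthetic data (assuming the clusteval API as documented: the cluster and evaluate arguments and the 'labx' result key) that runs two different methods on the same data so the resulting partitions can be compared:

```python
from sklearn.datasets import make_blobs
from clusteval import clusteval

# Synthetic data with four Gaussian blobs.
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.2, random_state=42)

# Agglomerative clustering, evaluated with the silhouette score.
ce_agg = clusteval(cluster='agglomerative', evaluate='silhouette')
results_agg = ce_agg.fit(X)

# DBSCAN on the same data; being density-based, it may return a different partitioning.
ce_db = clusteval(cluster='dbscan')
results_db = ce_db.fit(X)

# 'labx' contains the detected cluster label per sample.
print(set(results_agg['labx']), set(results_db['labx']))
```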

Step 4: Cluster Evaluation.

Cluster evaluation aims to assess the clustering tendency, the quality of the clusters, and the optimal number of clusters. There are various cluster evaluation methods, the most popular of which are incorporated in the clusteval library: the Silhouette score, the Davies-Bouldin index, and the derivative (or elbow) method. The difficulty in using such techniques is that the clustering step and its evaluation are often intertwined; the clusteval library handles this internally. In the next section, I will dive into the various methods and test their performance.
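As a minimal sketch of that workflow (synthetic data; the evaluate argument values and the 'labx' key are assumed from the clusteval documentation), the three evaluation methods can be run on the same data to see whether they agree on the number of clusters:

```python
from sklearn.datasets import make_blobs
from clusteval import clusteval

X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.0, random_state=0)

# Compare the three evaluation methods on the same (K-means) clustering.
for evaluate in ['silhouette', 'dbindex', 'derivative']:
    ce = clusteval(cluster='kmeans', evaluate=evaluate)
    results = ce.fit(X)
    print(evaluate, 'detected clusters:', len(set(results['labx'])))

# ce.plot() and ce.scatter(X) visualize the scores and the resulting clustering.
```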

Cluster evaluation should be performed during the clustering process to assign scores to each clustering and enable meaningful comparisons between the results.

The Next Step: From Clusters to Insights.

The step after finding the optimal number of clusters is to determine the driving features behind the clusters. A great way to do this is enrichment analysis with methods such as HNET [2], where the hypergeometric test and the Mann-Whitney U test are used to test for significant associations between the cluster labels and the features.
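The following sketch illustrates the underlying idea with plain scipy tests on a hypothetical data frame (feature names and data are made up; HNET itself wraps this kind of testing with multiple-testing correction and more):

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu, hypergeom

# Hypothetical data set: one continuous feature, one categorical feature, and cluster labels.
df = pd.DataFrame({'age': np.random.normal(40, 10, 300),
                   'smoker': np.random.choice(['yes', 'no'], 300),
                   'cluster': np.random.choice([0, 1, 2], 300)})

for c in sorted(df['cluster'].unique()):
    in_c = df['cluster'] == c

    # Continuous feature: Mann-Whitney U test, cluster c versus all other samples.
    _, p_age = mannwhitneyu(df.loc[in_c, 'age'], df.loc[~in_c, 'age'])

    # Categorical feature: hypergeometric test for overrepresentation of 'yes' in cluster c.
    M = len(df)                                      # population size
    n = int((df['smoker'] == 'yes').sum())           # successes in the population
    N = int(in_c.sum())                              # cluster size (number of draws)
    k = int(((df['smoker'] == 'yes') & in_c).sum())  # successes within the cluster
    p_smoker = hypergeom.sf(k - 1, M, n, N)

    print(f"cluster {c}: p_age={p_age:.3f}, p_smoker={p_smoker:.3f}")
```

A feature with a small p-value in a given cluster is a candidate driving feature for that cluster; in practice, the p-values should be corrected for multiple testing before interpretation.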

