
7 Evaluation Metrics for Clustering Algorithms | by Kay Jan Wong | Dec, 2022



Photo by Markus Spiske on Unsplash

In Supervised Learning, the labels are known and evaluation can be done by calculating the degree of correctness by comparing the predicted values against the labels. However, in Unsupervised Learning, the labels are not known, which makes it hard to evaluate the degree of correctness as there is no ground truth.

That being said, a good clustering algorithm consistently produces clusters with small within-cluster variance (data points in a cluster are similar to each other) and large between-cluster variance (clusters are dissimilar to other clusters).

There are two types of evaluation metrics for clustering:

  • Extrinsic Measures: These measures require ground truth labels, which may not be available in practice.
  • Intrinsic Measures: These measures do not require ground truth labels, making them applicable to all unsupervised learning results.

This article will discuss the various evaluation metrics for clustering algorithms, focusing on their definition, intuition, when to use them, and how to implement them with the sklearn library. Formulas for all the algorithms can be found in the Appendix section of the article.

Note: I checked all the algorithms and formulas by hand; do reach out if you need the calculations! Otherwise, for each algorithm, the variables and formulas are explained in words and equations for better understanding in the Appendix 🙂

Extrinsic Measures

Intrinsic Measures

Photo by Angèle Kamp on Unsplash

Extrinsic Measures require ground truth labels, which may not be available or require manual labelling by humans.

Rand Index (RI, ARI) measures the similarity between two cluster assignments by making pairwise comparisons. A higher score signifies higher similarity.

For each pair of points, the assignment is considered correct if the pair is predicted to be in the same cluster when the points are indeed in the same cluster (somewhat like a “true positive”), and correct if the pair is predicted to be in different clusters when the points are indeed in different clusters (somewhat like a “true negative”).

Fig 1: Formula for Rand Index — Image by author

However, Rand Index does not consider chance; if the cluster assignment was random, there can be many cases of “true negative” by fluke. Ideally, we want random (uniform) label assignments to have scores close to 0, and this requires adjusting for chance.

Adjusted Rand Index (ARI) adjusts for chance by discounting a chance normalization term. The formula for ARI can be found in this article’s Appendix (Fig 2) to avoid visual clutter.
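
To make the pairwise intuition concrete, below is a minimal sketch that computes the Rand Index directly from pair counts. The helper name rand_index_pairs is my own, purely for illustration; it should agree with sklearn's rand_score.

from itertools import combinations

def rand_index_pairs(labels, labels_pred):
    # A pair counts as "agreeing" if both assignments place it in the
    # same cluster ("true positive"-like) or both place it in different
    # clusters ("true negative"-like).
    pairs = list(combinations(range(len(labels)), 2))
    agree = 0
    for i, j in pairs:
        same_true = labels[i] == labels[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)

# For the example used below: 10 of the 15 pairs agree, so RI = 10/15 ≈ 0.67
print(rand_index_pairs([0, 0, 0, 1, 1, 1], [1, 1, 2, 2, 3, 3]))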

When to use Rand Index

  • You want interpretability: RI is intuitive and easy to understand.
  • You are unsure about cluster structure: RI and ARI do not make assumptions about the cluster structure and can be applied to all clustering algorithms.
  • You want a basis for comparison: RI is bounded to the [0, 1] range, and ARI is bounded to the [-1, 1] range. The bounded ranges make it easy to compare scores between different algorithms.

When to NOT use Rand Index

  • You do not have the ground truth labels: RI and ARI are extrinsic measures and require ground truth cluster assignments.

Implementing Rand Index

from sklearn.metrics import rand_score, adjusted_rand_score

labels = [0, 0, 0, 1, 1, 1]       # ground truth labels
labels_pred = [1, 1, 2, 2, 3, 3]  # predicted cluster assignments

RI = rand_score(labels, labels_pred)
ARI = adjusted_rand_score(labels, labels_pred)

Mutual Information (MI, NMI, AMI) measures the agreement between two cluster assignments. A higher score signifies higher similarity.

The degree of agreement between the two assignments is computed from their joint and marginal probabilities. There are two variations of Mutual Information: Normalized Mutual Information (NMI) and Adjusted Mutual Information (AMI).

Normalized MI is MI divided by the average of the cluster entropies and is often used in the literature, while Adjusted MI is Normalized MI adjusted for chance by discounting a chance normalization term. The formulas for MI, NMI, and AMI can be found in this article’s Appendix (Fig 3 and 4).
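
As a sanity check on the probability view above, here is a minimal sketch that computes (unnormalized) MI from the joint and marginal probabilities of a contingency table. The helper name mutual_information is my own; it should agree with sklearn's mutual_info_score, which uses natural logarithms.

import numpy as np

def mutual_information(labels, labels_pred):
    # Build the joint probability table of (true class, predicted cluster).
    classes = sorted(set(labels))
    clusters = sorted(set(labels_pred))
    n = len(labels)
    p_joint = np.zeros((len(classes), len(clusters)))
    for y, c in zip(labels, labels_pred):
        p_joint[classes.index(y), clusters.index(c)] += 1 / n
    p_true = p_joint.sum(axis=1)  # marginal of the true assignment
    p_pred = p_joint.sum(axis=0)  # marginal of the predicted assignment
    # MI = sum of P(i, j) * log(P(i, j) / (P(i) * P(j))) over non-empty cells.
    mi = 0.0
    for i in range(len(classes)):
        for j in range(len(clusters)):
            if p_joint[i, j] > 0:
                mi += p_joint[i, j] * np.log(p_joint[i, j] / (p_true[i] * p_pred[j]))
    return mi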

When to use Mutual Information

  • You want a basis for comparison: NMI and AMI have an upper bound of 1. Note that unnormalized MI has no fixed upper bound, so prefer the normalized variants when comparing algorithms.

When to NOT use Mutual Information

  • You do not have the ground truth labels: MI, NMI, and AMI are extrinsic measures and require ground truth cluster assignments.

Implementing Mutual Information

from sklearn.metrics import (
    mutual_info_score,
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
)

labels = [0, 0, 0, 1, 1, 1]       # ground truth labels
labels_pred = [1, 1, 2, 2, 3, 3]  # predicted cluster assignments

MI = mutual_info_score(labels, labels_pred)
NMI = normalized_mutual_info_score(labels, labels_pred)
AMI = adjusted_mutual_info_score(labels, labels_pred)

V-measure measures the correctness of the cluster assignments using conditional entropy analysis. A higher score signifies higher similarity.

Two metrics measure the correctness of cluster assignments; they are intuitive because they mirror precision and recall from supervised learning.

  • Homogeneity: Each cluster contains only members of a single class (somewhat like “precision”)
  • Completeness: All members of a given class are assigned to the same cluster (somewhat like “recall”)

V-measure is the harmonic mean of the homogeneity and completeness measures, similar to how the F-score is the harmonic mean of precision and recall. The formulas for homogeneity, completeness, and V-measure can be found in this article’s Appendix (Fig 5).
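
The harmonic-mean relationship is easy to verify directly. Below is a minimal sketch (the helper name v_measure is my own): with beta = 1 it reduces to 2 * HS * CS / (HS + CS), exactly analogous to the F1-score.

from sklearn.metrics import homogeneity_score, completeness_score

def v_measure(labels, labels_pred, beta=1.0):
    hs = homogeneity_score(labels, labels_pred)
    cs = completeness_score(labels, labels_pred)
    if hs + cs == 0:
        return 0.0
    # Weighted harmonic mean; beta > 1 weights completeness more heavily.
    return (1 + beta) * hs * cs / (beta * hs + cs)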

When to use V-measure

  • You want interpretability: V-measure is intuitive and easy to understand in terms of homogeneity and completeness.
  • You are unsure about cluster structure: V-measure does not make assumptions about the cluster structure and can be applied to all clustering algorithms.
  • You want a basis for comparison: homogeneity, completeness, and V-measure are bounded to the [0, 1] range. The bounded range makes it easy to compare scores between different algorithms.

When to NOT use V-measure

  • You do not have the ground truth labels: Homogeneity, completeness, and V-measure are extrinsic measures and require ground truth cluster assignments.
  • Your sample size is less than 1000 and the number of clusters is more than 10: V-measure does not adjust for chance, which means random labelling would not yield zero scores, especially when the number of clusters is large.

Implementing V-measure

from sklearn.metrics import (
    homogeneity_score,
    completeness_score,
    v_measure_score,
)

labels = [0, 0, 0, 1, 1, 1]       # ground truth labels
labels_pred = [1, 1, 2, 2, 3, 3]  # predicted cluster assignments

HS = homogeneity_score(labels, labels_pred)
CS = completeness_score(labels, labels_pred)
V = v_measure_score(labels, labels_pred, beta=1.0)

Fowlkes-Mallows Scores measure the correctness of the cluster assignments using pairwise precision and recall. A higher score signifies higher similarity.

While V-measure is a harmonic mean between homogeneity (“precision”) and completeness (“recall”), Fowlkes-Mallows Index (FMI) is the geometric mean of pairwise precision and recall, using True Positive (TP), False Positive (FP), and False Negative (FN).

Since the Fowlkes-Mallows Score does not take into account True Negative (TN), it will not be affected by chance, and there is no need for chance adjustment, unlike the Rand Index and Mutual Information.

The definitions of TP, FP, and FN and the formula for the Fowlkes-Mallows Index (FMI) can be found in this article’s Appendix (Fig 6 and 7).
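
To make the pairwise counts concrete, here is a minimal sketch that computes FMI from TP, FP, and FN. The helper name fowlkes_mallows is my own; it should agree with sklearn's fowlkes_mallows_score.

from itertools import combinations
from math import sqrt

def fowlkes_mallows(labels, labels_pred):
    # Pairwise counts: TP = same cluster in both assignments,
    # FP = same cluster only in the prediction, FN = only in the truth.
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_true = labels[i] == labels[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    # Geometric mean of pairwise precision and recall; note TN never appears.
    return tp / sqrt((tp + fp) * (tp + fn)) if tp else 0.0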

When to use Fowlkes-Mallows Scores

  • You are unsure about cluster structure: Fowlkes-Mallows Score does not make assumptions about the cluster structure and can be applied to all clustering algorithms.
  • You want a basis for comparison: Fowlkes-Mallows Score has an upper bound of 1. The bounded range makes it easy to compare the scores between different algorithms.

When to NOT use Fowlkes-Mallows Scores

  • You do not have the ground truth labels: Fowlkes-Mallows Scores are extrinsic measures and require ground truth cluster assignments.

Implementing Fowlkes-Mallows Scores

from sklearn.metrics import fowlkes_mallows_score

labels = [0, 0, 0, 1, 1, 1]       # ground truth labels
labels_pred = [1, 1, 2, 2, 3, 3]  # predicted cluster assignments

FMI = fowlkes_mallows_score(labels, labels_pred)

Photo by Pierre Bamin on Unsplash

Intrinsic Measures do not require ground truth labels, making them applicable to all clustering results.

Silhouette Coefficient measures the between-cluster distance against within-cluster distance. A higher score signifies better-defined clusters.

The Silhouette Coefficient of a sample compares its average distance to all other points in its own cluster (within-cluster distance) against its average distance to the points in the next nearest cluster (between-cluster distance). A higher ratio signifies that the sample’s cluster is far from its nearest neighbouring cluster and hence more well-defined.

The Silhouette Coefficient for a set of samples is the mean of the per-sample Silhouette Coefficients. The formula is found in this article’s Appendix (Fig 8).
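
Here is a minimal per-sample sketch of s = (b - a) / max(a, b), assuming Euclidean distance and non-singleton clusters. The helper name silhouette_sample is my own; averaging it over all samples should agree with sklearn's silhouette_score.

import numpy as np

def silhouette_sample(X, clusters, i):
    X = np.asarray(X, dtype=float)
    clusters = np.asarray(clusters)
    dists = np.linalg.norm(X - X[i], axis=1)  # distances from sample i
    own = clusters == clusters[i]
    # a: mean distance to the other points in sample i's own cluster
    # (the self-distance is 0, so including it does not change the sum).
    a = dists[own].sum() / (own.sum() - 1)
    # b: mean distance to the points of the nearest other cluster.
    b = min(dists[clusters == c].mean() for c in set(clusters) if c != clusters[i])
    return (b - a) / max(a, b)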

When to use Silhouette Coefficient

  • You want interpretability: The Silhouette Coefficient is intuitive and easy to understand.
  • You want a basis for comparison: Silhouette Coefficient is bounded to the [-1, 1] range, from incorrect clustering (-1) to highly dense, well-separated clustering (+1), with scores around 0 indicating overlapping clusters. The bounded range makes it easy to compare scores between different algorithms.
  • You define good clusters as well-defined clusters: Silhouette Coefficient follows the general definition of good clusters being dense and well-separated.

When to NOT use Silhouette Coefficient

  • You are comparing different types of clustering algorithms: Silhouette Coefficient scores tend to be higher for convex clusters than for other cluster concepts, such as the density-based clusters produced by DBSCAN, so comparisons against other types of clustering algorithms would be unfair.

Implementing Silhouette Coefficient

from sklearn.metrics import silhouette_score

data = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],
]
clusters = [1, 1, 2, 2, 3, 3]  # predicted cluster assignments

s = silhouette_score(data, clusters, metric="euclidean")

Calinski-Harabasz Index measures the between-cluster dispersion against within-cluster dispersion. A higher score signifies better-defined clusters.

The Calinski-Harabasz Index, also known as the Variance Ratio Criterion, measures the sum of between-cluster dispersion against the sum of within-cluster dispersion, where dispersion is defined as the sum of squared distances.

A higher ratio signifies that clusters are far apart from one another and more well-defined. The formula is found in this article’s Appendix (Fig 9).
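
Here is a minimal sketch of the ratio, assuming Euclidean distance, with between-cluster dispersion (BCSS) and within-cluster dispersion (WCSS) each normalized by their degrees of freedom. The helper name calinski_harabasz is my own; it should agree with sklearn's calinski_harabasz_score.

import numpy as np

def calinski_harabasz(X, clusters):
    # Assumes 1 < k < n (at least two clusters, no cluster holds everything).
    X = np.asarray(X, dtype=float)
    clusters = np.asarray(clusters)
    n, k = len(X), len(set(clusters))
    overall_mean = X.mean(axis=0)
    bcss = wcss = 0.0
    for c in set(clusters):
        members = X[clusters == c]
        centroid = members.mean(axis=0)
        # Between-cluster dispersion: cluster size times the squared
        # distance of the cluster centroid to the overall mean.
        bcss += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Within-cluster dispersion: squared distances to the centroid.
        wcss += np.sum((members - centroid) ** 2)
    return (bcss / (k - 1)) / (wcss / (n - k))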

When to use Calinski-Harabasz Index

  • You want efficiency: The Calinski-Harabasz Index is fast to compute.
  • You define good clusters as well-defined clusters: Calinski-Harabasz Index follows the general definition of good clusters being dense and well-separated.

When to NOT use Calinski-Harabasz Index

  • You are comparing different types of clustering algorithms: Calinski-Harabasz Index tends to be higher for convex clusters than for other cluster concepts, such as density-based clusters, so comparisons against other types of clustering algorithms would be unfair.

Implementing Calinski-Harabasz Index

from sklearn.metrics import calinski_harabasz_score

data = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],
]
clusters = [1, 1, 2, 2, 3, 3]  # predicted cluster assignments

s = calinski_harabasz_score(data, clusters)

Davies-Bouldin Index measures the size of clusters against the average distance between clusters. A lower score signifies better-defined clusters.

The Davies-Bouldin Index measures the average similarity between clusters, where similarity compares the size of clusters against the between-cluster distance.

A lower score means that clusters are small relative to the distances between them, and hence well-defined. The formula is found in this article’s Appendix (Fig 10).
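
Here is a minimal sketch, assuming Euclidean distance: each cluster's "size" s_i is the average distance of its points to its centroid, and the index averages, over clusters, the worst-case similarity (s_i + s_j) / d_ij to any other cluster. The helper name davies_bouldin is my own; it should agree with sklearn's davies_bouldin_score.

import numpy as np

def davies_bouldin(X, clusters):
    X = np.asarray(X, dtype=float)
    clusters = np.asarray(clusters)
    ids = sorted(set(clusters))
    centroids = [X[clusters == c].mean(axis=0) for c in ids]
    # s_i: average distance of each cluster's points to its own centroid.
    s = [np.linalg.norm(X[clusters == c] - centroids[i], axis=1).mean()
         for i, c in enumerate(ids)]
    total = 0.0
    for i in range(len(ids)):
        # Worst-case similarity of cluster i to any other cluster j:
        # larger cluster sizes and closer centroids both increase it.
        total += max(
            (s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(ids)) if j != i
        )
    return total / len(ids)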

When to use Davies-Bouldin Index

  • You want simpler computation: the Davies-Bouldin Index is simpler to compute than the Silhouette Coefficient, and it relies only on quantities inherent to the dataset (point-wise distances).

When to NOT use Davies-Bouldin Index

  • You are comparing different types of clustering algorithms: Davies-Bouldin Index tends to be higher for convex clusters than for other cluster concepts, such as density-based clusters, so comparisons against other types of clustering algorithms would be unfair.
  • You want a distance measure other than Euclidean distance: cluster size is computed from centroid distances, which limits the distance metric to Euclidean space.

Implementing Davies-Bouldin Index

from sklearn.metrics import davies_bouldin_score

data = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],
]
clusters = [1, 1, 2, 2, 3, 3]  # predicted cluster assignments

DB = davies_bouldin_score(data, clusters)

The characteristics of each evaluation metric are summarized below:

Table 1: Characteristics of Clustering Algorithms — Image by author

I hope you have learned more about the different ways to evaluate a clustering algorithm, using intrinsic or extrinsic measures depending on whether you have the ground-truth labels. In practice, we may care more about whether the clusters make business sense than about statistical measures of within-cluster or between-cluster distance. Nonetheless, these evaluation metrics are still good to know!

The formula for Adjusted Rand Index (ARI)

The chance normalization term considers the number of pairs occurring in the same cluster in the actual cluster assignment and in the predicted cluster assignment.

Fig 2: Formula for chance normalization term and Adjusted Rand Index — Image by author

The formula for Mutual Information (MI, NMI, AMI)

The formulas for joint and marginal probabilities and entropies form the basis for calculating Mutual Information.

Fig 3: Formula for joint and marginal probabilities and entropies — Image by author
Fig 4: Formula for MI, Normalized MI, and Adjusted MI — Image by author

The formula for V-measure

Fig 5: Formula for Homogeneity, Completeness, and V-measure — Image by author

The formula for Fowlkes-Mallows Scores

Fowlkes-Mallows Index (FMI) is calculated using True Positive, False Positive, and False Negative. TP, FP, and FN are defined by counting pairs of points according to whether they are allocated to the same or different clusters in the predicted and actual assignments.

Fig 6: Definition of TP, FP, TN, FN — Image by author
Fig 7: Formula for Fowlkes-Mallows Score — Image by author

The formula for Silhouette Coefficient

Note that for the calculation of b, the next nearest cluster is taken with respect to the sample itself, not with respect to the sample’s assigned cluster.

Fig 8: Formula for Silhouette Coefficient — Image by author

The formula for Calinski-Harabasz Index

Fig 9: Formula for Calinski-Harabasz Index — Image by author

The formula for Davies-Bouldin Index

Fig 10: Formula for Davies-Bouldin Index — Image by author


