
Cluster Analysis for Aspiring Data Scientists

by Alex Vamvakaris, February 2023



Photo by israel palacio on Unsplash

Going back to my time as an undergraduate student in statistics, there is a day that stands out. It was the first day of the multivariate analysis module. The class was new at the time, and to our surprise, the professor decided to do something different. Instead of going through the agenda for the semester, he turned off the lights and announced that today we would learn about the module differently — by watching the pilot episode of the series “Numb3rs”.

The series focuses on FBI Special Agent Don Eppes and his brother, Charles Eppes, a brilliant mathematician who uses data science to ferret out the trickiest criminals. More specifically, in the pilot episode, Charles uses cluster analysis to pinpoint the criminal’s point of origin. Needless to say, we were all hooked.

Fast forward to today: I have worked across different industries, from eCommerce retail to consultancy and gaming, and the one constant desire from business stakeholders has been to segment (cluster) their customers or users. It seems that clustering piques the interest of both data scientists and business stakeholders (as well as movie audiences).

In this article, I have created a guide for aspiring data scientists who want to add cluster analysis to their arsenal and get one step closer to landing their first data science job. The guide is structured in two parts:

  • Introduction to Clustering: Understanding the three building blocks of cluster analysis
  • A Step-by-Step Case Study of Clustering in R: Visualizing the data, data processing, computing similarity matrix (distances), selecting the number of clusters, and describing results
Photo by Hans-Peter Gauster on Unsplash

1. Introduction to Clustering

1.1. What is the Goal of Clustering?

Clustering is used to group entities based on a defined set of characteristics. For example, we can use clustering to group customers (entities) based on their shopping behavior (characteristics).

1.2. How Does Clustering Work?

In supervised machine-learning techniques (like linear regression), the algorithm is trained on a labeled dataset with the goal of maximizing prediction accuracy on new unlabeled data. For example, an algorithm can be trained on a labeled dataset that includes characteristics of houses (square footage, number of bedrooms, etc.) and their corresponding label (sales price) with the goal of creating a model that predicts with high accuracy the sales price of a new house, based on its respective characteristics.
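To make the supervised setup concrete, here is a minimal sketch in R (the houses data frame and its columns are hypothetical, purely for illustration):

##############################
# Supervised learning sketch (hypothetical data)
##############################
houses <- data.frame(
  sqft = c(1200, 1800, 2400, 3000),
  bedrooms = c(2, 3, 4, 4),
  price = c(230000, 310000, 400000, 455000) # the label
)

# Train on the labeled data...
model <- lm(price ~ sqft + bedrooms, data = houses)

# ...then predict the label for a new, unlabeled house
predict(model, newdata = data.frame(sqft = 2000, bedrooms = 3))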

Clustering, on the other hand, is an unsupervised machine-learning technique. The dataset has no label (hence the name unsupervised). Instead, the goal is to create a new label in the form of a cluster assignment. Every cluster analysis will be different, but in each case, there are three main building blocks, or steps if you prefer:

  1. Create a dataset that is unique by row for the entity you want to cluster. So if you want to cluster customers, each row must represent a different customer. For each of these rows, you will have attributes (columns) that describe specific characteristics, like revenue in the past year, favorite product, number of days since last purchase, etc.
  2. Compute the similarity between each row and all others in your dataset (based on the available attributes). So we will have a similarity value between the first and second row, the first and third row, the second and third row, and so on. The most frequently used similarity functions are distances. We will go through these in more detail in the case study in the next section, as they are better explained using an example.
  3. Input the similarity matrix into a clustering algorithm. There are many available in R and Python, like k-means, hierarchical, and PAM. Clustering algorithms have a simple purpose: assign observations to clusters so that observations within a cluster are as similar as possible and observations in different clusters are as dissimilar as possible. In the end, you will have a cluster assignment for each row (see the sketch after this list).
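Putting the three building blocks together, here is a minimal end-to-end sketch (the customers data frame is hypothetical; daisy and pam from the cluster package are the same functions we will use in the case study below):

##############################
# Three building blocks (sketch)
##############################
library("cluster")

# Block 1: one row per entity (customer), columns are characteristics
customers <- data.frame(
  revenue_last_year = c(120, 950, 80, 1100, 300),
  days_since_order = c(5, 40, 200, 10, 90),
  favorite_product = factor(c("A", "B", "A", "B", "C"))
)

# Block 2: pairwise dissimilarities (Gower handles mixed types)
dist_matrix <- daisy(customers, metric = "gower")

# Block 3: feed the similarity matrix to a clustering algorithm
fit <- pam(dist_matrix, diss = TRUE, k = 2)
fit$clustering # cluster assignment for each row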
Photo by Chris Lawton on Unsplash

2. A Step-by-Step Case Study of Clustering in R

Even though I couldn’t find the dataset used in the pilot episode of the series “Numb3rs”, I thought it would be fitting to use something similar for our case study, so I picked the USArrests dataset, loaded from the datasets package (part of base R).

The dataset contains 50 rows and 4 attributes. Each row is a US state (entity). The four attributes describe the following characteristics of each state:

  • Murder: Number of murder arrests per 100,000 residents in 1973
  • Assault: Number of assault arrests per 100,000 residents in 1973
  • Rape: Number of rape arrests per 100,000 residents in 1973
  • UrbanPop: Percent of the population living in urban areas in 1973

We also have the state name as row names. These will not be used as input for the clustering algorithm.
##############################
# Loading libraries
##############################
library("dplyr")    # summarizing data
library("ggplot2")  # visualization
library("cluster")  # Gower distance and PAM
library("alluvial") # alluvial() used for the alluvial plot below

##############################
# Loading the data set
##############################
data("USArrests")

##############################
# Examine data set
##############################
head(USArrests) # first six rows

Overview of USArrests dataset in R [Image by the author]

2.1. Visualizing the Data

Looking at the box plots below, all four attributes seem approximately symmetric (UrbanPop is slightly left-skewed, and Rape is slightly right-skewed). If an attribute were heavily skewed or if there were outliers, we could consider applying a transformation (such as a log transformation). This is not necessary with our dataset.

##############################
# Box plot for each attribute
##############################
USArrests %>%
  select(c(Murder, Assault, Rape, UrbanPop)) %>%
  boxplot(
    boxwex = 0.7,
    col = c("red3", "lightgreen", "lightblue", "purple"),
    horizontal = TRUE
  )

Box Plot for each attribute in the USArrests dataset [Image by the author]

You can also see that the attribute Assault has much higher values than the other three. It is a good practice in clustering to standardize all attributes to the same scale before computing the similarities. This is done so that attributes on a larger scale do not overcontribute (Assault is in the hundreds, whereas the other three attributes are in the tens). Fortunately, the similarity function we will use takes care of that, so we do not need to do any rescaling at this stage.

2.2. Computing Similarity Matrix

For our case study, we will use the Gower distance from the daisy function (cluster package) in R to compute the similarity matrix. To be precise, Gower computes dissimilarities (distances) rather than similarities, but the concept is the same: instead of maximizing similarity, we minimize dissimilarity.

Gower is one of the few distance functions that can work with mixed-type data (numerical and categorical attributes). Another advantage is that Gower scales all distances to be between 0 and 1 (attributes on high scales do not overcontribute).
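For numeric attributes, Gower divides the absolute difference between two values by the attribute’s range, so each attribute contributes a value between 0 and 1, and the distance between two rows is the average of those contributions. A hand-rolled sketch for intuition (illustrative only; the daisy function below is what we actually use):

##############################
# Gower for numeric attributes (sketch)
##############################
gower_numeric <- function(df, i, j) {
  ranges <- sapply(df, function(col) diff(range(col))) # per-attribute range
  mean(abs(unlist(df[i, ]) - unlist(df[j, ])) / ranges)
}

# Distance between the first two states (Alabama and Alaska)
gower_numeric(USArrests[, c("Murder", "Assault", "Rape", "UrbanPop")], 1, 2)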

So for our dataset with n = 50 rows (states), we will have a 50×50 matrix with 1225 unique values (n*(n-1)/2). The diagonal is not needed (distance of a state with itself), and the upper triangle is the same as the lower (the matrix is symmetrical).

##############################
# Get distance
##############################
gower_dist <-
  USArrests %>%
  select(c(Murder, Assault, Rape, UrbanPop)) %>%
  daisy(metric = "gower")

gower_dist

Snapshot of Gower distance matrix [Image by the author]
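As a quick sanity check, you can convert the dist object returned by daisy into a full matrix and confirm the count of unique pairwise distances (a short sketch using the gower_dist object from above):

##############################
# Inspecting the distance matrix (sketch)
##############################
gower_mat <- as.matrix(gower_dist)
dim(gower_mat)      # 50 x 50
gower_mat[1:3, 1:3] # distances between the first three states

choose(50, 2) # 1225 unique pairwise distances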

2.3. Selecting the Number of Clusters

For our case study, we will use the PAM (Partitioning Around Medoids) clustering algorithm and, more specifically, the pam function (cluster package).

PAM (and all other k-medoid algorithms) is a robust alternative to k-means because it uses medoids (actual observations) as cluster centers instead of means. As a result, the algorithm is less sensitive to noise and outliers than k-means (but not immune!).

First, we need to select the number of clusters. The approach for determining the number of clusters is pretty straightforward. We will run the PAM algorithm using the Gower similarity matrix we computed in the previous step, and in each run, we will select a different number of clusters (from 2 to 10). Then, we will compute the Average Silhouette Width (ASW) for each run. We will do that using the function below.

##############################
# Function to compute ASW
##############################
get_asw_using_pam <- function(distance, min_clusters, max_clusters) {
  average_sil_width <- rep(NA_real_, max_clusters)
  for (i in min_clusters:max_clusters) {
    pam_fit <- pam(distance, diss = TRUE, k = i)
    average_sil_width[i] <- pam_fit$silinfo$avg.width
  }
  return(average_sil_width)
}

##############################
# Get ASW from each run
##############################
sil_width <- get_asw_using_pam(gower_dist, 2, 10)

The ASW represents how well each observation fits in its present cluster compared to the closest neighboring cluster. It ranges from -1 to 1, with higher values (closer to 1) indicating better clustering results. We will use the Silhouette Plot (ASW on the y-axis and number of clusters on the x-axis) to visually examine the promising candidates for our clustering.
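For reference, the silhouette of an observation i is s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other members of its own cluster and b(i) is the mean distance from i to the members of the nearest other cluster; the ASW is simply the mean of s(i) across all observations. If you want to inspect the per-observation values, they are stored in the object returned by pam (a brief sketch using a 2-cluster fit):

##############################
# Per-observation silhouette widths (sketch)
##############################
pam_fit <- pam(gower_dist, diss = TRUE, k = 2)
head(pam_fit$silinfo$widths)                # cluster, neighbor, sil_width per state
mean(pam_fit$silinfo$widths[, "sil_width"]) # equals pam_fit$silinfo$avg.width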

##############################
# Visualize Silhouette Plot
##############################
silhouette_plot <- data.frame(sil_width = sil_width)

silhouette_plot %>%
  mutate(number = 1:10) %>%
  filter(number > 1) %>%
  ggplot(aes(x = number, y = sil_width)) +
  geom_line(color = "turquoise3", size = 1) +
  geom_point(color = "darkgrey", fill = "black", size = 3) +
  scale_x_continuous(breaks = seq(2, 10, 1)) +
  ylab("Average Silhouette Width") +
  xlab("No of Clusters") +
  theme_classic()

Silhouette Plot using PAM and Gower distance [Image by the author]

The solution with the highest ASW is 2 clusters; after a steep decrease come 3 and 4 clusters, and then ASW drops abruptly again. Our candidates are therefore these three options (2, 3, and 4 clusters). Let’s save the clustering assignment from each run as three new attributes (cluster_2, cluster_3, and cluster_4) in the USArrests dataset.

## Set the seed of R's random number generator,
## which is useful for creating reproducible results,
## as in the case of clustering assignments
set.seed(5)

##############################
# Saving PAM results
##############################
pam_2 <- pam(gower_dist, diss = TRUE, k = 2)
pam_3 <- pam(gower_dist, diss = TRUE, k = 3)
pam_4 <- pam(gower_dist, diss = TRUE, k = 4)

##############################
# Adding assignment as columns
##############################
USArrests <-
  USArrests %>%
  mutate(
    cluster_2 = as.factor(pam_2$clustering),
    cluster_3 = as.factor(pam_3$clustering),
    cluster_4 = as.factor(pam_4$clustering)
  )
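Before describing the clusters, a quick cross-tabulation of the new columns shows how the solutions map onto each other (each cell counts states):

##############################
# Cross-tab of cluster assignments
##############################
table(USArrests$cluster_2, USArrests$cluster_3)
table(USArrests$cluster_3, USArrests$cluster_4)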

2.4. Describing Cluster Solutions

Because clustering is an unsupervised technique, it is important to remember that, in essence, it is exploratory data analysis (EDA) on steroids. So unlike supervised techniques, we cannot evaluate the accuracy of our clustering solution (there is no label in the dataset to use for comparison). Instead, the “goodness” of your cluster analysis will be evaluated based on a different criterion:

Are the resulting clusters separated in a way that reflects the aspects that the business stakeholders had in mind?

So the final decision is not simply a matter of picking the solution with the highest ASW. It is instead made by examining the characteristics of each clustering solution. You should always pick the solution that is most actionable for the business, rather than the one with the highest ASW.

First, we want to examine how the different clustering solutions are related. We will use an alluvial plot to visualize the flow of observations (states) between the three runs.

##############################
# Alluvial plot
##############################
aluv <-
  USArrests %>%
  group_by(cluster_2, cluster_3, cluster_4) %>%
  summarise(Freq = n(), .groups = "drop")

alluvial(
  aluv[, 1:3],
  freq = aluv$Freq,
  border = "lightgrey",
  gap.width = 0.2,
  alpha = 0.8,
  col =
    ifelse(
      aluv$cluster_4 == 1, "red",
      ifelse(aluv$cluster_4 == 2, "lightskyblue",
        ifelse(aluv$cluster_4 == 3 & aluv$cluster_3 == 2, "lightskyblue4",
          # The original repeated the previous condition here, which made the
          # "purple" branch unreachable; assuming it was meant to catch the
          # remaining cluster_4 == 3 flows
          ifelse(aluv$cluster_4 == 3, "purple",
            "orange")))),
  cex = 0.65
)

Alluvial Plot of cluster assignment for different runs of PAM [Image by the author]
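If you prefer ggplot2 graphics, the same diagram can be drawn with the ggalluvial package (a sketch of an alternative, not the plot shown above; geom_alluvium and geom_stratum are the package’s core layers):

##############################
# ggalluvial alternative (sketch)
##############################
library("ggplot2")
library("ggalluvial")

ggplot(aluv, aes(axis1 = cluster_2, axis2 = cluster_3, axis3 = cluster_4, y = Freq)) +
  geom_alluvium(aes(fill = cluster_4)) +
  geom_stratum(width = 0.3) +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("2 clusters", "3 clusters", "4 clusters")) +
  theme_classic()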

In the alluvial plot above, each column represents a different run, and the colored ribbons track blocks of states as they split or are reassigned across the three cluster runs.

For example, the red ribbon represents the states of Alabama, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, and Tennessee (see the code below). These states were assigned to the same cluster (“1”) as the states in the blue ribbon in the runs with two and three clusters (cluster_2 and cluster_3) but were split into their own cluster in the run with four clusters (cluster_4).

##################################################
# Filter for the red ribbon (cluster_4 = 1)
##################################################
USArrests %>% filter(cluster_4 == "1")

States in the red ribbon in the alluvial plot [Image by the author]
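One caveat: depending on your dplyr version, pipeline operations may drop a data.frame’s row names, which is where the state names live. A safer pattern (a sketch using rownames_to_column from the tibble package) is to move them into an explicit column first:

##################################################
# Keep state names as a column (sketch)
##################################################
USArrests %>%
  tibble::rownames_to_column("State") %>%
  filter(cluster_4 == "1") %>%
  select(State, Murder, Assault, Rape, UrbanPop)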

Next, we want to understand better the characteristics of each cluster and how they are different between the three solutions.

##################################################
# Summarizing (average) attributes, 2 clusters
##################################################
USArrests %>%
  group_by(cluster_2) %>%
  summarise(
    count = n(),
    mean_murder = mean(Murder),
    mean_assault = mean(Assault),
    mean_rape = mean(Rape),
    mean_urbanpop = mean(UrbanPop)
  )

##################################################
# Summarizing (average) attributes, 3 clusters
##################################################
USArrests %>%
  group_by(cluster_3) %>%
  summarise(
    count = n(),
    mean_murder = mean(Murder),
    mean_assault = mean(Assault),
    mean_rape = mean(Rape),
    mean_urbanpop = mean(UrbanPop)
  )

##################################################
# Summarizing (average) attributes, 4 clusters
##################################################
USArrests %>%
  group_by(cluster_4) %>%
  summarise(
    count = n(),
    mean_murder = mean(Murder),
    mean_assault = mean(Assault),
    mean_rape = mean(Rape),
    mean_urbanpop = mean(UrbanPop)
  )

Descriptive statistics by cluster run [Image by the author]
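Since the three blocks above differ only in the grouping column, they can be collapsed into one helper (a sketch assuming dplyr >= 1.0 for across and the {{ }} embrace operator):

##################################################
# One helper instead of three blocks (sketch)
##################################################
summarise_clusters <- function(df, cluster_col) {
  df %>%
    group_by({{ cluster_col }}) %>%
    summarise(
      count = n(),
      across(c(Murder, Assault, Rape, UrbanPop), mean, .names = "mean_{.col}")
    )
}

summarise_clusters(USArrests, cluster_2)
summarise_clusters(USArrests, cluster_3)
summarise_clusters(USArrests, cluster_4)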

Run with two clusters:

  • Cluster “1” has higher average arrests across all crimes
  • No observable difference in average urban population %

Run with three clusters:

  • The three clusters seem to be well separated
  • Compared to the high (cluster “1”) and low (cluster “2”) arrests from the run with two clusters, we now have high (cluster “1”), medium (cluster “2”), and low arrests (cluster “3”)
  • Cluster “3” also has a considerably lower average urban population %

Run with four clusters:

  • The separation is not as clear
  • The medium (cluster “2”) and low (cluster “3”) clusters from the run with three clusters remained unchanged (only renamed to “3” and “4” respectively)
  • The high (cluster “1”) from the run with three clusters was split into two clusters. Cluster “1” has a considerably lower average urban population %, with lower arrests for rape and assault

Depending on the need behind the cluster analysis, we would choose the appropriate clustering solution. Remember: not necessarily the one with the highest ASW, but the one that will be most impactful for the business stakeholders (probably a choice between 3 and 4 clusters).

