How to Perform KMeans Clustering Using Python | by Zoumana Keita | Jan, 2023


Image by Guillermo Ferla on Unsplash

Imagine that you are a Data Scientist working for a retail company, and your boss asks you to segment the customers into the following groups based on their spending behavior: low, average, medium, or platinum, for targeted marketing purposes and product recommendations.

Given that there are no historical labels associated with those customers, how is it possible to categorize them?

This is where clustering can help. It is an unsupervised machine-learning technique used to group unlabeled data into similar categories or clusters.

This conceptual article will focus more on the K-means clustering approach, one of the many techniques in unsupervised machine learning. It will start by providing an overview of what K-means clustering is, before walking you through a step-by-step implementation in Python using the popular Scikit-learn library.

The idea behind K-means clustering is to divide a dataset into a specified number of clusters (k), where all the points within the same cluster are similar to one another, and those in different clusters are different.

It starts by randomly assigning each data point to a cluster, then iteratively improves the clusters by reassigning each data point to the cluster center that is closest to it and updating the centers. This process continues until the cluster assignments stop changing, or a maximum number of iterations is reached.

What are the key steps of K-means clustering?

Below are the five main steps of the k-means algorithm:

Five main steps in K-Means Clustering (Image by Author)

Below we can see an illustration of K-means where the convergence is reached at the 14th iteration.

Convergence of k-means clustering algorithm (Image from Wikipedia)
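
To make these steps concrete, below is a minimal from-scratch sketch of the algorithm in NumPy. It is illustrative only (the function name kmeans and the random-initialization details are my own choices); the rest of this tutorial relies on Scikit-learn's optimized implementation instead.

import numpy as np

def kmeans(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)

    # Initialization: pick k distinct data points as the starting centers
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Assignment step: attach each point to its closest center
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Update step: move each center to the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Stop when the centers no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

For example, kmeans(np.random.rand(200, 2), k=3) returns a cluster label for each point along with the three final centers.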

Now that we have an understanding of how k-means works, let’s see how to implement it in Python.

To begin, you need to install the following libraries:

  • Pandas for loading the data frame.
  • Matplotlib for data visualizations.
  • Scikit-learn to use the KMeans algorithm.

The installation can be performed as follows using pip, the Python package manager:

# Scikit Learn
pip install scikit-learn

# Pandas
pip install pandas

# Matplotlib
pip install matplotlib

Import libraries and load the data

Now that you have an understanding of the K-means clustering algorithm, let’s dive in. We will be using the Mall Customer Data, freely available on Kaggle.

It contains the following basic information for each customer: ID, Gender, Age, Annual Income, and Spending Score.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# load the customer data into a DataFrame
customer_df = pd.read_csv('customer_data.csv')

# Check the first 5 rows
customer_df.head()

The previous .head() statement should generate the following result:

First 5 rows of the Customer data (Image by Author)

Explore the data

Let’s have a quick statistical and visual understanding of the data before any further implementation of the algorithm.

plt.scatter(customer_df["Age"], 
customer_df["Spending Score (1-100)"])

plt.xlabel("Age")
plt.ylabel("Spending Score (1-100)")

Scatter Plot of Customers’ Age and their Spending Score (Image By Author)
 plt.scatter(customer_df["Age"], 
customer_df["Annual Income (k$)"])

plt.xlabel("Age")
plt.ylabel("Annual Income (k$)")

Scatter Plot of Customers’ Age and their Annual Income (Image By Author)
plt.scatter(customer_df["Spending Score (1-100)"], 
customer_df["Annual Income (k$)"])

plt.xlabel("Spending Score (1-100)")
plt.ylabel("Annual Income (k$)")

Scatter Plot of Customers’ Spending Score and their Annual Income (Image By Author)
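
As a convenience, the three exploratory plots can also be drawn side by side in a single figure; a small sketch:

# Draw the three exploratory scatter plots side by side
pairs = [("Age", "Spending Score (1-100)"),
         ("Age", "Annual Income (k$)"),
         ("Spending Score (1-100)", "Annual Income (k$)")]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (x_col, y_col) in zip(axes, pairs):
    ax.scatter(customer_df[x_col], customer_df[y_col])
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)

plt.tight_layout()
plt.show()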

These plots tell different stories, leading to different interpretations. For instance, the first plot seems to suggest two different groups of customers, the second one shows no obvious structure, and from the last one it looks like there are five different groups. This is where K-means is helpful: it can generate those groups/clusters efficiently.

Also, we notice from the following result that there are no missing values in the data.

# Check for null values
customer_df.isnull().sum()
No null values in the data (Image by Author)

Get the relevant columns for clustering

Not all the columns are relevant for the clustering. In this example, we will use the numerical ones: Age, Annual Income, and Spending Score.

relevant_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]

customer_df = customer_df[relevant_cols]

Data Transformation

K-means is sensitive to the measurement units and scales of the data, so it is better to standardize the data first to tackle this issue. This is also a common practice prior to fitting many machine learning models.

Basically, standardization subtracts the mean of each feature from its values and divides the result by the feature’s standard deviation: z = (x − μ) / σ.

The process is straightforward and is done as follows in Python:

  • Use the StandardScaler class from the sklearn.preprocessing module.
  • Apply the fit() method to compute the mean and standard deviation of the features.
  • Then finally use the transform() method to scale the data.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Compute the mean and standard deviation of each feature
scaler.fit(customer_df)

# Center and scale the features
scaled_data = scaler.transform(customer_df)
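
Note that fit() and transform() can also be combined into a single fit_transform() call with the same result. As a quick sanity check, each column of the scaled data should now have a mean of roughly 0 and a standard deviation of roughly 1:

import numpy as np

# Equivalent one-liner: fit and transform in a single call
scaled_data = scaler.fit_transform(customer_df)

# Sanity check: per-column mean ~0 and standard deviation ~1
print(np.round(scaled_data.mean(axis=0), 6))
print(np.round(scaled_data.std(axis=0), 6))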

Determine the best number of clusters

A clustering model will not be relevant if we fail to identify the right number of clusters to consider. Multiple techniques exist in the literature; we are going to use the Elbow method, a heuristic that is one of the most widely used approaches to find the optimal number of clusters.

We define two helper functions below. The first one fits a KMeans model for each candidate value of K and stores the model’s inertia along with that K value.

The second one uses those inertias and K values to generate the final elbow plot.

def find_best_clusters(df, maximum_K):

    inertias = []
    k_values = []

    # Fit one KMeans model per candidate K (1 .. maximum_K - 1)
    for k in range(1, maximum_K):
        kmeans_model = KMeans(n_clusters=k)
        kmeans_model.fit(df)

        # Inertia: sum of squared distances of the points to their closest center
        inertias.append(kmeans_model.inertia_)
        k_values.append(k)

    return inertias, k_values

def generate_elbow_plot(inertias, k_values):

    plt.figure(figsize=(12, 6))
    plt.plot(k_values, inertias, 'o-', color='orange')
    plt.xlabel("Number of Clusters (K)")
    plt.ylabel("Cluster Inertia")
    plt.title("Elbow Plot of KMeans")
    plt.show()

Now, we can apply the functions to the dataset with maximum_K = 12 (which tests K values from 1 to 11) and get the final result.

inertias, k_values = find_best_clusters(scaled_data, 12)

generate_elbow_plot(inertias, k_values)

Below is the final result.

Graphic for finding the optimal number of clusters (Image by Author)

From the plot, we notice that the cluster inertia decreases as we increase the number of clusters. However, the drop in inertia becomes minimal after K=5, hence 5 can be considered the optimal number of clusters.
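
Since the elbow method is a heuristic, it can be worth cross-checking the choice of K with another criterion. A common complement is the silhouette score from sklearn.metrics (higher is better); here is a minimal sketch, reusing the scaled data from above:

from sklearn.metrics import silhouette_score

# The silhouette score is only defined for 2 or more clusters
for k in range(2, 12):
    labels = KMeans(n_clusters=k).fit_predict(scaled_data)
    print(f"K={k}: silhouette score = {silhouette_score(scaled_data, labels):.3f}")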

Create the final KMeans model

Once we have determined the optimal number of clusters, we can finally fit the KMeans model with that value as follows.

kmeans_model = KMeans(n_clusters = 5)

kmeans_model.fit(scaled_data)
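
Note that KMeans starts from a random initialization, so cluster numbering can change between runs; passing a fixed random_state to KMeans makes the results reproducible. Once fitted, the model can also assign a brand-new customer to a cluster with predict(), provided the new data goes through the same scaler. The values below are hypothetical:

# Hypothetical new customer: Age, Annual Income (k$), Spending Score (1-100)
new_customer = pd.DataFrame([[35, 60, 50]], columns=relevant_cols)

# Reuse the fitted scaler so the new data is on the same scale as the training data
new_customer_scaled = scaler.transform(new_customer)
print(kmeans_model.predict(new_customer_scaled))  # a cluster index between 0 and 4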

We can access the cluster to which each data point belongs by using the .labels_ attribute. Let’s create a new column corresponding to those values.

customer_df["clusters"] = kmeans_model.labels_

customer_df.head()

Final dataset after clustering (Image by Author)

By looking at the first 5 customers, we can observe that the first two and the last two have been assigned to the first cluster (cluster #1), whereas the third customer belongs to the third cluster (cluster #3).
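
With the labels in place, a quick way to characterize each segment is to average the features per cluster; here is a small profiling sketch:

# Average age, income, and spending score per cluster,
# plus the number of customers in each cluster
cluster_profile = customer_df.groupby("clusters").mean()
cluster_profile["count"] = customer_df["clusters"].value_counts()
print(cluster_profile)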

Visualize the clusters

Now that we have generated the clusters, the final step is to visualize them.

plt.scatter(customer_df["Spending Score (1-100)"], 
customer_df["Annual Income (k$)"],
c = customer_df["clusters"])
Clusters visualization (Image by Author)

The KMeans clustering seems to generate a pretty good result, and the five clusters are well separated from each other, even though there is a slight overlap between the purple and the yellow clusters.
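
To make the plot easier to read, the cluster centers can be overlaid as well. The fitted centers live in the standardized space, so they first have to be mapped back to the original units with the scaler’s inverse_transform; a sketch:

# Map the centers from standardized space back to the original units
centers = scaler.inverse_transform(kmeans_model.cluster_centers_)

plt.scatter(customer_df["Spending Score (1-100)"],
            customer_df["Annual Income (k$)"],
            c=customer_df["clusters"])

# Column order in relevant_cols: Age (0), Annual Income (1), Spending Score (2)
plt.scatter(centers[:, 2], centers[:, 1], c="red", marker="X", s=200)
plt.xlabel("Spending Score (1-100)")
plt.ylabel("Annual Income (k$)")
plt.show()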

The general observations are:

  • Customers at the top left have a low spending score and a high annual income. A good marketing strategy could be implemented to target those customers so that they spend more.
  • On the other hand, customers at the bottom left have a low annual income and also spend less, which makes sense, because they are adjusting their spending habits to their budget.
  • The top-right customers are similar to the bottom-left ones; the difference is that they have enough budget to spend.
  • Finally, the yellow group of customers spends beyond their budget.

Congrats, you have learned how to perform KMeans clustering using Python. I hope you’ve gained the required skills to efficiently analyze your unlabeled datasets.

If you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $5-a-month commitment, you unlock unlimited access to stories on Medium.

Feel free to follow me on my social networks. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!

The source code of the blog is available on GitHub.

