# How to Perform KMeans Clustering Using Python | by Zoumana Keita | Jan, 2023

## A complete overview of the KMeans clustering and implementation with Python

Imagine that you are a Data Scientist working for a retail company, and your boss asks you to segment the customers into the following groups: low, average, medium, or platinum, based on their spending behavior, for targeted marketing and product recommendations.

*Knowing that there is no such historical label associated with those customers, how is it possible to categorize them?*

This is where clustering can help. It is an unsupervised machine-learning technique used to group unlabeled data into similar categories or clusters.

This conceptual article focuses on the K-means clustering approach, one of the many techniques in unsupervised machine learning. It starts with an overview of what K-means clustering is, before walking you through a step-by-step implementation in Python using the popular `Scikit-learn` library.

The idea behind K-means clustering is to divide a dataset into a specified number of clusters (k), where all the points within the same cluster are similar to one another, and those in different clusters are different.

It starts by choosing k initial cluster centers (for example, at random), and then it iteratively improves the clusters by assigning each data point to the cluster center that is closest to it and recomputing each center as the mean of the points assigned to it. This logic continues until the cluster assignments stop changing, or a maximum number of iterations is reached.

## What are the key steps of K-means clustering?

Below are the five main steps of the k-means algorithm:

Below we can see an illustration of K-means where the convergence is reached at the 14th iteration.
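To make the iterative logic concrete, here is a minimal NumPy sketch of the assign-and-update loop described above. It is an illustration under simplifying assumptions (random data points as initial centers, Euclidean distance), not scikit-learn's implementation, which defaults to k-means++ initialization. The function name `kmeans_sketch` and the toy data are hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Minimal K-means loop for illustration only (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for iteration in range(1, max_iter + 1):
        # Assign each point to its closest center (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop as soon as the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each center to the mean of the points assigned to it
        # (an empty cluster simply keeps its previous center).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers, iteration

# Toy data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
labels, centers, n_iter = kmeans_sketch(X, k=2)
print("centers:", centers.round(1))
print("iterations:", n_iter)
```

On data this cleanly separated, the loop should stop after only a few iterations; real datasets usually need more.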

Now that we have an understanding of how k-means works, let’s see how to implement it in Python.

To begin, you need to install the following libraries:

- `Pandas` for loading the data frame.
- `Matplotlib` for data visualizations.
- `Scikit-learn` to use the `KMeans` algorithm.

The installation can be performed as follows using `pip`, the Python package manager:

```bash
# Scikit-learn
pip install scikit-learn

# Pandas
pip install pandas

# Matplotlib
pip install matplotlib
```

## Import libraries and load the data

Now that you have an understanding of the K-means clustering algorithm, let’s dive deep. We will be using the Mall Customer Data freely available on Kaggle.

It contains this basic information for each customer: `ID`, `Gender`, `Age`, `Annual Income`, and `Spending Score`.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load the customer data into a DataFrame
customer_df = pd.read_csv('customer_data.csv')

# Check the first 5 rows
customer_df.head()
```

The previous `.head()` statement should generate the following result:

## Explore the data

Let’s have a quick statistical and visual understanding of the data before any further implementation of the algorithm.

```python
# Age vs. spending score
plt.scatter(customer_df["Age"],
            customer_df["Spending Score (1-100)"])
plt.xlabel("Age")
plt.ylabel("Spending Score (1-100)")
plt.show()
```

```python
# Age vs. annual income
plt.scatter(customer_df["Age"],
            customer_df["Annual Income (k$)"])
plt.xlabel("Age")
plt.ylabel("Annual Income (k$)")
plt.show()
```

```python
# Spending score vs. annual income
plt.scatter(customer_df["Spending Score (1-100)"],
            customer_df["Annual Income (k$)"])
plt.xlabel("Spending Score (1-100)")
plt.ylabel("Annual Income (k$)")
plt.show()
```

These plots look different from one another and lead to different interpretations. For instance, the first plot suggests two groups of customers, the second shows no obvious grouping, and the last one looks like it contains five groups. This is where KMeans helps: it derives the groups/clusters from the data instead of from visual guesswork.

Also, we notice from the following result that there are no missing values in the data.

```python
# Check for null values
customer_df.isnull().sum()
```

## Get the relevant columns for clustering

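Not all the columns are relevant for the clustering. In this example, we will use the numerical ones: `Age`, `Annual Income`, and `Spending Score`.

```python
relevant_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]

# .copy() gives an independent DataFrame, so adding a column later
# does not trigger a pandas SettingWithCopyWarning
customer_df = customer_df[relevant_cols].copy()
```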

## Data Transformation

KMeans is sensitive to the measurement units and scales of the data, because it relies on distances between points. It is better to standardize the data first to tackle this issue. Standardization is also common practice before fitting many other machine learning models, particularly distance-based ones.

Basically, standardization subtracts the mean of each feature from the actual values of that feature and divides the result by the feature’s standard deviation.

The process is straightforward and is done as follows in Python:

- Use the `StandardScaler` class from the `sklearn.preprocessing` module.
- Apply the `fit()` method to compute the mean and standard deviation of the features.
- Then finally use the `transform()` method to scale the data.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(customer_df)
scaled_data = scaler.transform(customer_df)
```
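As a quick sanity check of what standardization does, the sketch below applies the subtract-the-mean, divide-by-the-standard-deviation rule by hand and compares the result with `StandardScaler`. The toy values are made up for illustration and merely stand in for the three customer columns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows standing in for Age, Annual Income (k$), Spending Score
toy = np.array([
    [19.0, 15.0, 39.0],
    [21.0, 15.0, 81.0],
    [35.0, 60.0, 50.0],
    [52.0, 120.0, 20.0],
])

# Manual z-score: subtract each column's mean, divide by its standard deviation.
# NumPy's default (ddof=0) is the population standard deviation,
# which is also what StandardScaler uses.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# Same transformation through scikit-learn
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the distance computation just because of its units.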

## Determine the best number of clusters

A clustering model will not be relevant if we fail to identify the correct number of clusters to consider. Multiple techniques exist in the literature. We are going to consider the Elbow method, a heuristic that is one of the most widely used ways to find the optimal number of clusters.

The first helper function creates, for each value of `K`, the corresponding `KMeans` model and saves its inertia along with the actual `K` value.

The second function uses those inertias and `K` values to generate the final Elbow plot.

```python
def find_best_clusters(df, maximum_K):
    # Despite its name, this list stores the inertia of each fitted model
    clusters_centers = []
    k_values = []

    # range() excludes its upper bound, so K runs from 1 to maximum_K - 1
    for k in range(1, maximum_K):
        kmeans_model = KMeans(n_clusters = k)
        kmeans_model.fit(df)

        clusters_centers.append(kmeans_model.inertia_)
        k_values.append(k)

    return clusters_centers, k_values
```

```python
def generate_elbow_plot(clusters_centers, k_values):
    figure = plt.subplots(figsize = (12, 6))
    plt.plot(k_values, clusters_centers, 'o-', color = 'orange')
    plt.xlabel("Number of Clusters (K)")
    plt.ylabel("Cluster Inertia")
    plt.title("Elbow Plot of KMeans")
    plt.show()
```

Now, we can apply the functions to the dataset, passing `12` as `maximum_K` (so `K` runs from 1 to 11, because `range` excludes its upper bound), and get the final result.

```python
clusters_centers, k_values = find_best_clusters(scaled_data, 12)
generate_elbow_plot(clusters_centers, k_values)
```

Below is the final result.

From the plot, we notice that the cluster inertia decreases as we increase the number of clusters. Also, the drop in inertia is minimal after `K=5`, hence `5` can be considered the optimal number of clusters.
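For readers wondering what the inertia on the y-axis measures: it is the sum of squared distances between each point and the center of the cluster it was assigned to. The sketch below recomputes it by hand on synthetic data (the blobs, `n_init=10`, and `random_state=0` are assumptions chosen for the illustration) and compares it with the model's `inertia_` attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled customer data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(40, 2)),
    rng.normal(loc=6.0, scale=1.0, size=(40, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up, for every point, the center of the cluster it belongs to
assigned_centers = model.cluster_centers_[model.labels_]

# Inertia = sum of squared distances to those assigned centers
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(np.isclose(manual_inertia, model.inertia_))
```

Because adding centers can only bring points closer to their nearest center, inertia keeps shrinking as `K` grows, which is why the Elbow method looks for the point where the improvement flattens rather than for the minimum.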

## Create the final KMeans model

Once we have determined the optimal number of clusters, we can finally fit the KMeans model with that value as follows.

```python
kmeans_model = KMeans(n_clusters = 5)
kmeans_model.fit(scaled_data)
```

We can access the cluster to which each data point belongs by using the `.labels_` attribute. Let’s create a new column corresponding to those values.

```python
customer_df["clusters"] = kmeans_model.labels_
customer_df.head()
```

By looking at the first 5 customers, we can observe that the first two and last two have been assigned to the first cluster (cluster #1), whereas the third customer belongs to the third cluster (cluster #3).
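Keep in mind that the cluster numbers are arbitrary integer labels starting at 0, and they can change from one run to the next because the initial centers are chosen randomly. If you want the same assignments every time, one option is to fix the `random_state` parameter. The sketch below uses random stand-in data and an arbitrary seed of `42`, both assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for the three scaled features
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))

# Two independent fits that share the same seed
labels_a = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

# With an identical seed, both runs produce identical labels
print(np.array_equal(labels_a, labels_b))

# Labels are integers in the range 0 .. n_clusters - 1
print(sorted(set(labels_a.tolist())))
```

Here `fit_predict()` is a shortcut that fits the model and returns the same values that `.labels_` holds afterwards.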

## Visualize the clusters

Now that we have generated the clusters, the final step is to visualize them.

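```python
# Color each customer by the cluster it was assigned to
plt.scatter(customer_df["Spending Score (1-100)"],
            customer_df["Annual Income (k$)"],
            c = customer_df["clusters"])
plt.xlabel("Spending Score (1-100)")
plt.ylabel("Annual Income (k$)")
```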

The KMeans clustering seems to generate a pretty good result, and the five clusters are well separated from each other, even though there is a slight overlap between the purple and the yellow clusters.

The general observations are:

- Customers on the top left have a low spending score and a high annual income. A good marketing strategy could be implemented to target those customers so that they spend more.
- On the other hand, customers on the bottom left have a low annual income and also spend less, which makes sense, because they are trying to adjust their spending habits to their budget.
- The top right customers are similar to the bottom left ones; the difference is that they have enough budget to spend.
- Finally, the yellow group of customers spends beyond their budget.
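Visual observations like these can be backed up with numbers by averaging each feature per cluster. The sketch below shows one way to do that with `groupby`. It runs on randomly generated stand-in data (the `toy_df` frame and its values are hypothetical), so the printed figures say nothing about the real mall customers.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data reusing the article's column names
rng = np.random.default_rng(7)
toy_df = pd.DataFrame({
    "Annual Income (k$)": rng.integers(15, 140, size=200),
    "Spending Score (1-100)": rng.integers(1, 101, size=200),
})

# Same pipeline as in the article: scale, fit, attach the labels
scaled = StandardScaler().fit_transform(toy_df)
model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scaled)
toy_df["clusters"] = model.labels_

# Average income and spending score per cluster, in the original units
summary = toy_df.groupby("clusters").mean().round(1)

# Number of customers in each cluster
summary["size"] = toy_df["clusters"].value_counts().sort_index()

print(summary)
```

Reading the averages in the original units (k$ and score points) rather than in standardized values makes it easier to describe each segment to a non-technical audience.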

Congrats, you have learned how to perform KMeans clustering using Python. I hope you’ve gained the required skills to efficiently analyze your unlabeled datasets.

If you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $5-a-month commitment, you unlock unlimited access to stories on Medium.

Feel free to follow me on my social networks. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!

The source code of the blog is available on GitHub.
