
Customer Segmentation with k-means and PCA in Python



A step-by-step guide to implementing k-means clustering with PCA

Photo by Carlos Muza on Unsplash

So far in our journey towards implementing the STP marketing framework (Segmentation, Targeting, and Positioning) we’ve done the following:

In the first article, we’ve
❏ defined our goal and learned the basics
❏ set up the Deepnote development environment
❏ explored different attributes/features of the customer dataset that we have been working with
❏ processed and standardized our dataset
❏ segmented the customers using hierarchical clustering

Then, in the second article, we’ve
❏ learned about the k-means clustering algorithm (flat clustering) and the elbow method for identifying the optimal number of clusters in a dataset
❏ segmented the customer dataset with k-means clustering and named our customer segments
❏ analyzed the segmentation outcome

As we saw in the second post, our clustering method couldn’t clearly distinguish the different customer segments. In this post, we’ll try to improve on that using dimensionality reduction.

Principal Component Analysis (PCA)

When multiple features in a dataset are highly correlated, the redundant information they carry can skew the outcome of a model. That’s what happened with our k-means model. This is known as the multicollinearity problem, and we can address it by reducing the dimensionality of the dataset.

Learn more about multicollinearity.

The correlation matrix in the first article showed that Age and Education are correlated, and that Income and Occupation are also correlated. We will tackle this using Principal Component Analysis (PCA), a dimensionality reduction method.

Dimensionality reduction is the process of reducing the number of attributes in a dataset while retaining meaningful properties of the original data. As the saying goes, “sometimes less is more”. Dimensionality reduction is not exactly that, but close 😛

Learn more about dimensionality reduction.

PCA transforms a set of p correlated variables into a smaller number k (k < p) of uncorrelated variables called principal components, while keeping as much of the variation in the original dataset as possible. It is performed as a data pre-processing step on the standardized dataset.
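For intuition, here’s a tiny numpy sketch of that idea (the random matrix merely stands in for a standardized dataset; our actual implementation below uses sklearn):

```python
import numpy as np

X = np.random.randn(100, 7)             # stand-in for 7 standardized features

cov = np.cov(X, rowvar=False)           # 7 x 7 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigendecomposition (ascending order)

k = 3                                   # number of components to keep
order = np.argsort(eigvals)[::-1]       # sort components by explained variance
W = eigvecs[:, order[:k]]               # p x k projection matrix

Z = X @ W                               # 100 x k uncorrelated components
```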

Learn more about PCA.

Identifying principal components

First things first: let’s import the PCA class from sklearn and create our pca object from the standardized customer dataset.
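A minimal sketch of this step, assuming the standardized features from the earlier articles are available as df_std (the variable name is illustrative):

```python
from sklearn.decomposition import PCA

pca = PCA()   # keep all seven components for now
pca.fit(df_std)

# Share of the total variance explained by each component
print(pca.explained_variance_ratio_)
```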

The “explained_variance_ratio_” attribute of the pca object holds one value per component; together, the seven components explain 100% of the variability of our dataset. The first component explains about 36% of the variability, and the second and third components explain 26% and 19%, respectively.

The rule of thumb is to pick the number of components that retain 70–80% of the variability. If we select the top three components, they already hold more than 80% of the variability; if we pick four, they retain almost 90%. Let’s pick three components and fit our pca model.
Then we create a dataframe of the three principal components, using the columns from our original dataset. Notice that all values in the dataframe are between negative one and one, as they are essentially correlations.
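A sketch of those two steps; the feature names here are assumed from the attributes explored in the first article:

```python
import pandas as pd

pca = PCA(n_components=3)   # keep only the top three components
pca.fit(df_std)

# Assumed column names, matching the customer attributes in this series
features = ['Sex', 'Marital status', 'Age', 'Education',
            'Income', 'Occupation', 'Settlement size']

# Loadings of each original feature on the three components
df_pca_comp = pd.DataFrame(
    pca.components_,
    columns=features,
    index=['Component 1', 'Component 2', 'Component 3'],
)
```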

Now, let’s have a look at the new correlation matrix.
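One way to render it as a heatmap, assuming seaborn is available:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(df_pca_comp, vmin=-1, vmax=1, cmap='RdBu', annot=True)
plt.show()
```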

Component one correlates positively with age, income, occupation, and settlement size. These features all relate to a person’s career.

On the other hand, sex, marital status, and education are the most prominent determinants of the second component. We can also see that all the career-related features are essentially uncorrelated with this component. It therefore reflects not an individual’s profession but rather their education and lifestyle.

For the third component, we observe that age, marital status, and occupation are the most prominent determinants. Marital status and occupation weigh negatively but are still important.

Implementing K-Means Clustering with PCA

OK, now we have an idea of what our new variables, or components, represent. Let’s implement k-means clustering using the three components as features. We’ll skip the elbow method, as we already covered it in the second article, and jump straight to running the k-means algorithm with four clusters. The notebook contains the detailed implementation for your reference.
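A sketch of this step: we project the standardized data onto the three components and cluster the resulting scores (the random_state is illustrative, for reproducibility):

```python
from sklearn.cluster import KMeans

# Project the standardized data onto the three principal components
scores_pca = pca.transform(df_std)

# Four clusters, per the elbow method from the second article
kmeans_pca = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans_pca.fit(scores_pca)
```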

Analyzing segmentation results

Previously, we established that component one represents career, component two represents education and lifestyle, and component three represents life or work experience.

Now let’s analyze the segmentation results and try to label the segments as before; a sketch of the label mapping follows the list below.

Segment 0: low career and experience values with high education and lifestyle values
Label: Standard
Segment 1: high career but low education, lifestyle, and experience
Label: Career-focused
Segment 2: low career, education, and lifestyle, but high life experience
Label: Fewer opportunities
Segment 3: high career, education, and lifestyle, as well as high life experience
Label: Well-off
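One way to attach these labels, assuming df holds the customer dataframe from the earlier articles and reusing scores_pca and kmeans_pca from the sketches above (note that cluster numbering can vary between k-means runs):

```python
df_segm = df.copy()
df_segm[['Component 1', 'Component 2', 'Component 3']] = scores_pca
df_segm['Segment'] = kmeans_pca.labels_

# Map cluster numbers to the segment labels derived above
df_segm['Label'] = df_segm['Segment'].map({
    0: 'standard',
    1: 'career focused',
    2: 'fewer opportunities',
    3: 'well-off',
})
```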

Let’s take a look at the number of customers per segment:
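Continuing the sketch:

```python
# Number of customers in each segment
print(df_segm['Label'].value_counts())
```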

Now let’s visualize the segments with respect to the first two components.
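A sketch of the plot, reusing df_segm from above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Customers in the plane of the first two components, colored by segment
plt.figure(figsize=(10, 8))
sns.scatterplot(data=df_segm, x='Component 2', y='Component 1', hue='Label')
plt.title('Customer segments by PCA components')
plt.show()
```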

For comparison, here is the scatter plot from the raw k-means implementation without PCA (article 2 of this series).

Image by Author

As you can see, the four segments are now distinctly identifiable. Though the standard and fewer-opportunities segments overlap somewhat, the overall result is far better than the previous outcome.

Conclusion

So far, we have divided our customers into four distinct, clearly identifiable groups. With this, we have completed the “segmentation” part of the STP framework. Since “targeting” mostly involves business decisions about which customer segment to focus on, we’ll move straight on to “positioning” in the next article.

As always, the entire code and all the supporting datasets are available in the Deepnote notebook.

