
Customer Segmentation with k-means and PCA in Python



A step-by-step guide to implementing k-means clustering with PCA

Photo by Carlos Muza on Unsplash

So far in our journey towards implementing the STP marketing framework (Segmentation, Targeting, and Positioning) we’ve done the following:

In the first article, we’ve
❏ defined our goal and learned the basics
❏ set up the Deepnote development environment
❏ explored different attributes/features of the customer dataset that we have been working with
❏ processed and standardized our dataset
❏ segmented the customers using hierarchical clustering

Then, in the second article, we’ve
❏ learned about the k-means clustering algorithm (flat clustering) and the elbow method for identifying the optimal number of clusters in a dataset
❏ segmented the customer dataset with k-means clustering and named our customer segments
❏ analyzed the segmentation outcome

As we saw in the second post, our clustering method couldn’t clearly distinguish the different customer segments. In this post, we’ll try to improve on that using dimensionality reduction.

Principal Component Analysis (PCA)

When multiple features in a dataset are highly correlated, the redundant information they carry can skew the outcome of a model. That’s what happened with our k-means model. This is known as the multicollinearity problem, and we can address it by reducing the dimensionality of the dataset.

Learn more about multicollinearity.

The correlation matrix in the first article showed that Age and Education are correlated, and that Income and Occupation are also correlated. We will tackle this using Principal Component Analysis (PCA), a dimensionality reduction method.

Dimensionality reduction is the process of reducing the number of attributes in a dataset while retaining meaningful properties of the original data. As the saying goes, “sometimes less is more”. Dimensionality reduction is not exactly that, but close 😛

Learn more about dimensionality reduction.

PCA transforms a set of p correlated variables into a smaller number k (k < p) of uncorrelated variables called principal components, while keeping as much of the variation in the original dataset as possible. It is performed as a data pre-processing step on the standardized dataset.
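For intuition, here’s a tiny numpy sketch of that idea (the random matrix merely stands in for a standardized dataset; our actual implementation below uses sklearn):

```python
import numpy as np

X = np.random.randn(100, 7)             # stand-in for 7 standardized features

cov = np.cov(X, rowvar=False)           # 7 x 7 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigendecomposition (ascending order)

k = 3                                   # number of components to keep
order = np.argsort(eigvals)[::-1]       # sort components by explained variance
W = eigvecs[:, order[:k]]               # p x k projection matrix

Z = X @ W                               # 100 x k uncorrelated components
```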

Learn more about PCA.

Identifying principal components

First things first: let’s import the PCA class from sklearn and create our pca object from the standardized customer dataset.
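A minimal sketch of this step, assuming the standardized features from the earlier articles are available as df_std (the variable name is illustrative):

```python
from sklearn.decomposition import PCA

pca = PCA()   # keep all seven components for now
pca.fit(df_std)

# Share of the total variance explained by each component
print(pca.explained_variance_ratio_)
```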

The “explained_variance_ratio_” attribute of the pca object holds one value per component; together, the seven components explain 100% of the variability of our dataset. The first component explains about 36% of the variability, and the second and third components explain 26% and 19%, respectively.

The rule of thumb is to pick the number of components that retain 70–80% of the variability. If we select the top three components, they already hold more than 80% of the variability; if we pick four, they retain almost 90%. Let’s pick three components and fit our pca model.
Then we create a dataframe of the three principal components, using the columns from our original dataset. Notice that all values in the dataframe are between negative one and one, as they are essentially correlations.
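A sketch of those two steps; the feature names here are assumed from the attributes explored in the first article:

```python
import pandas as pd

pca = PCA(n_components=3)   # keep only the top three components
pca.fit(df_std)

# Assumed column names, matching the customer attributes in this series
features = ['Sex', 'Marital status', 'Age', 'Education',
            'Income', 'Occupation', 'Settlement size']

# Loadings of each original feature on the three components
df_pca_comp = pd.DataFrame(
    pca.components_,
    columns=features,
    index=['Component 1', 'Component 2', 'Component 3'],
)
```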

Now, let’s have a look at the new correlation matrix.
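One way to render it as a heatmap, assuming seaborn is available:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(df_pca_comp, vmin=-1, vmax=1, cmap='RdBu', annot=True)
plt.show()
```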

Component one correlates positively with age, income, occupation, and settlement size. These features all relate to a person’s career.

On the other hand, sex, marital status, and education are the most prominent determinants of the second component. We can also see that all the career-related features are essentially uncorrelated with this component. It therefore reflects not an individual’s profession but rather their education and lifestyle.

For the third component, we observe that age, marital status, and occupation are the most prominent determinants. Marital status and occupation weigh negatively but are still important.

Implementing K-Means Clustering with PCA

OK, now we have an idea of what our new variables, or components, represent. Let’s implement k-means clustering using the three components as features. We’ll skip the elbow method, as we already covered it in the second article, and jump straight to running the k-means algorithm with four clusters. The notebook contains the detailed implementation for your reference.
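A sketch of this step: we project the standardized data onto the three components and cluster the resulting scores (the random_state is illustrative, for reproducibility):

```python
from sklearn.cluster import KMeans

# Project the standardized data onto the three principal components
scores_pca = pca.transform(df_std)

# Four clusters, per the elbow method from the second article
kmeans_pca = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans_pca.fit(scores_pca)
```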

Analyzing segmentation results

Previously, we established that component one represents career, component two represents education and lifestyle, and component three represents life or work experience.

Now let’s analyze the segmentation results and try to label the segments as before; a sketch of the label mapping follows the list below.

Segment 0: low career and experience values with high education and lifestyle values
Label: Standard
Segment 1: high career but low education, lifestyle, and experience
Label: Career-focused
Segment 2: low career, education, and lifestyle, but high life experience
Label: Fewer opportunities
Segment 3: high career, education, and lifestyle, as well as high life experience
Label: Well-off
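One way to attach these labels, assuming df holds the customer dataframe from the earlier articles and reusing scores_pca and kmeans_pca from the sketches above (note that cluster numbering can vary between k-means runs):

```python
df_segm = df.copy()
df_segm[['Component 1', 'Component 2', 'Component 3']] = scores_pca
df_segm['Segment'] = kmeans_pca.labels_

# Map cluster numbers to the segment labels derived above
df_segm['Label'] = df_segm['Segment'].map({
    0: 'standard',
    1: 'career focused',
    2: 'fewer opportunities',
    3: 'well-off',
})
```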

Let’s take a look at the number of customers per segment:
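Continuing the sketch:

```python
# Number of customers in each segment
print(df_segm['Label'].value_counts())
```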

Now let’s visualize the segments with respect to the first two components.
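A sketch of the plot, reusing df_segm from above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Customers in the plane of the first two components, colored by segment
plt.figure(figsize=(10, 8))
sns.scatterplot(data=df_segm, x='Component 2', y='Component 1', hue='Label')
plt.title('Customer segments by PCA components')
plt.show()
```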

For comparison, here is the scatter plot from the raw k-means implementation without PCA (article 2 of this series).

Image by Author

As you can see, the four segments are now distinctly identifiable. Though the standard and fewer-opportunities segments overlap somewhat, the overall result is far better than the previous outcome.

Conclusion

So far, we have divided our customers into four distinct, clearly identifiable groups. With this, we have completed the “segmentation” part of the STP framework. Since “targeting” mostly involves business decisions about which customer segment to focus on, we’ll move straight on to “positioning” in the next article.

As always, the entire code and all the supporting datasets are available in the Deepnote notebook.

