From Clusters To Insights; The Next Step | by Erdogan Taskesen | May, 2023


For this use case, we will load the online shoppers’ intentions data set and go through the steps of preprocessing, clustering, evaluation, and then determining the features that are significantly associated with the cluster labels. The data set contains a total of 12,330 samples with 18 features. This mixed data set requires a few more pre-processing steps to make sure that all variables have similar types or units of measurement. Thus, the first step is to create a homogeneous data set with comparable units. A common approach is discretizing the variables and creating a one-hot matrix. I will use the df2onehot library with the following pre-processing steps to discretize:

  • Categorical values 0, None, ? and False are removed.
  • One-hot features with fewer than 50 positive values are removed.
  • For features with two categories, only one is kept.
  • Features with 80% unique values or more are considered to be numeric.

The pre-processing step converted the data set into a one-hot matrix containing the same 12,330 samples but now with 121 one-hot features. Notably, the above-mentioned criteria are not a gold standard but should be explored for each use case. For clustering, we will use agglomerative clustering with the Hamming distance and complete linkage. See the code section below.

# Install libraries
pip install clusteval df2onehot

# Import libraries
from clusteval import clusteval
from df2onehot import df2onehot

# Load data from UCI
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00468/online_shoppers_intention.csv'

# Initialize clusteval
ce = clusteval()
# Import data from url
df = ce.import_example(url=url)

# Preprocessing
cols_as_float = ['ProductRelated', 'Administrative']
df[cols_as_float]=df[cols_as_float].astype(float)
dfhot = df2onehot(df, excl_background=['0.0', 'None', '?', 'False'], y_min=50, perc_min_num=0.8, remove_mutual_exclusive=True, verbose=4)['onehot']

# Initialize using the specific parameters
ce = clusteval(evaluate='silhouette',
               cluster='agglomerative',
               metric='hamming',
               linkage='complete',
               min_clust=2,
               verbose='info')

# Clustering and evaluation
results = ce.fit(dfhot)

# [clusteval] >INFO> Saving data in memory.
# [clusteval] >INFO> Fit with method=[agglomerative], metric=[hamming], linkage=[complete]
# [clusteval] >INFO> Evaluate using silhouette.
# [clusteval] >INFO: 100%|██████████| 23/23 [00:28<00:00, 1.23s/it]
# [clusteval] >INFO> Compute dendrogram threshold.
# [clusteval] >INFO> Optimal number clusters detected: [9].
# [clusteval] >INFO> Fin.

After running clusteval on the data set, it returns 9 clusters. Because the data contains 121 dimensions (the features), we cannot directly inspect the clusters visually in a scatterplot. However, we can perform an embedding and then visually inspect the data using a scatterplot, as shown in the code section below. The embedding is automatically performed when specifying embedding='tsne'.

# Plot the Silhouette and show the scatterplot using tSNE
ce.plot_silhouette(embedding='tsne')
Figure 1. Left panel: Silhouette score plot with the detected clusters and labels. Right panel: scatterplot where samples are colored on the cluster labels. The colors and cluster labels are matching between the two panels. Image by the author.

The results in Figure 1 (right panel) depict the scatterplot after a t-SNE embedding, where the samples are colored by their cluster labels. The left panel shows the Silhouette plot, where we can visually assess the quality of the clustering results, such as the homogeneity and separation of the clusters, and the optimal number of clusters detected by the clustering algorithm.

Moreover, the Silhouette score ranges from -1 to 1 (x-axis), where a score close to 1 indicates that data points within a cluster are very similar to each other and dissimilar to points in other clusters. Clusters 0, 2, 3, and 5 appear to be well-separated clusters. A Silhouette score close to 0 indicates overlapping clusters, or that the data points are equally similar to their own cluster and to neighboring clusters. A score close to -1 suggests that data points are more similar to points in neighboring clusters than to their own cluster.

The width of the bars represents the size of each cluster: wider bars indicate larger clusters with more data points, while narrower bars indicate smaller clusters with fewer data points. The dashed red line (close to 0 in our case) represents the average Silhouette score over all data points and serves as a reference to assess the overall quality of the clustering. Clusters with average Silhouette scores above the dashed line are considered well separated, while clusters with scores below the dashed line may indicate poor clustering. In general, a good clustering should have Silhouette scores close to 1, indicating well-separated clusters. However, be aware that we have clustered the data in the high-dimensional space and are evaluating the results visually after a t-SNE embedding into a low, 2-dimensional space. The projection can give a distorted view of the reality.
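
To make this interpretation concrete, the code below sketches how the per-sample Silhouette scores behind such a plot can be computed with scikit-learn, using the same Hamming distance as in the clustering step. This snippet is illustrative and assumes that the cluster labels from the fit are available in results['labx'].

# Illustrative sketch (assumption: results['labx'] contains the cluster labels)
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

labels = np.asarray(results['labx'])

# Per-sample Silhouette scores with the same Hamming distance used for clustering
sample_scores = silhouette_samples(dfhot.values, labels, metric='hamming')

# The average score corresponds to the dashed red line in the Silhouette plot
print(silhouette_score(dfhot.values, labels, metric='hamming'))

# Average score per cluster: higher values indicate a better-separated cluster
for label in np.unique(labels):
    print(label, sample_scores[labels == label].mean())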

Alternatively, we can also do the embedding first and then cluster the data in the low-dimensional space (see the code section below). Now we will use the Euclidean distance metric because our input data are not one-hot anymore but the coordinates from the t-SNE mapping. After fitting, we detect an optimal number of 27 clusters, which is a lot more than in our previous results. We can also see that the cluster evaluation scores (Figure 2) fluctuate strongly. This has to do with the structure of the data and whether an optimal clustering can be formed at all.

# Compute the 2-dimensional t-SNE embedding of the one-hot matrix
from sklearn.manifold import TSNE
xycoord = TSNE(n_components=2, init='random', perplexity=30).fit_transform(dfhot.values)

# Initialize clusteval
ce = clusteval(cluster='agglomerative', metric='euclidean', linkage='complete', min_clust=5, max_clust=30)

# Clustering and evaluation
results = ce.fit(xycoord)

# Make plots
ce.plot()
ce.plot_silhouette()

Figure 2. Cluster evaluation scores (higher is better).
Figure 3. Left panel: Silhouette score plot with the detected clusters and labels. Right panel: scatterplot where samples are colored on the cluster labels. The colors and cluster labels are matching between the two panels. Image by the author.

The Silhouette plot now shows better results than previously, indicating that clusters are better separated. In the next section, we will detect which features are significantly associated with the cluster labels.

After determining the optimal number of clusters comes the challenging step: understanding which features drive the formation of the clusters.
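
The next section addresses this step in detail. As a rough illustration of the idea (and not the method used in this article), the sketch below tests per cluster whether each one-hot feature occurs more often inside the cluster than outside, using Fisher’s exact test. Again, it assumes that the cluster labels are available in results['labx'].

# Illustrative sketch (assumption: results['labx'] contains the cluster labels)
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

labels = np.asarray(results['labx'])
records = []
for cluster in np.unique(labels):
    in_cluster = labels == cluster
    for col in dfhot.columns:
        feature = dfhot[col].values.astype(bool)
        # 2x2 contingency table: feature present/absent vs. inside/outside the cluster
        table = [[np.sum(feature & in_cluster), np.sum(feature & ~in_cluster)],
                 [np.sum(~feature & in_cluster), np.sum(~feature & ~in_cluster)]]
        _, pvalue = fisher_exact(table, alternative='greater')
        records.append({'cluster': cluster, 'feature': col, 'pvalue': pvalue})

# The smallest P-values point at features that are over-represented within a cluster
enrichment = pd.DataFrame(records).sort_values('pvalue')
print(enrichment.head(10))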


