
Unsupervised Learning: K-Means Clustering | by Brendan Artley | Jun, 2022



The fastest and most intuitive unsupervised clustering algorithm.

Clusters Image — By Author

In this article, we will go through the k-means clustering algorithm. We will first look at how the algorithm works, then implement it from scratch using NumPy, and finally implement it using Scikit-learn.

K-means clustering is an unsupervised algorithm that groups unlabelled data into different clusters. The K in its name is the number of clusters that will be created, and it must be chosen before training. For example, if K=4 then 4 clusters are created, and if K=7 then 7 clusters are created. The k-means algorithm is used in applications such as fraud detection, error detection, and confirming the existence of suspected clusters in real-world data.

The algorithm is centroid-based, meaning that each data point is assigned to the cluster with the closest centroid. Because distances to centroids are measured with the Euclidean distance, the algorithm works in any number of dimensions. More on this in the next section.

The benefits of the k-means algorithm are that it is easy to implement, it scales to large datasets, it is guaranteed to converge, and, with suitable generalizations, it can adapt to clusters of varying shapes and sizes. Some disadvantages are that the number of clusters must be chosen manually, the resulting clusters depend on the initial centroid values, and the algorithm is sensitive to outliers.

K-means Steps

The training of a k-means model can be broken down into five steps.

Step 1: Determine the number of clusters (K=?)

It is best if K is known before model training, but if not, there are strategies for finding it. The most common is the elbow method, which plots the sum of squared distances between points and their assigned centroids (the inertia) as K increases. On this plot, there is usually a point beyond which adding another cluster provides only a minimal reduction in the error. This point is known as the elbow and is a good choice for K, as shown in the following graphic and in the code sketch below.

Elbow Method Graphic — By Author
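
To make the elbow method concrete, here is a minimal sketch using Scikit-learn's KMeans and its inertia_ attribute (the data and parameter values are illustrative, not from the original notebook):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate illustrative sample data (4 true clusters)
data, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit k-means for a range of K and record the inertia
# (sum of squared distances to the closest centroid)
inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(data)
    inertias.append(model.inertia_)

# Plot inertia vs. K; the "elbow" suggests a good value for K
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Sum of squared distances (inertia)")
plt.show()
```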

Step 2: Initialize cluster centroids

The next step is to initialize K centroids as the centers of the clusters. The most common initialization strategy is called Forgy initialization, in which the centroid of each cluster is initialized as a random data point from the dataset. This tends to converge more quickly than purely random initialization, as the initial centroids are guaranteed to lie near actual data points.

More sophisticated strategies such as k-means++ initialization aim to choose initial centroids that are as far from each other as possible. This strategy generally produces better and more consistent results than random initialization, and is illustrated in the sketch below. You can see the mathematics behind k-means++ here.
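
To make the two strategies concrete, here is a small NumPy sketch of Forgy initialization and the k-means++ seeding rule (the function names are illustrative):

```python
import numpy as np

def forgy_init(data, k, rng):
    # Forgy: pick K distinct data points at random as the initial centroids
    idx = rng.choice(len(data), size=k, replace=False)
    return data[idx]

def kmeans_pp_init(data, k, rng):
    # k-means++: pick the first centroid at random, then pick each
    # subsequent centroid with probability proportional to the squared
    # distance to the nearest centroid chosen so far
    centroids = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        d2 = np.min(
            [((data - c) ** 2).sum(axis=1) for c in centroids], axis=0
        )
        probs = d2 / d2.sum()
        centroids.append(data[rng.choice(len(data), p=probs)])
    return np.array(centroids)
```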

Step 3: Assign data points to clusters

After initializing the K centroids, each data point is assigned to a cluster. This is done by iterating over all the points in the data and calculating the Euclidean distance to each centroid. The Euclidean distance formula works for any two points in n-dimensional Euclidean space, so a data point with any number of dimensions can be assigned to its closest cluster.

Euclidean Distance Formula — By Author
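
Written out, the formula in the graphic is the standard Euclidean distance between two points $p$ and $q$ in $n$-dimensional space:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$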

Step 4: Update cluster centroids

Once every data point has been assigned to a cluster, we can update the centroid of each cluster. This is done by taking the mean of all the data points in the cluster and using the result as the new cluster center.
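
In NumPy, this update is a single mean along the sample axis. A minimal sketch, assuming points holds the data points currently assigned to one cluster:

```python
import numpy as np

# points: (m, n) array of the m data points assigned to one cluster
points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# The new centroid is the per-dimension mean of the assigned points
new_centroid = points.mean(axis=0)  # -> array([3., 4.])
```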

Step 5: Iteratively Update

Then, using the newly calculated centroids, we go through all the data points again, re-assign them to clusters, and re-calculate the centroids. This process repeats until the centroids no longer change (or change by less than a small tolerance).

The beauty of the k-means algorithm is that it is guaranteed to converge. This is both a blessing and a curse, as the model may converge to a local minimum rather than the global minimum. This idea is illustrated in the following section, where we implement the algorithm using NumPy, followed by the implementation in Scikit-learn.

First, we will import the necessary Python packages and create a 2-dimensional dataset using Scikit-learn's make_blobs function. For this article, we generate 300 data points distributed amongst 4 clusters. The generated data is shown below.
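
A sketch of this step, assuming illustrative parameter values for make_blobs:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate 300 2-dimensional points spread across 4 clusters
data, true_labels = make_blobs(
    n_samples=300, centers=4, n_features=2, random_state=42
)

# Plot the raw (unlabelled) data
plt.scatter(data[:, 0], data[:, 1], s=15)
plt.show()
```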

Data Sample — By Author

We can see that there are 4 distinct clusters in the sample, with a slight overlap between outlying data points in neighbouring clusters. Next, we initialize two dictionaries (hash maps): one keeps track of the cluster centroids, and the other keeps track of the data points in each cluster.
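
A sketch of that setup, with Forgy initialization of the centroids (continuing from the data generated above; the variable names are illustrative):

```python
K = 4
rng = np.random.default_rng(0)

# One dictionary for the centroid of each cluster,
# one for the data points currently assigned to each cluster
centroids = {}
clusters = {}

# Forgy initialization: random data points become the initial centroids
for i, idx in enumerate(rng.choice(len(data), size=K, replace=False)):
    centroids[i] = data[idx]
    clusters[i] = []
```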

Now that the centroids are initialized, we define three more functions. The first takes two data points and calculates the Euclidean distance between them, the second updates the centroid of each cluster, and the third assigns data points to clusters.
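
A sketch of those three functions (the names are illustrative, not necessarily those of the original code):

```python
def euclidean_distance(a, b):
    # Distance between two points in n-dimensional space
    return np.sqrt(((np.asarray(a) - np.asarray(b)) ** 2).sum())

def update_centroids(clusters, centroids):
    # The new centroid of a cluster is the mean of its assigned points;
    # an empty cluster keeps its previous centroid
    return {
        i: np.mean(points, axis=0) if points else centroids[i]
        for i, points in clusters.items()
    }

def assign_clusters(data, centroids):
    # Assign every data point to the cluster with the nearest centroid
    clusters = {i: [] for i in centroids}
    for point in data:
        nearest = min(
            centroids, key=lambda i: euclidean_distance(point, centroids[i])
        )
        clusters[nearest].append(point)
    return clusters
```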

The final three functions tie everything together: one checks whether the model has converged, one is the main training function, and one visualizes the model results.
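
A sketch of these three functions, building on the helpers above (the convergence tolerance is an assumption of mine):

```python
def has_converged(old_centroids, new_centroids, tol=1e-6):
    # Converged when no centroid moved more than the tolerance
    return all(
        euclidean_distance(old_centroids[i], new_centroids[i]) < tol
        for i in old_centroids
    )

def train(data, K, rng):
    # Forgy init, then alternate assignment and update until convergence
    idx = rng.choice(len(data), size=K, replace=False)
    centroids = {i: data[j] for i, j in enumerate(idx)}
    while True:
        clusters = assign_clusters(data, centroids)
        new_centroids = update_centroids(clusters, centroids)
        if has_converged(centroids, new_centroids):
            return new_centroids, clusters
        centroids = new_centroids

def visualize(centroids, clusters):
    # Scatter each cluster in its own colour, centroids as crosses
    for i, points in clusters.items():
        points = np.array(points)
        plt.scatter(points[:, 0], points[:, 1], s=15)
        plt.scatter(*centroids[i], marker="x", s=120, color="black")
    plt.show()
```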

Putting all the functions above together, we can train the model and visualize the results in two lines of code. By implementing the algorithm as a set of small, single-purpose functions, we keep the code readable, limit repetition, and make it easy to improve.
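
For example, with the functions sketched above:

```python
# Train the model on the generated data and plot the resulting clusters
final_centroids, final_clusters = train(data, K=4, rng=np.random.default_rng(0))
visualize(final_centroids, final_clusters)
```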

K-means Results — By Author

In the training run above, the model converged after 4 iterations and correctly identified the clusters we generated. That is a good result, but the resulting clusters are highly dependent on the initial centroids. If we train the model multiple times with different initial centroids, we can see it converge to a different minimum on different training runs. This is where k-means++ initialization is useful, as it increases the likelihood of converging to the global minimum.

Great! We have now implemented the k-means algorithm from scratch, but is this how it is done in the real world? Not really. In practice, you are more likely to use an existing library like Scikit-learn to train, test, and deploy models.

The benefits of using existing libraries are that they are optimized to reduce training time, they expose many configurable parameters, and they require much less code. Scikit-learn also contains many other machine learning models, all accessible through a consistent syntax.

In the following cell, we implement the same k-means clustering algorithm as above, except that the centroids are initialized with k-means++ by default. All of this is done in under 20 lines of code!
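
A sketch of what such a cell can look like with Scikit-learn's KMeans (parameter values are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# init="k-means++" is the default initialization strategy
model = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels = model.fit_predict(data)

# Plot the assigned clusters and the final centroids
plt.scatter(data[:, 0], data[:, 1], c=labels, s=15)
plt.scatter(*model.cluster_centers_.T, marker="x", s=120, color="black")
plt.show()
```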

Scikit-Learn Results — By Author

As expected, we are able to correctly identify the 4 clusters. Because the Scikit-learn implementation initializes the starting centroids using k-means++, the algorithm converges to the global minimum on almost every re-run of the training cycle.

Final Thoughts

K-means is the go-to unsupervised clustering algorithm: it is easy to implement and trains in next to no time. And because the model trains by minimizing the sum of distances between data points and their assigned centroids, it shares the familiar loss-minimization framing of many other machine learning models.

The code for this article can be found here.

Resources

  1. Pros and Cons of K-means — Google
  2. K-means Initialization Methods — Nitish Kumar Thakur
  3. Scikit-learn KMeans Documentation
  4. Elbow Method — Wikipedia
  5. K-means Clustering — Javatpoint
  6. StackEdit Markdown Editor

