A Simple Explanation of K-Means Clustering and Its Advantages

This article was published as part of the Data Science Blogathon.

Overview

K-means clustering is a widely used and powerful unsupervised machine learning algorithm, applied to many complex unsupervised learning problems. Before starting, let's take a look at the points we are going to cover.

Table of Contents

  • Introduction
  • How does the K-means algorithm work?
  • How to choose the value of K?
    • Elbow method.
    • Silhouette method.
  • Advantages of k-means.
  • Disadvantages of k-means.

Introduction

Let's understand the K-means clustering algorithm with its simple definition.

The K-means clustering algorithm attempts to group similar items into clusters. The number of clusters is represented by K.

Let's take an example. Suppose you go to a vegetable store to buy some vegetables. There you will see different types of vegetables, and you will notice that they are arranged in groups by type: all the carrots are kept in one place, the potatoes are kept together, and so on. Each vegetable sits with others of its own type, and together these groups are the clusters.

Now we will understand this with the help of a beautiful figure.

[Figure: the data before (figure 1) and after (figure 2) applying K-means clustering]

Now, look at the two figures above. What did you observe? Let's talk about the first figure. It shows the data before applying the K-means clustering algorithm. Here the three different categories are mixed together. When you see data like this in the real world, you will not be able to tell the categories apart.

Now, look at the second figure (figure 2). This shows the data after applying the K-means clustering algorithm. You can see that the items have been separated into three different groups, which are called clusters.

How does the K-means clustering algorithm work?

K-means clustering attempts to group similar items into clusters. It measures the similarity between the items and puts similar items in the same cluster. The K-means clustering algorithm works in three steps. Let's see what these three steps are.

  1. Select the value of k.
  2. Initialize the centroids.
  3. Assign each point to the cluster of its nearest centroid and recompute each centroid as the average of its cluster.
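The three steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names (euclidean, mean_point, kmeans), the choice to initialize centroids by sampling data points, the max_iter and seed parameters, and the sample data are all assumptions made for this example.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for the given points and number of clusters k."""
    rng = random.Random(seed)
    # Step 2: initialize the centroids by picking k data points at random
    # (an assumption of this sketch; other initializations exist).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3a: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Step 3b: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return centroids, labels


# Step 1: choose k. The toy data below has two obvious groups, so k = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other. For real work, a library implementation such as KMeans from scikit-learn is usually a better choice than hand-written code.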

Let's understand the above steps with the help of a figure, because a picture is worth a thousand words.

[Figure: K-means, four panels (figures 1 to 4) showing successive iterations]

We will understand each figure one by one.

  • Figure 1 shows data made up of two different kinds of elements. The first is shown in blue and the second in red. Here I chose K = 2 arbitrarily. There are different methods for choosing a suitable value of k.
  • In figure 2, we join the two selected centroids with a line and draw the perpendicular bisector of that line. Each point is assigned to the centroid on its side of the bisector, which is the closer one, and each centroid then moves to the average of its assigned points. If you look there, you will see that some of the red points have turned blue: these points now belong to the blue cluster.
  • The same process continues in figure 3. We join the two centroids, draw the perpendicular bisector, reassign the points, and move the centroids again. Once more, some of the red points turn into blue points.
  • The same process happens in figure 4. It continues until the assignments stop changing and we get two clearly separated clusters.

NOTE: K-means clustering uses the Euclidean distance to measure the distance between points.

You will find many explanations about the Euclidean distance on the Internet.
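As a quick reminder, the Euclidean distance between two points p and q is the square root of the sum of the squared differences of their coordinates. The short sketch below computes it by hand and with math.dist from the Python standard library (available from Python 3.8); the two sample points are made up for illustration.

```python
import math

p = (1.0, 2.0)
q = (4.0, 6.0)

# By hand: square the coordinate differences, sum them, take the square root.
by_hand = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# The standard library gives the same result.
built_in = math.dist(p, q)

print(by_hand, built_in)  # 5.0 5.0
```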

How to choose the value of K?

One of the most challenging tasks in this clustering algorithm is choosing the right value of k. What should the value of k be? How do you choose it? Let's find the answers to these questions. If you choose k at random, it may be right or wrong, and a wrong value will directly affect the performance of your model. There are two methods by which you can select a good value of k.

  1. Elbow method.
  2. Silhouette method.

Now, let's understand both concepts one by one in detail.

Elbow method

The elbow method is one of the best-known ways to select a good value of k and improve the performance of your model. Choosing k in this way is a form of hyperparameter tuning. Let's see how the elbow method works.

It is an empirical method for finding the best value of k. Run K-means over a range of k values and, for each one, calculate the within-cluster sum of squares (WCSS): the sum of the squared distances between each point and the centroid of its cluster.

[Figure: elbow method]

When the value of k is 1, the within-cluster sum of squares will be high. As the value of k increases, the within-cluster sum of squares will decrease.

Finally, we plot the within-cluster sum of squares against the k values and examine the chart carefully. At some point the curve stops falling steeply and starts to flatten out, forming an "elbow". The k at that point is taken as the value of k.

[Figure: WCSS plotted against k]
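The elbow procedure can be sketched as follows. This is a hedged illustration, not a definitive implementation: the helper names (kmeans, wcss, elbow_curve), the use of several random restarts per k, and the sample data with three visible groups are assumptions of this example.

```python
import math
import random


def kmeans(points, k, seed, max_iter=100):
    """Compact K-means: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random data points as initial centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, labels


def wcss(points, centroids, labels):
    """Within-cluster sum of squares: squared distance of each point to its centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def elbow_curve(points, k_values, n_restarts=10):
    """For each k, keep the lowest WCSS found over several random restarts."""
    curve = {}
    for k in k_values:
        curve[k] = min(
            wcss(points, *kmeans(points, k, seed=s)) for s in range(n_restarts)
        )
    return curve


points = [(1, 1), (1, 2), (2, 1),
          (8, 8), (8, 9), (9, 8),
          (1, 8), (2, 9), (1, 9)]
curve = elbow_curve(points, k_values=range(1, 6))
for k, value in curve.items():
    print(k, round(value, 2))
```

Plotting the printed values against k gives the elbow chart. With scikit-learn, the same quantity is available as the inertia_ attribute of a fitted KMeans model.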

Silhouette method

The silhouette method is somewhat different. Like the elbow method, it takes a range of k values, but it draws a silhouette plot for each one. Calculate the silhouette coefficient of each point: find the average distance from the point to the other points within its own cluster, called a(i), and the average distance from the point to the points of its next closest cluster, called b(i).

[Figure: silhouette]

Note: Ideally the a(i) value should be much smaller than the b(i) value, that is, a(i) << b(i).

Now we have the values of a(i) and b(i). We will calculate the silhouette coefficient using the following formula.

s(i) = (b(i) - a(i)) / max(a(i), b(i))

Now we can calculate the silhouette coefficient of all the points in the clusters and plot the silhouette graph. This graph is also useful for detecting outliers. The silhouette coefficient lies between -1 and 1.

Note that a silhouette coefficient of -1 is the worst case.

Look at the graph and check for which value of k the silhouette coefficient is closest to 1.

[Figure: silhouette method]

Also check which chart has the fewest outliers, that is, the fewest negative values. Then choose that value of k for your model.
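The silhouette coefficient described in this section can be computed with a short sketch like the one below. The function name silhouette_values, the convention of scoring single-point clusters as 0, and the sample points and labelings are assumptions of this example. It expects at least two clusters and points that are not all identical.

```python
import math


def silhouette_values(points, labels):
    """Return the silhouette coefficient s(i) for every point."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    values = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            values.append(0.0)  # assumed convention for single-point clusters
            continue
        # a(i): mean distance to the other points in the same cluster.
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean distance to the points of any other cluster.
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        values.append((b - a) / max(a, b))
    return values


points = [(0, 0), (0, 1), (10, 0), (10, 1)]

good_scores = silhouette_values(points, [0, 0, 1, 1])  # sensible grouping
bad_scores = silhouette_values(points, [0, 1, 0, 1])   # points grouped badly

print(sum(good_scores) / len(good_scores))  # close to 1
print(sum(bad_scores) / len(bad_scores))    # negative
```

To choose k, the average of these values can be compared across clusterings produced with different k. scikit-learn offers silhouette_score and silhouette_samples for the same purpose.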

Advantages of K-means

  1. It is very easy to implement.
  2. It scales to large data sets and runs fast on them.
  3. It adapts easily to new examples.
  4. It can be generalized to clusters of different shapes and sizes.

Disadvantages of K-means

  1. It is sensitive to outliers.
  2. Choosing the value of k manually is hard work.
  3. As the number of dimensions increases, distances between points become less informative, so it scales poorly to high-dimensional data.
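The sensitivity to outliers comes from the centroid being an average. A small sketch, with a hypothetical helper named centroid and made-up points, shows how far a single outlier can drag it:

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
without_outlier = centroid(cluster)
with_outlier = centroid(cluster + [(100.0, 100.0)])

print(without_outlier)  # (2.0, 2.0)
print(with_outlier)     # (26.5, 26.5)
```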
