This article was published as part of the Data Science Blogathon.
Clustering is an unsupervised machine learning technique that groups unlabeled datasets. Its goal is to form clusters from the data points in a dataset such that there is high similarity within each cluster and low similarity between different clusters. In simple terms, clustering aims to form subsets or groups within a dataset consisting of data points that are genuinely similar to each other, while the groups (or subsets, or clusters) formed can differ significantly from one another.
Suppose we have a dataset and know nothing about it. A clustering algorithm can then discover groups of objects in which the average distances between the members (data points) of each group are smaller than the distances to the members of other groups.
Some practical real-life applications of clustering include:
1) Customer segmentation: Finding groups of customers with similar behavior in a large customer database (a practical example using bank customer segmentation is given below).
2) Network traffic classification: Grouping traffic sources by their characteristics. Traffic types can then be classified easily using the clusters.
3) Spam filtering: Data is grouped into different sections (header, sender, and content), which can then help classify which messages are spam.
4) City planning: Grouping houses according to their geographical location, value, and house type.
Different types of clustering algorithms
1) K-means clustering – Using this algorithm, we partition a given dataset into a certain number of predefined clusters, or 'k' clusters.
2) Hierarchical clustering – Follows two approaches: divisive and agglomerative.
Agglomerative clustering treats each observation as its own cluster and then merges similar data points step by step until they form a single cluster; divisive clustering works in exactly the opposite direction.
3) Fuzzy C-means clustering – The FCM algorithm works much like the k-means clustering algorithm; the main difference is that in FCM a data point can belong to more than one cluster.
4) Density-based spatial clustering – Useful in application areas that require nonlinear cluster structures, as it is based purely on density.
Now, in this article, we will focus in depth on the k-means clustering algorithm: a theoretical explanation of how k-means operates, its advantages and disadvantages, and a solved practical clustering problem that will reinforce the theory and give you proper insight into how k-means clustering works.
What is k-means clustering?
K-means clustering is an unsupervised learning algorithm used to group an unlabeled dataset into different clusters or subsets.
Now you must be wondering what 'k' and 'means' stand for in k-means clustering.
Putting all assumptions aside, 'k' defines the number of predefined clusters to be created in the clustering process: if k = 2, there will be two clusters; for k = 3, there will be three clusters; and so on. As this is a centroid-based algorithm, 'means' in k-means clustering refers to the centroid (mean) of the data points, where each cluster is associated with a centroid. The concept of a centroid-based algorithm is explained in the description of how k-means works below.
Principally, the k-means clustering algorithm performs two tasks:
- Determines the most optimal positions for the K center points, or centroids, using an iterative process.
- Assigns each data point to its nearest k-center. A cluster is formed by the data points that are close to a particular k-center.
How does k-means clustering work?
Suppose we have two variables, X1 and X2, plotted on the scatter plot below:
(1) Suppose the value of k, the number of predefined clusters, is 2 (k = 2), so here we will group our data into 2 clusters.
We need to choose k random points to form the clusters. There is no restriction on selecting the k random points from inside or outside the dataset. Here, we consider 2 points as our k points (which are not part of our dataset), shown in the following figure:
(2) The next step is to assign each data point in the scatter plot to its nearest k point. This is done by calculating the Euclidean distance between each point and each k point, and then drawing a median line between the two centroids, as shown in the figure below.
We can clearly observe that the points to the left of the red line are nearer to K1, the blue centroid, and the points to the right of the red line are nearer to K2, the orange centroid.
(3) Since each point must be assigned to its closest centroid, we repeat the process by choosing new centroids. To choose the new centroids, we calculate the center of gravity (the mean) of the points in each cluster, giving the new centroids shown below:
(4) Now, we need to reassign each data point to its new nearest centroid. For this, we repeat the same process of drawing a median line. The new median line is shown below:
In the picture above, we can see one orange point on the left side of the line and two blue points to the right of the line. These three points will therefore be assigned to the new centroids.
We continue finding new centroids until no points switch sides of the line.
Now we can discard the assumed centroids, and the final two clusters will be as shown in the image below:
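The assign-and-recenter loop described in steps (1)–(4) can be sketched in a few lines of NumPy. This is only a minimal illustration of the idea, not the code used in the practical section later; the two-blob toy data and the function name `kmeans` are made up for the example:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """A bare-bones k-means: repeat the assign / re-center steps described above."""
    rng = np.random.default_rng(seed)
    # (1) pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # (2) assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) move each centroid to the center of gravity (mean) of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # (4) stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# two well-separated blobs of points around (0, 0) and (10, 10)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

On data this well separated, the loop converges in a handful of iterations and recovers the two blobs regardless of which points are chosen as initial centroids.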
So far we have seen how the k-means algorithm works and the different steps involved in arriving at the final, well-separated clusters.
Now you must be wondering how to choose the value of k, the number of clusters.
The performance of the k-means clustering algorithm depends largely on the clusters it forms. Choosing the optimal number of clusters is a difficult task. There are several ways to find the optimal number of clusters; here we discuss two methods for finding the number of clusters (the value of k): the elbow method and silhouette scoring.
Elbow method for finding the 'k' number of clusters:
The elbow method is the most popular way to find an optimal number of clusters. This method uses WCSS (the within-cluster sum of squares), which represents the total variation within a cluster.
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
In the formula above, Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point Pi in Cluster 1 and its centroid C1, and similarly for the other two terms.
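The formula translates directly into code. Below is a small self-contained example with made-up points; the function and variable names are illustrative, not taken from the article's own code:

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum over clusters of the squared distances between each point and its centroid."""
    total = 0.0
    for j, c in enumerate(centroids):
        pts = X[labels == j]          # the Pi belonging to cluster j
        total += ((pts - c) ** 2).sum()  # sum of distance(Pi, Cj)^2
    return total

# four made-up points split into two obvious clusters
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
print(wcss(X, labels, centroids))  # 2.25
```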
Steps involved in the elbow method:
- K-means clustering is performed for different values of k (from 1 to 10).
- WCSS is calculated for each value of k.
- A curve is drawn between the WCSS values and the number of clusters k.
- The point of sharp bend, where the plot looks like an arm, is taken as the best value of K.
So here, as we can see, the sharp bend occurs at k = 3, so the optimal number of clusters is 3.
Silhouette score method for finding the 'k' number of clusters
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, the clustering configuration is appropriate. If many points have a low or negative value, the clustering configuration may have too many or too few clusters.
The example shows how we can choose the value of 'k': since the maximum silhouette score occurs at n = 3, we choose k = 3.
Advantages of using k-means clustering
- Easy to implement.
- With a large number of variables, k-means can be computationally faster than hierarchical clustering (if K is small).
- k-means can produce tighter clusters than hierarchical clustering.
Disadvantages of using k-means clustering
It is difficult to predict the number of clusters (the value of K).
Initial seeds have a strong impact on the final results.
Practical implementation of the k-means clustering algorithm using Python (bank customer segmentation)
Here we are importing the necessary libraries for our analysis.
Read the data and look at the first 5 observations to get a feel for the dataset.
The code for EDA (exploratory data analysis) is not included here; EDA was performed on this data, along with an outlier analysis, to clean the data and make it suitable for our analysis.
As we know, k-means is performed only on numerical data, so we chose the numerical columns for our analysis.
Now, to perform k-means clustering as discussed earlier in this article, we need to find the number of clusters 'k'. We can do this with the following code, where we run clustering for various values of k and then select one using the elbow method.
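The original code listing is not reproduced here, but a typical elbow-method loop with scikit-learn looks like the sketch below. Since the bank dataset itself is not included, synthetic blob data stands in for the numerical customer columns:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in for the numerical customer columns (the bank data is not shown here)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

wcss_values = []
for k in range(1, 11):  # k-means clustering for k = 1 to 10
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss_values.append(km.inertia_)  # inertia_ is the within-cluster sum of squares

# plot k vs. WCSS and look for the elbow, e.g. with matplotlib:
# plt.plot(range(1, 11), wcss_values, marker="o")
```

The `inertia_` attribute of a fitted `KMeans` model is exactly the WCSS defined earlier, so no manual distance computation is needed.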
As the number of clusters increases, the variance (the within-cluster sum of squares) decreases. The elbow at 3 or 4 clusters represents the most parsimonious balance between minimizing the number of clusters and minimizing the variance within each cluster, so we can choose a value of k of 3 or 4.
Next, we show how the silhouette score method can be used to find the value of 'k'.
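A sketch of the silhouette loop, again on synthetic stand-in data rather than the bank dataset; scikit-learn's `silhouette_score` is used, which requires at least 2 clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in range(2, 11):  # silhouette is undefined for a single cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

# pick the k with the highest average silhouette score
best_k = max(scores, key=scores.get)
```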
If we observe, we get the optimal number of clusters at n = 3, so finally we can choose k = 3.
Now, fit the k-means algorithm using k = 3 and draw a heat map for the clusters.
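A hedged sketch of this final step: the column names below (`age`, `credit_amount`, `duration`) are illustrative stand-ins for the bank data's numerical columns, and the values are randomly generated, so the resulting cluster profiles will not match the article's. In practice the features would also be scaled before clustering:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# hypothetical numerical columns standing in for the bank data (names are illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 70, 300),
    "credit_amount": rng.integers(500, 20000, 300),
    "duration": rng.integers(6, 60, 300),
})

# fit k-means with the chosen k = 3 and attach the cluster label to each customer
km = KMeans(n_clusters=3, n_init=10, random_state=42)
df["cluster"] = km.fit_predict(df[["age", "credit_amount", "duration"]])

# per-cluster feature means; a heat map can be drawn over this table,
# e.g. with seaborn: sns.heatmap(profile, annot=True)
profile = df.groupby("cluster").mean()
print(profile)
```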
Cluster 0: Young customers who get low-credit loans for a short period
Cluster 1: Middle-aged customers who get high-credit loans for an extended period
Cluster 2: Elderly customers who get medium-credit loans for a short period
We have discussed what clustering is, its types, and its applications in different industries. We discussed what k-means clustering is, how the k-means clustering algorithm operates, two methods for selecting the number of clusters 'k', and its advantages and disadvantages. Finally, we walked through a practical implementation of the k-means clustering algorithm in Python using the bank customer segmentation problem.
img (1) to img (8): reference taken from “K-means clustering algorithm”