K-Means Clustering Simplified in Python

Overview

  • What is K-Means Clustering?
  • Implementation of K-Means Clustering
  • WCSS and elbow method to find the number of clusters
  • Python Implementation of K Means Clustering

K-Means is one of the most popular unsupervised machine learning algorithms, used to solve clustering problems. It segregates unlabeled data into multiple groups, called clusters, based on similar features and common patterns.

Table of Contents

  1. What is clustering?
  2. What is the K-Means algorithm?
  3. Schematic implementation of K-Means clustering
  4. Choose the correct number of clusters
  5. Python implementation

1. What is clustering?

Suppose we have N unlabeled multivariate data points describing various animals such as dogs, cats, birds, etc. The technique of segregating a dataset into multiple groups, based on similar features and characteristics, is called clustering.

The groups that are formed are known as clusters. Clustering is used in several fields, such as image recognition and spam filtering.

Clustering is an unsupervised learning technique in machine learning: multivariate data can be segregated into multiple groups, without any supervisor, based on a common pattern hidden within the dataset.

2. What is the K-Means algorithm?

The K-Means algorithm is an iterative algorithm that divides a set of n data points into k subgroups (clusters) based on similarity and on their mean distance from the centroid of the subgroup formed.

Here, K is the predefined number of clusters that the algorithm will form. If K = 3, the number of clusters formed from the dataset is 3.

Steps of the K-Means algorithm

The operation of the K-Means algorithm is explained in the following steps:

Step 1: Select the value of K to decide the number of clusters to form.

Step 2: Select K random points that will act as centroids.

Step 3: Assign each data point to its nearest centroid, based on its distance from the randomly selected points (centroids). This forms the K predefined clusters.

Step 4: Place a new centroid for each cluster.

Step 5: Repeat step 3, which reassigns each data point to the new closest centroid.

Step 6: If any reassignment occurs, go to step 4; otherwise, go to step 7.

Step 7: Finish.
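
The steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: the function name kmeans_sketch, its parameters, and the toy data are hypothetical, the initial centroids are drawn from the data points themselves, and distance is assumed to be Euclidean.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Illustrative K-Means following the steps above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 6: stop once no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points each.
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster receives label 0 depends on the random initialization.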

3. Schematic implementation of K-Means clustering

STEP 1: Let's choose the number of clusters K, say K = 2, to segregate the dataset into two clusters. We choose 2 random points that will act as centroids to form the clusters.

STEP 2: Now we assign each data point on the scatter plot to its nearest K point, or centroid, based on distance. This is done by drawing a median line between the two centroids (the perpendicular bisector of the segment joining them).

STEP 3: The points on the left side of the line are nearer the blue centroid, and the points on the right side are nearer the yellow centroid. The points on the left form a cluster with the blue centroid, and those on the right form a cluster with the yellow centroid.

STEP 4: Repeat the process by choosing new centroids. To choose the new centroids, we find the center of gravity of the points in each cluster.

STEP 5: Next, we reassign each data point to its new nearest centroid, repeating the same process as before (using a median line). A yellow data point that falls on the blue side of the median line is included in the blue cluster.

STEP 6: Since a reassignment has taken place, we repeat the previous step of finding new centroids.

STEP 7: We repeat the process of finding the center of gravity of each cluster to obtain the new centroids.

STEP 8: After finding the new centroids, we draw the median line again and reassign the data points, as in the previous steps.

STEP 9: Finally, we segregate the points based on the median line, so that two clusters are formed and no dissimilar points are included in the same cluster.

[Figure: the final clusters that are formed]

4. Choose the correct number of clusters

The number of clusters we choose for the algorithm should not be picked at random. Each cluster is formed by calculating and comparing the mean distances of the data points within the cluster from its centroid.

We can choose the right number of clusters with the help of the Within-Cluster Sum of Squares (WCSS) method.

WCSS is the sum of the squared distances of the data points in each cluster from that cluster's centroid.

The main idea is to minimize the distance between the data points and the centroid of their cluster. The process is iterated until the sum of distances reaches a minimum value.
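
As a rough check of the WCSS definition above, the sketch below computes it by hand and compares it with the inertia_ attribute of a fitted scikit-learn KMeans model, which is the quantity the code later in this article reads. The toy data and variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of three points each.
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_points)

# WCSS: for every point, the squared distance to the centroid of its own
# cluster, summed over all points.
own_centroids = model.cluster_centers_[model.labels_]
wcss_by_hand = float(np.sum((toy_points - own_centroids) ** 2))

print(wcss_by_hand)
print(model.inertia_)
```

For these six points each cluster contributes 4/3, so both values should come out at about 2.667.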

To find the optimal number of clusters, the elbow method follows the steps below:

1 Run K-Means clustering on the given dataset for different values of K (ranging from 1 to 10).

2 For each value of K, calculate the WCSS value.

3 Plot a curve of the WCSS values against the respective number of clusters K.

4 The sharp point of bend in the curve, where the plot looks like the elbow joint of an arm, is taken as the best (optimal) value of K.

5. Python implementation

Import relevant libraries

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans

Loading the data

data = pd.read_csv('Countryclusters.csv')
data

Plot the data

plt.scatter(data['Longitude'],data['Latitude'])
plt.xlim(-180,180)
plt.ylim(-90,90)
plt.show()

Select the features

x = data.iloc[:, 1:3]  # first slice selects rows (all of them), second selects columns 1 and 2
x

Clustering

kmeans = KMeans(3)  # K = 3 clusters
kmeans.fit(x)

Clustering results

identified_clusters = kmeans.fit_predict(x)
identified_clusters
array([1, 1, 0, 0, 0, 2])
data_with_clusters = data.copy()
data_with_clusters['Clusters'] = identified_clusters 
plt.scatter(data_with_clusters['Longitude'],data_with_clusters['Latitude'],c=data_with_clusters['Clusters'],cmap='rainbow')
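
Once the model is fitted, it can also be useful to look at where the centroids ended up and to assign a new point to a cluster. The sketch below uses the cluster_centers_ attribute and the predict method of scikit-learn's KMeans; the six coordinates are hypothetical stand-ins, not the contents of Countryclusters.csv.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical coordinates standing in for the Longitude/Latitude columns.
toy = pd.DataFrame({
    'Longitude': [-100.0, -95.0, 2.0, 10.0, 5.0, 135.0],
    'Latitude': [40.0, 60.0, 47.0, 51.0, 54.0, -25.0],
})

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy)

# One row per cluster: the centroid's longitude and latitude.
centers = pd.DataFrame(toy_kmeans.cluster_centers_, columns=toy.columns)
print(centers)

# Assign a new, unseen point to the nearest learned centroid.
new_point = pd.DataFrame({'Longitude': [0.0], 'Latitude': [50.0]})
print(toy_kmeans.predict(new_point))
```

The new point should receive the same label as the three nearby rows (indices 2 to 4), since their centroid is the closest one.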

Trying a different method (to find the number of clusters to select)

WCSS and elbow method

wcss = []
for i in range(1, 7):
    kmeans = KMeans(i)
    kmeans.fit(x)
    wcss_iter = kmeans.inertia_  # WCSS for this value of K
    wcss.append(wcss_iter)

number_clusters = range(1,7)
plt.plot(number_clusters,wcss)
plt.title('The Elbow title')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')

We can choose 3 as the number of clusters; this method shows what a good number of clusters is.

With this I end this blog.
Hi everyone, Namaste
My name is Pranshu Sharma and I am a data science enthusiast
Thank you very much for taking your valuable time to read this blog. Feel free to point out any mistakes (after all, I am a learner) and to provide feedback or leave a comment.
Dhanyvaad !!

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.
