K Means Clustering In R

K-means clustering is an unsupervised learning technique: it is used when the input data is unlabeled and we have to find hidden patterns or clusters in the data set. In clustering, we look for similarities among the data points of a given data set and form clusters accordingly. The clusters are formed so that the data points within a cluster are similar to each other and differ significantly from the points in other clusters.

Suppose Meera is a data science student and her teacher asks her to make groups of students. Meera can form the groups in various ways, for example:

  • She can simply divide the students according to gender.
  • Students can be grouped according to the grades they secured last year.
  • Students can be grouped according to hobbies such as dancing or singing.

Each of these groupings is a form of clustering. However, Meera decides to apply k-means clustering to make the groups. Let’s see how she does it. Suppose she has the marks of the students in the first exam.

The marks are as follows:
35, 40, 72, 65, 82, 73, 90, 46, 87, 92, 45, 76, 56, 58, 78.
She decides to make 3 groups.

Step 1: Since she has decided to make 3 groups, she randomly picks 3 centroids from the data: C1 = 40, C2 = 56, C3 = 87.

Step 2: She calculates the distance of each data point from the three centroids and assigns the data point to the cluster whose centroid is nearest. For example, 72 is at a distance of 32 from C1, 16 from C2 and 15 from C3, so it goes to cluster 3.

Doing the same for each data point, she gets the following three clusters:
Cluster 1: 35, 40, 46, 45
Cluster 2: 56, 65, 58
Cluster 3: 72, 82, 73, 90, 87, 92, 76, 78
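
The assignment in step 2 can be checked with a few lines of R. This is only a minimal sketch: the variable names (marks, centroids, distances, assignment, clusters) are chosen here for illustration, and the distance used is the absolute difference between a mark and a centroid.

```r
# Marks from the first exam and the three initial centroids from step 1
marks <- c(35, 40, 72, 65, 82, 73, 90, 46, 87, 92, 45, 76, 56, 58, 78)
centroids <- c(C1 = 40, C2 = 56, C3 = 87)

# Absolute distance of every mark (rows) from every centroid (columns)
distances <- abs(outer(marks, centroids, "-"))

# Each mark joins the cluster whose centroid is closest
assignment <- apply(distances, 1, which.min)

# List the marks that ended up in each cluster
clusters <- split(marks, assignment)
print(clusters)
```

The list printed at the end has one entry per cluster, so it can be compared directly with the clusters Meera wrote down.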

Step 3: Calculating the average of each cluster, she gets the new centroids: C1 = 41.5, C2 = 59.67, C3 = 81.25.

With the new centroids she repeats step 2 and obtains the following clusters:
Cluster 1: 35, 40, 46, 45
Cluster 2: 56, 65, 58
Cluster 3: 72, 82, 73, 90, 87, 92, 76, 78

The clusters are the same as those obtained in step 2, so the procedure stops here. This is how Meera found 3 groups using k-means clustering.

Let’s have a look at the k-means clustering algorithm.

Step 1: Choose k, the number of clusters into which you want to divide your data, and pick k initial centroids (for example, k randomly chosen data points, as Meera did). Sometimes a suitable value of k is not obvious from looking at the data; in that case we use the elbow method, which is discussed after the algorithm.

Step 2: Calculate the distance of each data point from each centroid, and assign the data point to the cluster whose centroid is nearest.

Step 3: Once all data points are assigned, calculate the average of each cluster, take these averages as the new centroids, and go back to step 2. Continue this loop until the clusters stop changing.

The problem of deciding the value of k can be solved using the elbow method. In this method, we calculate the sum of squared errors (SSE), i.e. the sum of the squared distances (here, Euclidean distances) between the data points and the centroids of their clusters, for a range of k values, and then plot the SSE against the number of clusters. The resulting graph looks like a bent arm.

The elbow of the arm is the best value of k. Although the SSE keeps decreasing as the number of clusters increases, we cannot simply pick the k with the smallest SSE: for large k the clusters become so small that, in the extreme, every data point forms its own cluster and the SSE drops to zero without revealing any structure.
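
To make the loop and the SSE concrete, here is a minimal sketch of the three steps for one-dimensional data such as Meera's marks. The helper simple_kmeans() and its arguments are hypothetical names introduced for illustration, not part of any package, and the sketch assumes that no cluster ever becomes empty.

```r
marks <- c(35, 40, 72, 65, 82, 73, 90, 46, 87, 92, 45, 76, 56, 58, 78)

# Hypothetical helper: one-dimensional k-means from given starting centroids.
# Assumes every cluster keeps at least one point (true for this data).
simple_kmeans <- function(x, centroids, max_iter = 100) {
  assignment <- rep(0L, length(x))
  for (i in seq_len(max_iter)) {
    # Step 2: assign each point to its nearest centroid
    new_assignment <- apply(abs(outer(x, centroids, "-")), 1, which.min)
    # Stop once the clusters no longer change
    if (identical(new_assignment, assignment)) break
    assignment <- new_assignment
    # Step 3: the mean of each cluster becomes its new centroid
    centroids <- as.numeric(tapply(x, assignment, mean))
  }
  # SSE: squared distance of each point from its own cluster's centroid
  sse <- sum((x - centroids[assignment])^2)
  list(centroids = centroids, assignment = assignment, sse = sse)
}

result <- simple_kmeans(marks, centroids = c(40, 56, 87))
print(round(result$centroids, 2))
print(result$sse)
```

Starting from Meera's centroids 40, 56 and 87, the loop settles on centroids of about 41.5, 59.67 and 81.25. Calling the helper with a different number of starting centroids gives the SSE for a different k, which is the quantity the elbow method plots.
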
Let’s see how to do k-means clustering in R. The iris data set is easy to understand, so I’m going to use it here. Let’s store the data set in a variable.

To have a look at the first few rows of the data, we can use the head() command.
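
A minimal sketch of these two steps, assuming the copy of iris that ships with base R; the variable name iris_data is chosen here for illustration:

```r
# iris is available in base R (datasets package), so no download is needed
iris_data <- iris   # `iris_data` is an assumed variable name

# Show the first six rows: four numeric measurements plus the Species label
head(iris_data)
```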

Since we apply the k-means algorithm to find natural groupings in the data, we assume that we don’t know how the data is classified. So let us remove the last column, which holds the species labels.

Now let us see how the new data looks.
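
One way to do this, sketched under the assumption that the label is the fifth column (Species) of the built-in iris data; iris_features is an illustrative name:

```r
iris_data <- iris

# Drop the fifth column (Species) and keep the four numeric measurements
iris_features <- iris_data[, -5]

# Look at the first six rows of the unlabeled data
head(iris_features)
```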

Let us apply k-means clustering to this data. This is easy in R: you just have to use the kmeans() command.
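
A minimal sketch of the call, using kmeans() from base R's stats package. The seed, the nstart value and the variable names are assumptions made for this example; a seed is set because kmeans() starts from random centroids.

```r
# The four numeric columns of the built-in iris data (labels removed)
iris_features <- iris[, -5]

set.seed(42)  # assumed seed, only to make the random start repeatable

# Ask for 3 clusters; nstart = 20 tries 20 random starts and keeps the best
km <- kmeans(iris_features, centers = 3, nstart = 20)

km$centers   # centroid of each cluster
km$size      # number of points in each cluster

# Compare the clusters found with the species labels that were set aside
table(km$cluster, iris$Species)
```

The table at the end is only a sanity check: the species labels are never shown to kmeans().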

Here we already knew that there are 3 clusters in the data. What if we don’t know how many clusters to divide the data into? As discussed above, in such situations we use the elbow method: we compute the SSE for a range of k values, plot k against the SSE, and take the elbow of the arm as k.
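
The elbow plot can be sketched as follows. The range of k (1 to 10), the seed and the variable names are assumptions; the SSE is read from the tot.withinss component that kmeans() returns.

```r
# The four numeric columns of the built-in iris data (labels removed)
iris_features <- iris[, -5]

set.seed(42)      # assumed seed, for repeatable random starts
k_values <- 1:10  # assumed range of k to try

# Total within-cluster sum of squares (the SSE) for each k
sse <- sapply(k_values, function(k) {
  kmeans(iris_features, centers = k, nstart = 20)$tot.withinss
})

# Plot the number of clusters against the SSE and look for the bend
plot(k_values, sse, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "SSE (total within-cluster sum of squares)")
```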

[Figure: K-means clustering, elbow plot of SSE against the number of clusters k]
We choose the value of k at the elbow of the arm. However, the above graph shows two such points, one at k = 2 and another at k = 3. We go for the one with the smaller SSE, which is k = 3. Thus we select k = 3 using the elbow method.
