Steps Of K-Means Clustering In R
In this article, we walk through the steps of K-Means Clustering in R. Have you noticed that, at a restaurant, you tend to tag people with coats and laptop cases as business executives, and teens carrying books and wearing casual clothes as college students? This is nothing but spotting patterns in a group of people. Now consider data on automobile drivers, for which we have information such as the car’s age and type, and the driver’s age, gender, education, and location. We may want to segment the drivers into groups in order to estimate claim frequency.
Clustering can be used to improve predictive accuracy by segmenting databases into more homogeneous groups. Then the data of each group can be explored, analyzed, and modeled.
Clustering is used to classify items or cases into relatively homogeneous groups called clusters. Objects in one cluster tend to be similar to each other and dissimilar to objects in the other clusters.
K-Means Clustering groups items or observations into a collection of K clusters. The number of clusters, K, may either be specified in advance or determined as part of the clustering procedure. K-Means clustering is included in the Machine Learning section of CS2 (Risk Modelling and Survival Analysis). Let’s have a look at the procedure and how it’s applied in R.
The procedure can be summarized as follows:
1. Partition the items into K initial clusters. K is an initial estimate of the number of clusters; it can be set according to business requirements or chosen with the widely used elbow method, which is discussed later in this article.
2. Calculate the Euclidean distance between each item and each cluster centroid (mean), using either standardized or unstandardized observations. Assign each item to the cluster whose centroid is nearest. Recalculate the centroid for the cluster receiving the item and for the cluster losing it.
3. Repeat Step 2 until no more reassignments take place.
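The three steps above can be sketched in a few lines of R. This is only an illustrative sketch, not how kmeans( ) is implemented: simple_kmeans( ) is a hypothetical helper, the data are simulated, and for brevity it reassigns all items at once before recalculating the centroids (a batch variant) rather than updating the centroids after each single item moves.

```r
# Toy K-means sketch (batch variant). simple_kmeans() is a hypothetical helper
# written for illustration; it is not part of any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: initial clusters, here seeded by k randomly chosen rows
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2a: squared Euclidean distance from every item to every centroid
    d <- matrix(0, nrow(x), k)
    for (j in seq_len(k)) {
      d[, j] <- rowSums(sweep(x, 2, centroids[j, ])^2)
    }
    # Assign each item to its nearest centroid
    new_cluster <- max.col(-d, ties.method = "first")
    # Step 3: stop when no item changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 2b: recalculate the centroid (mean) of each non-empty cluster
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centroids[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centroids)
}

# Simulated data: two groups of 20 points each (assumption, for illustration)
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)   # cluster sizes
fit$centers          # final centroids
```

In practice the built-in kmeans( ) function should be preferred; the sketch is only meant to make the assign-and-recalculate loop concrete.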
Let’s have a look at K-Means Clustering on the Wholesale Customer dataset (ref. UCI Machine Learning Repository) using R. The dataset includes the following variables:
- FRESH: annual spending on fresh products (Continuous)
- MILK: annual spending on milk products (Continuous)
- GROCERY: annual spending on grocery products (Continuous)
- FROZEN: annual spending on frozen products (Continuous)
- DETERGENTS_PAPER: annual spending on detergents and paper products (Continuous)
- DELICATESSEN: annual spending on delicatessen products (Continuous)
STEPS OF K-MEANS CLUSTERING IN R
The kmeans( ) function comes from the ‘stats’ package in R. This package ships with R and is loaded by default, so no separate installation is needed. The analysis proceeds in the following steps:
Step 1: Read the data using the Import Dataset option or read.csv( ), and assign it to data1.
Step 2: Get the descriptive statistics of the data using summary( ) in R.
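As a rough sketch of Steps 1 and 2, the snippet below reads a CSV file into data1 and summarizes it. The file written here is a tiny stand-in with made-up values, used only so that the example runs anywhere; csv_path is a hypothetical name, and in practice you would point it to your own copy of the dataset.

```r
# Tiny stand-in file with made-up values (NOT the real dataset), written to a
# temporary location so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "FRESH,MILK,GROCERY,FROZEN,DETERGENTS_PAPER,DELICATESSEN",
  "1200,900,750,210,260,130",
  "7000,9800,9500,1700,3200,1700",
  "13000,1100,4200,6400,500,1700",
  "22000,5400,7100,3900,1700,5100"
), csv_path)

# Step 1: read the data and assign it to data1
data1 <- read.csv(csv_path)

# Step 2: descriptive statistics (min, quartiles, mean, max) for each column
str(data1)
summary(data1)
```

Before clustering, check with str( ) that every column you intend to use is numeric.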
Step 3: Here, we observe that some variables have a much larger range of values than others. Variables with a larger range tend to dominate the distance calculation, so we standardize all the variables to put them on a comparable scale. We rescale each variable so that it has a mean of 0 and a standard deviation of 1; the rescaled value is the z-score, z = (x - mean) / standard deviation.
A large z-score implies that the observation is far from the mean in terms of standard deviations, e.g. a z-score of 3 means that the observation is 3 standard deviations above the mean.
We rescale the data using scale( ) in R.
data1 <- scale(data1)
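To see what scale( ) does, the sketch below compares its output with a z-score computed by hand. The figures in toy are made up for illustration only.

```r
# Made-up spending figures on very different scales (illustration only)
toy <- data.frame(FRESH  = c(12000, 7000, 6500, 13000, 22000),
                  FROZEN = c(200, 1700, 2400, 6400, 3900))

scaled <- scale(toy)   # centers each column and divides by its standard deviation

# Manual z-score for one column: (x - mean) / sd
z_fresh <- (toy$FRESH - mean(toy$FRESH)) / sd(toy$FRESH)

all.equal(as.numeric(scaled[, "FRESH"]), z_fresh)  # should agree with scale()
round(colMeans(scaled), 10)                        # column means after scaling
apply(scaled, 2, sd)                               # column standard deviations
```

Note that scale( ) returns a matrix rather than a data frame, which kmeans( ) accepts directly.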
Step 4: Now we need to find the optimal number of clusters, K. The elbow method analyses how homogeneity within the clusters changes for various values of K. Homogeneity within clusters usually increases (and heterogeneity decreases) as additional clusters are added. The goal is to find the value of K beyond which there is negligible gain in information. If one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters add much information (explain a lot of variance), but at some point the marginal gain drops, which shows up as an elbow in the curve; hence the name elbow method. Equivalently, the total within-cluster sum of squares falls steeply at first and then flattens. We do this in R using a function that returns the total within-cluster sum of squares for each number of clusters from 1 to 10.
# Total within-cluster sum of squares for K = 1 to 10
withss <- sapply(1:10, function(k) {
  kmeans(data1, centers = k, nstart = 50, iter.max = 15)$tot.withinss
})
We then plot the within sum of squares against the number of clusters, using plot( ) in R.
plot(1:10, withss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters", ylab = "Within Sum of squares")
axis(1, at = 1:10, labels = seq(1, 10, 1))
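The "percentage of variance explained" mentioned in Step 4 can be read off the same kmeans( ) fits as the ratio of the between-cluster sum of squares to the total sum of squares. The sketch below illustrates this on simulated data with three groups (an assumption made for illustration, not the wholesale data).

```r
# Simulated, standardized data with three groups of 30 points (illustration only)
set.seed(42)
toy <- scale(rbind(matrix(rnorm(60, mean = 0), ncol = 2),
                   matrix(rnorm(60, mean = 4), ncol = 2),
                   matrix(rnorm(60, mean = 8), ncol = 2)))

# Fit K-means for K = 1 to 6
fits <- lapply(1:6, function(k) kmeans(toy, centers = k, nstart = 25))

# Percentage of variance explained: between-cluster SS / total SS
pct_explained <- sapply(fits, function(f) 100 * f$betweenss / f$totss)

# Total within-cluster SS, the quantity plotted in the elbow chart
withss_toy <- sapply(fits, function(f) f$tot.withinss)

round(pct_explained, 1)
round(withss_toy, 1)
```

Since the total sum of squares equals the within-cluster plus the between-cluster sum of squares, the two curves carry the same information: one rises while the other falls, and the elbow appears at the same K.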
In Figure 1, we observe elbows at 2 and 5 clusters. On analysing the solutions for K from 2 to 5, we find that 3 clusters give the most useful customer segmentation. The 5-cluster solution gives a more detailed segmentation, but at this stage we’ll look at the 3-cluster solution. We therefore run the cluster analysis for 3 clusters using kmeans( ) in R:
clust_output <- kmeans(data1, centers = 3)
Step 5: Analyze the cluster analysis output.
There are 3 clusters of sizes 49, 347 and 44 respectively. The cluster centers give us insight into what characterizes each cluster. Because kmeans( ) starts from randomly chosen centers, the cluster numbering may differ from run to run unless a seed is set with set.seed( ).
Cluster 1 has the highest spenders on Fresh, Frozen and Delicatessen products. This cluster consists of consumers who spend more on fine foods and are high spenders overall.
Cluster 2 has low spenders across all products.
Cluster 3 has the highest spenders on Milk, Grocery, and Detergents and Paper products. This cluster consists of consumers who spend mainly on domestic and household products.
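The cluster sizes and centers discussed above are stored in the object returned by kmeans( ). The sketch below shows how they might be inspected, using simulated data with two made-up customer groups (an assumption for illustration; raw and fit are hypothetical names).

```r
# Simulated spending data for two made-up customer groups (illustration only)
set.seed(7)
raw <- data.frame(FRESH = c(rnorm(20, 3000, 300), rnorm(20, 15000, 1500)),
                  MILK  = c(rnorm(20, 9000, 900), rnorm(20, 2000, 200)))

fit <- kmeans(scale(raw), centers = 2, nstart = 25)

fit$size      # number of observations in each cluster
fit$centers   # cluster centers in standardized (z-score) units

# Cluster means on the original scale are often easier to interpret
cluster_means <- aggregate(raw, by = list(cluster = fit$cluster), FUN = mean)
cluster_means
```

A positive value in fit$centers means the cluster spends more than average on that product and a negative value means it spends less, which is how descriptions such as "highest spenders on Fresh" are read off the output.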