Hierarchical Clustering Techniques using R

The idea behind hierarchical cluster analysis is to show which of a (potentially large) set of samples are most similar to one another, and to group these similar samples on the same limb of a tree.

We have already covered k-means clustering using R and the types of clustering analysis.
Each sample can be thought of as sitting in an m-dimensional space defined by the m variables (columns) in the data frame. We define similarity on the basis of the distance between two samples in this m-dimensional space.

Several different distance measures could be used, but the default is Euclidean distance, and this is used to work out the distance from every sample to every other sample. For the other options, check

?dist

This quantitative dissimilarity structure of the data is stored in a distance matrix produced by the dist function.
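To make the distance idea concrete, here is a minimal sketch that computes one Euclidean distance by hand and compares it with dist. The three rows are copied from the country table used later in this article; restricting them to the gdp and inflation columns is an assumption made only to keep the example small.

```r
# Three rows from the country table (gdp and inflation columns only)
x <- data.frame(
  gdp       = c(41600, 37589, 13456),
  inflation = c(3.5, 3.5, 2.6),
  row.names = c("Austria", "Belgium", "Bulgaria")
)

# Euclidean distance between Austria and Belgium, computed by hand:
# square root of the sum of squared differences across the columns
diffs  <- unlist(x["Austria", ]) - unlist(x["Belgium", ])
manual <- sqrt(sum(diffs^2))

# dist() computes the same quantity for every pair of rows
d <- dist(x)   # method = "euclidean" is the default
print(d)
print(manual)
```

The value printed for the Austria/Belgium pair in d should match manual.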
Initially, each sample is assigned to its own cluster, and then the hclust algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.
For more details about the hclust function, check

?hclust
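The iterative joining described above can be inspected on a toy example. This is a small sketch with four made-up one-dimensional points (the values and the names a to d are illustrative, not taken from the country data); the merge and height components of the result record which clusters were joined at each step and at what distance.

```r
# Four illustrative points on a line
pts <- data.frame(value = c(1, 2, 6, 10),
                  row.names = c("a", "b", "c", "d"))

hc <- hclust(dist(pts))   # complete linkage is the default method

# One row per merge step: negative numbers refer to single samples,
# positive numbers refer to clusters formed at an earlier step
print(hc$merge)

# Distance at which each merge happened, in increasing order
print(hc$height)
```

With n samples there are always n - 1 merge steps, ending in a single cluster.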

In this example, we will group countries on the basis of their similarity so that decision making becomes easier. To find the similarities between observations and group the data, we need to perform cluster analysis.

country           area     gdp  inflation  life expect  military  pop growth  unemployment
Austria          83871   41600        3.5           79       0.8         0.5             4
Belgium          95326   37589        3.5           78       1.3         0.4             2
Bulgaria         56356   13456        2.6           78       2.3         0.3             3
Croatia          73569   18000        4.5           79       1.5         0.2             5
Czech Republic   43568   27156          4           78       1.6          -2             1
Denmark         338155   37256          2           56         4           2           1.5
Estonia         152632   20156          3           78         2         1.9             4
Germany         132562   36252        4.9           74         2         1.8             3
Hungary          93265   38265        5.9           69       3.1         1.5           3.5
Iceland         100000   25655        1.5           65         4         1.2           3.6
Italy            70125   19654        2.8           86         2        -0.8           2.5
Latvia          302325   38569        3.6           72       1.2         1.9             4
Lithuania        64523   40256        5.6           88       1.3        -1.5          4.01
Luxembourg       65235   32565        4.5           98       1.5         1.6           1.8
Netherlands      41256   12568        2.6           67       1.4         0.6           2.5
Norway          326598   19568        7.2           73      1.69         0.3          1.23
Portugal        312654   18652       1.53           74       2.6        -1.2           1.6
Slovakia         92356   45895       0.26           72       3.1         0.6             5
Slovenia         49265  123654       2.25           75       1.5         0.5             6
Spain            20125   26651       23.5         76.5         2         0.5           4.2
Sweden          502354   21561       26.2         86.3       1.9        -0.2         2.356
Switzerland     495632  125465         56         56.9       1.8       0.003           1.8

 

In this example, we will use hierarchical cluster analysis to group the countries. Cluster analysis also lets us summarise the data by grouping similar observations into clusters. Observations are considered similar when they have similar values across the variables, i.e. the smaller the Euclidean distance between two observations, the earlier they are grouped together. We can perform cluster analysis with the dist and hclust functions.

dist function: calculates a distance matrix for the provided values, using Euclidean distance by default. Hierarchical clustering can then be derived from the calculated distances; to perform this we use the hclust function.

The hclust function has a method argument that specifies how the clustering is to be done. The available methods include average, Ward (ward.D and ward.D2), single, median, complete and centroid, with complete linkage being the default.
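As a rough illustration of how the method argument changes the result, the sketch below clusters the same distance matrix with three linkage methods and compares the height of the final merge. The four points are made up for illustration and are not part of the country data.

```r
# Four illustrative points on a line
pts <- data.frame(value = c(1, 2, 6, 10),
                  row.names = c("a", "b", "c", "d"))
d <- dist(pts)

hc_complete <- hclust(d)                      # default: method = "complete"
hc_single   <- hclust(d, method = "single")   # nearest-neighbour linkage
hc_average  <- hclust(d, method = "average")  # mean pairwise distance

# Height of the last merge under each linkage rule
final_heights <- c(
  complete = max(hc_complete$height),
  single   = max(hc_single$height),
  average  = max(hc_average$height)
)
print(final_heights)
```

Complete linkage measures the distance between two clusters by their farthest pair of members, single linkage by their closest pair, and average linkage by the mean over all pairs, which is why the final heights differ.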

Steps to perform hierarchical clustering in R
Step 1: First, we load the dataset into the R workspace and save it in a variable named survey
survey <- read.csv("survey.csv", header = TRUE)

Step 2: The syntax to perform hierarchical clustering is hclust of dist of the dataset
surveyclust <- hclust(dist(survey[-1]))
The hclust result is saved in the variable surveyclust. The -1 drops the first column, i.e. the country name, since it is a label rather than a numeric variable and cannot be used to compute distances.
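Steps 1 and 2 can be tried without the CSV file. The sketch below builds a small stand-in for survey from five rows of the table above; the column names (life.expect, pop.growth and so on) are assumptions about how the file's header would be read. The last line is an optional variation that is not part of the original steps: it standardises the columns with scale so that large-valued variables such as area do not dominate the distances.

```r
# Stand-in for read.csv("survey.csv"): five rows copied from the table
survey <- data.frame(
  country      = c("Austria", "Belgium", "Bulgaria", "Norway", "Portugal"),
  area         = c(83871, 95326, 56356, 326598, 312654),
  gdp          = c(41600, 37589, 13456, 19568, 18652),
  inflation    = c(3.5, 3.5, 2.6, 7.2, 1.53),
  life.expect  = c(79, 78, 78, 73, 74),
  military     = c(0.8, 1.3, 2.3, 1.69, 2.6),
  pop.growth   = c(0.5, 0.4, 0.3, 0.3, -1.2),
  unemployment = c(4, 2, 3, 1.23, 1.6)
)

# survey[-1] keeps every column except the first (the country name)
numeric_part <- survey[-1]
surveyclust  <- hclust(dist(numeric_part))

# Optional variation (an assumption, not in the original steps):
# standardise each column before computing distances
surveyclust_scaled <- hclust(dist(scale(numeric_part)))
```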

Step 3: Plot the dendrogram
plot(surveyclust)
The clustering result variable is passed to plot to draw the dendrogram.
The numbers you see on the dendrogram are the row numbers of the countries in the table.
Countries are placed on the tree according to their similarity.
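If country names are easier to read than row numbers, the labels argument of plot can supply them. This is a minimal sketch on a reduced stand-in for survey (five rows and two columns copied from the table; the reduction is only to keep the example short).

```r
# Reduced stand-in for the survey data
survey <- data.frame(
  country = c("Austria", "Belgium", "Bulgaria", "Norway", "Portugal"),
  area    = c(83871, 95326, 56356, 326598, 312654),
  gdp     = c(41600, 37589, 13456, 19568, 18652)
)

surveyclust <- hclust(dist(survey[-1]))

# Use the country column as leaf labels instead of row numbers
plot(surveyclust, labels = survey$country,
     main = "Country dendrogram", xlab = "", sub = "")
```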

Step 4: We can also mark clusters on the dendrogram using rect.hclust, which takes the clustering result and the number of clusters
rect.hclust(surveyclust, 5)
The dendrogram will now show 5 clusters, each outlined by a coloured rectangle (red by default).
rect.hclust(surveyclust, 4)
The dendrogram will now show 4 clusters.
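rect.hclust only draws on the plot. To get the cluster membership as data, cutree can be used on the same clustering result. The sketch below uses a reduced stand-in for survey (five rows and two columns copied from the table) and k = 2, both chosen only to keep the example small.

```r
# Reduced stand-in for the survey data
survey <- data.frame(
  country = c("Austria", "Belgium", "Bulgaria", "Norway", "Portugal"),
  area    = c(83871, 95326, 56356, 326598, 312654),
  gdp     = c(41600, 37589, 13456, 19568, 18652)
)

surveyclust <- hclust(dist(survey[-1]))

# Draw the tree, then outline two clusters on it
plot(surveyclust, labels = survey$country)
rect.hclust(surveyclust, k = 2)

# Cluster number for each row, in the same order as the data frame
groups <- cutree(surveyclust, k = 2)

# List the countries that fall into each cluster
print(split(survey$country, groups))
```

The rectangles and the cutree groups correspond to the same cut of the tree, so the two views should agree.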
