# Hierarchical Clustering Using Python and Scipy

**Table of contents**

- Introduction to Hierarchical Clustering Using Python
- Types of hierarchical clustering
- Pseudocode
- Linkage measures
- Space and time complexity
- Implementing hierarchical clustering in Python
- Advantages and disadvantages
- Applications

**Introduction**

A cluster is also known as a class or a group. Clustering is the process of grouping similar objects into one cluster. It is a type of multivariate statistical analysis, also known as cluster analysis or unsupervised classification analysis, and it is based on a mathematical formulation of a measure of similarity.

**Hierarchical clustering** takes the idea of clustering a step further and imposes an ordering on the clusters themselves. If you think about the file arrangement on your personal computer, you will see that it is also a hierarchy. Your hard disk is divided into various drives. Each drive contains various folders, and opening a folder reveals more folders, down to some final level. Hierarchical clustering uses the same approach, with clusters in place of folders.

**Types of Hierarchical Clustering**

**Agglomerative clustering** (bottom-up approach): each sample starts as its own cluster; the pair of clusters with the smallest distance between them (e.g. Euclidean distance) is then successively merged (or *agglomerated*), until all samples have been merged into a single cluster.

**Divisive clustering** (top-down approach): a single cluster containing all the samples is recursively partitioned into the two least similar clusters, until there is one cluster for each observation. Divisive clustering can produce more accurate hierarchies than agglomerative clustering in some circumstances, but it is conceptually more complex. The divisive algorithm is exactly the reverse of the agglomerative one, so let’s discuss an agglomerative clustering example in detail.

Let’s try to understand it with an example. Say there are 6 samples, and you need to cluster them based on some parameter such as Euclidean distance. As you can see, points **b** and **c** can be grouped into one cluster, and points **d** and **e** into another, while points **a** and **f** remain as individual clusters. Hence you can group them as shown in the following figures.

By looking at the dendrogram, you can choose either 2 or 3 clusters depending on the threshold value. But generally we choose the midpoint of the longest branch as the threshold, and hence we have 3 clusters here.

**Pseudocode for Hierarchical Clustering**

1. Compute the distance matrix between all input data points.
2. Treat each data point as its own cluster.
3. Merge the two closest clusters.
4. Update the distance matrix.
5. Repeat steps 3–4 until a single cluster remains.

**Linkage measures**

There exist several methods to measure the distance between two clusters.

**Single link distance:** the minimum distance between any two points, one from each cluster.

**Example:** the distance between clusters “r” and “s” to the left is equal to the length of the arrow between their two closest points.

**Advantages and disadvantages:**

- Can handle non-elliptical shapes.
- Produces long, elongated clusters.
- Sensitive to outliers and noise.

**Complete link distance:** the maximum distance between any two points, one from each cluster.

**Example:** the distance between clusters “r” and “s” to the left is equal to the length of the arrow between their two furthest points.

**Advantages and disadvantages:**

- Produces more balanced clusters (with similar diameters).
- Less susceptible to noise.
- Often breaks very large clusters.
- Small clusters tend to be merged into large ones.

**Average link distance:** the average distance from each point in one cluster to every point in the other cluster.

**Example:** the distance between clusters “r” and “s” to the left is equal to the average length of the arrows connecting the points of one cluster to the points of the other.

**Advantages and disadvantages:**

- Less susceptible to noise and outliers.
- Biased towards globular clusters.

**Centroid distance:** the distance between the centroid r_{i} of cluster C_{i} and the centroid r_{j} of cluster C_{j}.

**Ward’s distance:** Ward’s distance between clusters C_{i} and C_{j} is the difference between the within-cluster sum of squares of the merged cluster C_{ij} and the total within-cluster sum of squares of the two clusters taken separately, where r_{i}, r_{j}, r_{ij} are the centroids of C_{i}, C_{j}, C_{ij}.

There are also multiple underlying metrics for measuring the distance between two points, such as squared Euclidean distance, Manhattan distance, Mahalanobis distance, etc.

**Comparison:** if you study the following figure carefully, you will find that different distance measures lead to the formation of different clusters.
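The effect is easy to see numerically as well. A minimal sketch (using my own synthetic data; the variable names are illustrative) runs SciPy's `linkage` on the same points with different `method` values and prints the height of the final merge, which differs per linkage measure:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
# two well-separated blobs of 10 two-dimensional points each
data = np.vstack([rng.normal(0, 1, (10, 2)),
                  rng.normal(8, 1, (10, 2))])

# Z[-1, 2] is the height (distance) of the final merge in the dendrogram
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(data, method=method)
    print(f"{method:>8}: final merge height = {Z[-1, 2]:.2f}")
```

Single link reports the smallest gap between the two blobs and complete link the largest; the other measures fall in between (Ward's heights are on their own variance-based scale).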

**Space and Time complexity**

**Space:** O(n²) space is required for storing the distance matrix.

**Time:** O(n³) in most cases:

- there are n merge steps, and at each step a distance matrix of size n² must be searched and updated;
- the time complexity can be reduced to O(n² log n) by using appropriate data structures.

**Implementing Hierarchical clustering in Python**

```python
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np
```

```python
# generate two clusters: x with 10 points and y with 20
np.random.seed(1234)
x = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
y = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[20,])
X = np.concatenate((x, y))
print(X.shape)  # (30, 2): 30 samples with 2 dimensions

plt.scatter(X[:, 0], X[:, 1])
plt.show()
```

```python
# generate the linkage matrix using Ward's method
Z = linkage(X, 'ward')
# print(Z)
```

```python
# cophenetic correlation coefficient: how faithfully the dendrogram
# preserves the original pairwise distances (the closer to 1, the better)
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

c, coph_dists = cophenet(Z, pdist(X))
# print(c)
```

```python
plt.figure(figsize=(25, 13))
plt.title('HCA Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z, leaf_rotation=90, leaf_font_size=12)
plt.show()
```
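The code above stops at drawing the dendrogram. To actually assign samples to clusters, you can cut the tree with SciPy's `fcluster`; the sketch below rebuilds the same `X` and `Z` as above and requests exactly 2 flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# same data and linkage matrix as in the code above
np.random.seed(1234)
x = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
y = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[20,])
X = np.concatenate((x, y))
Z = linkage(X, 'ward')

# cut the dendrogram so that exactly 2 flat clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)                    # cluster id (1 or 2) for each of the 30 samples
print(np.bincount(labels)[1:])   # sizes of the two recovered clusters
```

Alternatively, `criterion='distance'` with a distance threshold `t` reproduces the "cut at a threshold" reading of the dendrogram described earlier.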

**Advantages and Disadvantages of Hierarchical Clustering**

**Advantages: **

- No prior information about the number of clusters is required.
- Easy to implement, and often gives good results.

**Disadvantages: **

- A minimum time complexity of O(n² log n) is required (where n is the number of samples).
- It can never undo a previous merge.
- Difficulty in breaking large clusters.
- Sensitive to noise and outliers.
- In case of a large number of clusters, it is difficult to identify them in the dendrogram.

**Applications**

- Recommendation engines
- Social media analysis
- Search result grouping
- Image segmentation