Hierarchical Clustering (Agglomerative)

Rahull Trehan
5 min read · Jun 3, 2021

Hierarchical Clustering is one of the most extensively used clustering techniques in unsupervised machine learning.

It works a bit differently from K-Means Clustering; we will explore the core concept behind it and the key differences between the various types of hierarchical clustering.

Clustering is a technique used in unsupervised machine learning, where we do not have labels (or deliberately ignore them) and simply try to group the points to make sense of the data or to understand the behavior of each cluster. You will often come across a real business case where you are asked to segment the data, for example into customer segments. Since we do not know in advance how two segments behave differently based on the variables at hand, we can apply various clustering techniques to create these segmentations.

Hierarchical clustering, as the name suggests, builds a hierarchy of clusters. It lends itself to very informative visualizations (dendrograms) of how the clusters merge, and you can always cut the hierarchy depending on how many clusters you intend to form from your dataset in the end.
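For instance, the hierarchy can be drawn as a dendrogram with a few lines of SciPy. The data below is a small made-up set, purely to show the mechanics:

```python
# A minimal dendrogram sketch on a small, made-up 2-D dataset.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.2],
              [5.1, 4.8], [9.0, 1.0], [8.8, 1.3]])

Z = linkage(X, method="ward")  # merge history of the points
dendrogram(Z)                  # draw the hierarchy as a tree
plt.title("Dendrogram (Ward linkage)")
plt.show()
```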

Let's jump into the types of hierarchical clustering. There are two main types:

  1. Agglomerative (bottom-up)
  2. Divisive (top-down)

In this article, we are going to discuss agglomerative clustering, also known as AGNES (Agglomerative Nesting), which follows a bottom-up approach. It starts with the idea that each point in the dataset is a cluster of its own, and the algorithm then keeps merging these clusters to form bigger ones (bottom-up).

There are four common ways this bottom-up merging can happen (we call them linkages). The only thing that differs between them is how we measure the distance between clusters; at each step, the pair of clusters with the minimum linkage distance is merged into a new cluster.

  • Single Link — looks at the shortest distance between any point in one cluster and any point in the other
  • Complete Link — looks at the longest distance between any point in one cluster and any point in the other
  • Average Link — looks at the average distance from each point in one cluster to every point in the other
  • Ward’s Method — looks at the minimum increase in variance caused by merging the two clusters; it is the default linkage if you run the AgglomerativeClustering algorithm in Scikit Learn (see the snippet below)
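As a quick illustration, here is a minimal scikit-learn sketch on made-up data; the linkage argument is where the choice above plugs in, with "ward" being the default:

```python
# Minimal scikit-learn sketch on made-up data; linkage can be
# "ward" (default), "complete", "average" or "single".
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.2],
              [5.1, 4.8], [9.0, 1.0], [8.8, 1.3]])

model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)  # cluster label assigned to each point
print(labels)
```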

It might seem a little confusing how different linkages can end up producing different clusters, so let's visualize each of them for a better understanding of the topic.

Single Linkage

Figure: Single Linkage — the closest distance between a point in one cluster and a point in the other

Single Linkage is the simplest of the linkages. It starts with the base rule that each point in the dataset is its own cluster. It then looks for the two closest points and merges them into a bigger cluster; in the second box of the diagram above, points 6 & 5 form the first cluster. Since this is an iterative process, it keeps looking for the next two closest points (or clusters) and merging them.

The important thing to observe here is how the choice of linkage decides which clusters merge. In the bottom-right box of the diagram above, we already have 4 clusters, and the question is how these clusters would merge further to form new ones.

Since in Single Linkage we look for the shortest distance, and the shortest distance between Cluster B and Cluster C (determined by the distance between points 1 and 3) is smaller than that between Cluster A and Cluster C (determined by the distance between points 4 and 6), Cluster C forms the new cluster with Cluster B.
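To make the decision rule concrete, here is a small NumPy/SciPy sketch; the coordinates are purely illustrative, since the exact values behind the figure are not given:

```python
# Single linkage: the distance between two clusters is the shortest
# pairwise distance. The coordinates below are made up for illustration.
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[4.0, 1.0], [4.5, 0.5], [5.0, 1.2]])
cluster_b = np.array([[1.0, 4.0], [1.5, 5.0]])
cluster_c = np.array([[2.2, 3.6]])

d_bc = cdist(cluster_b, cluster_c).min()  # shortest distance B-C
d_ac = cdist(cluster_a, cluster_c).min()  # shortest distance A-C

# Cluster C merges with whichever cluster gives the smaller value
print("B-C:", d_bc, "A-C:", d_ac)
```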

Complete Linkage

Figure: Complete Linkage — the farthest distance between a point in one cluster and a point in the other

It is very similar to Single Linkage; the only difference is that this linkage looks at the farthest distance between a point in one cluster and a point in the other, and the pair of clusters for which this farthest-point distance is minimum is merged into a new cluster.

Compare the distance between the two farthest points of Cluster B and Cluster C (the distance between points 2 & 4) with the distance between the two farthest points of Cluster C and Cluster A (the distance between points 4 & 5): since the distance between 4 & 5 is smaller than that between 2 & 4, Cluster C forms a new cluster with Cluster A.
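Put formally (these are the standard definitions, not values taken from the figure), the two criteria compared above are:

d_{\text{complete}}(A, B) = \max_{a \in A,\, b \in B} \lVert a - b \rVert \qquad \text{vs.} \qquad d_{\text{single}}(A, B) = \min_{a \in A,\, b \in B} \lVert a - b \rVert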

Average Linkage

Figure: Average Linkage — the average distance from each point in one cluster to every point in the other

Average Linkage again works very similarly to Single and Complete Linkage, but it calculates the average distance between all pairs of points across the two clusters, and the pair of clusters for which this average distance is minimum is merged into a new cluster.

The average distance between Cluster B & Cluster C, and likewise between Cluster C & Cluster A, is obtained by averaging every pairwise distance between the points of the two clusters. In its general form, for two clusters X and Y:
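d_{\text{average}}(X, Y) = \frac{1}{|X|\,|Y|} \sum_{x_i \in X} \sum_{y_j \in Y} \lVert x_i - y_j \rVert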

And since the average distance between Cluster B & Cluster C is smaller than that between Cluster C & Cluster A, Cluster C forms the new cluster with Cluster B.

Ward’s Method

Figure: Ward’s Method — the increase in variance when two clusters are merged

This is the most complex of the linkages discussed so far. In this method, the first step is to find the center point of the two clusters taken together and then to calculate the variance that merging them would cause. The aim is to find the pair of clusters for which this increase in variance is minimum and merge them into one cluster. In the diagram above, I have purposely zoomed in on Cluster A & Cluster D for better understanding: we calculate the merging cost by first finding the combined center point ‘x’, summing the squared distances of all the points from ‘x’, and then subtracting the within-cluster variance, which is obtained by summing the squared distances of each cluster’s points from that cluster’s own center.

In our case (Cluster A and Cluster D), a standard way to write this merging cost is:
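\Delta(A, D) = \sum_{i \in A \cup D} \lVert p_i - x \rVert^2 \;-\; \sum_{i \in A} \lVert p_i - c_A \rVert^2 \;-\; \sum_{i \in D} \lVert p_i - c_D \rVert^2

where x is the center of the merged cluster (as in the figure), and c_A and c_D are the centers of Cluster A and Cluster D.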

So wherever this delta (∆) is minimum, those two clusters get merged to form a new cluster.

I hope you now have a better understanding of Agglomerative Hierarchical Clustering. Thank you for reading.
