
I know that for hierarchical clustering it is best practice to scale the variables first so that each variable gets the same weight. Otherwise, with complete linkage, the variable with the wider range would be the dominant factor when determining which clusters to combine.

My question is: for single linkage without scaling, would the variable with the wider range be the dominant factor when determining which clusters to combine, as it is for complete linkage? Thanks in advance.


1 Answer

Brief Summary

Yes, a wider-range variable would also dominate single linkage clustering without scaling.

Explanation

The tendency of wider-range variables to dominate the clustering does not only apply to hierarchical clustering, but to many clustering methods.

The reason for this lies beneath the clustering: most (if not all) clustering algorithms are based on a distance metric. If not otherwise specified, the Euclidean distance is typically used, and this metric is dominated by the wide-range variables. Hence, clustering algorithms that rely on such a metric are dominated by the wider-range variables as well.
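To make this concrete, here is a minimal sketch (my own illustration, not from the original post; the numbers are made up) of how the wide-range variable contributes essentially all of the Euclidean distance:

```python
import numpy as np

# Two points: feature 0 varies on a scale of ~1, feature 1 on a scale of ~1000.
a = np.array([0.1, 100.0])
b = np.array([0.9, 900.0])

# Per-feature contribution to the squared distance: [0.64, 640000.0]
print((a - b) ** 2)

# Euclidean distance: sqrt(0.64 + 640000) ≈ 800.0004,
# i.e. almost entirely determined by the wide-range feature.
print(np.linalg.norm(a - b))
```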

Normalizing is the easiest way to handle this problem (if it is a problem). Using a different metric would be another way; e.g., the Mahalanobis distance does a kind of normalization by itself. Another approach would be a custom metric that uses some domain knowledge.
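As a sketch of the two remedies just mentioned (standardizing the features, or switching to the Mahalanobis distance), assuming scikit-learn and SciPy are available:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1, 200),     # narrow-range feature
    rng.normal(0, 1000, 200),  # wide-range feature
])

# Remedy 1: standardize, then cluster with the usual Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

# Remedy 2: keep the raw data, but use the Mahalanobis distance,
# which rescales differences by the inverse covariance matrix.
VI = np.linalg.inv(np.cov(X, rowvar=False))
print(mahalanobis(X[0], X[1], VI))
```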

Example

To demonstrate this, I created an example dataset with

  • a wide-range y-axis and a small-range x-axis (left column)
  • normalized features (right column)

And clustered it with

  • complete linkage (top row)
  • single linkage (bottom row)

[Figure: scatter plots of the four resulting clusterings]

As you can see, the non-normalized data (left column) is clustered almost exclusively by the y-value.

You can also see that complete linkage prefers compact clusters (top row, both columns), while single linkage avoids bigger gaps (bottom row, right image).
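For reference, a rough reconstruction of this kind of experiment (not the exact code behind the figure; the data generation is made up) with scikit-learn could look like this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)
# Two groups separated along x; x has a small range, y a wide range.
X = np.vstack([
    np.column_stack([rng.normal(0, 0.3, 100), rng.normal(0, 100, 100)]),
    np.column_stack([rng.normal(3, 0.3, 100), rng.normal(0, 100, 100)]),
])

for name, data in [("raw", X), ("scaled", StandardScaler().fit_transform(X))]:
    for linkage in ("complete", "single"):
        labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(data)
        # Plotting `data` colored by `labels` gives one panel per combination;
        # here we just print the cluster sizes.
        print(name, linkage, np.bincount(labels))
```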

  • Hi Broele, I just have a question. You said at the end that complete linkage prefers compact clusters (top row, both columns), while single linkage avoids bigger gaps (bottom row, right image). Can you please explain why? – user154385 Sep 12 '23 at 14:37
  • Single linkage merges, in each step, the two clusters that have the closest pair of points (i.e., the smallest gap). It does not matter how long-stretched the clusters are; it just looks at the gaps between the clusters. That leads to the long-stretched blue and orange clusters on the bottom right. – Broele Sep 12 '23 at 17:41
  • Complete linkage merges the two clusters that lead to the smallest diameter (the diameter is the maximal distance between two points of the cluster), i.e., the resulting cluster should fit into a circle that is as small as possible. That probably leads to the odd outcome in the top-right image: the right blue part would create a slightly bigger diameter if merged with the green part instead of with the left blue part. – Broele Sep 12 '23 at 17:45
  • Thank you, Broele. I still don't get why complete linkage prefers compact clusters. The criterion looks at the two observations in the two clusters that are farthest apart, and those clusters are the ones that get merged. Then, shouldn't we say it produces "loose clusters", which is the opposite of compact clusters? – user154385 Sep 12 '23 at 18:46
  • Basically, single linkage only considers the nearest two points between two clusters, and complete linkage the two points that are furthest away. This means that single linkage can build long lines, stars, filaments, ... as long as all points are close to their neighbors. For complete linkage the furthest distances within clusters are relevant, and the clustering tries to keep these small. If the furthest distance is small, all distances between the points in one cluster are small. This makes the cluster somewhat compact. – Broele Sep 12 '23 at 19:08
  • I think I understand it now. – user154385 Sep 12 '23 at 22:06
  • Can you please verify this? Complete linkage tends to produce more compact clusters.

    I guess it has something to do with not having a snowball effect, unlike single linkage, where one big cluster eats up the little clusters that have only one or a few observations. Single linkage therefore produces loose clusters, or one giant cluster, through this snowball effect.

    – user154385 Sep 13 '23 at 03:21
  • Complete linkage does not have a snowball effect, so it does not end up with a few big clusters that eat up all the other small clusters or observations. Therefore, we are likely to see more compact clusters than with single linkage. Thank you for your help! – user154385 Sep 13 '23 at 03:22
  • I would not expect single linkage just to grow clusters, but also to merge clusters of equal size. Suggestion: create a separate question for this and I will be happy to answer it and produce images to demonstrate what is going on. – Broele Sep 13 '23 at 09:27
  • Then I just need to understand these two points. Regarding complete linkage, we say the two clusters that are going to be merged tend to make a new compact cluster if the distance between the two farthest observations in those two clusters is small. But if the distance between the two farthest observations in the two clusters is really big, then we cannot say that complete linkage will lead to a compact linkage. Long story short, we have to add the condition that the two farthest observations are relatively close. Is that correct? – user154385 Sep 13 '23 at 20:37
  • I should say "complete linkage will lead to a compact cluster" instead of "complete linkage will lead to a compact linkage" – user154385 Sep 13 '23 at 20:47
  • Keep in mind that hierarchical clustering always merges the two clusters with the best metric value. That means complete linkage creates, in a greedy way, clusters that are as compact as possible. Still, depending on the data, that does not need to mean much: if all points are aligned in a straight line, then there is not much complete linkage can do. – Broele Sep 14 '23 at 17:59
  • Pro tip: you can run hierarchical clustering on the same data with different numbers of clusters K = 2, 3, 4, .... Each reduction of the number of clusters by one corresponds to one merging step. This would show you, for selected examples, which clusters are merged. – Broele Sep 14 '23 at 18:10
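A minimal sketch of this tip (my own addition, assuming scikit-learn; the data is a placeholder):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))  # placeholder data

# Going from K clusters to K-1 clusters corresponds to exactly one merge;
# comparing the label vectors shows which two clusters were merged.
for k in (4, 3, 2):
    labels = AgglomerativeClustering(n_clusters=k, linkage="complete").fit_predict(X)
    print(k, labels)
```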