
So I have these data points and I know the ground-truth labels of these points. I want to use hierarchical clustering on the dataset under the constraint that all of the points with the same label are clustered together. I know this somewhat defeats the purpose of clustering, where you apply a clustering method to a dataset and compare your results with the ground truth. However, I want to use hierarchical clustering to determine the relationship between these predefined clusters and see how they merge. Any tips on doing this? I tried looking into R and MATLAB to see if there are any packages that allow me to do this, but it doesn't look like there are.

  • Sorry, your question is not clear. In what way are you going to use the information about predefined classes (clusters) during the cluster analysis? And what specifically do you mean by "to determine the relationship between these predefined clusters"? – ttnphns Nov 27 '13 at 11:13
  • Sorry about that. When I use hierarchical clustering, I do not want to start by treating every data point as its own cluster. Instead, I want to start from clusters that are grouped together by their labels. Basically, I want to use the bottom-up hierarchical approach to merge these groupings, and this merging is the relationship that I want to see. – mtber75 Nov 27 '13 at 12:47
  • So, it is as if you have already clustered the objects into a few clusters and stopped. Now you want to "go on" from where you "stopped" and continue the agglomeration upward. Is that so? – ttnphns Nov 27 '13 at 13:52
  • Yes that is correct – mtber75 Nov 27 '13 at 15:58
  • Is there a technical term for this? – mtber75 Nov 27 '13 at 17:20
  • Well, your question is very good indeed. This is actually a case of constrained hierarchical clustering. Right now I don't think I'm ready to answer it. I must meditate on the agglomerative algorithm in order to say whether I think you can find some workaround and use a standard clustering routine to get what you want. – ttnphns Nov 27 '13 at 17:31
  • A quick idea is to set all within-class distances (a "class" being the points with the same label) in the distance matrix to 0, so that they will never be separated. But this idea may be incompatible with some hierarchical methods; one should think it through. – ttnphns Nov 27 '13 at 18:17
  • My opinion is that the zeroing trick (workaround) just above may be used with "non-geometric" hierarchical methods only: single, complete, and average linkage. Zeroing makes the distance matrix non-Euclidean, and these methods are valid for non-Euclidean measures. Also, among the average-linkage methods, the so-called within-group average method (found, for example, in SPSS) should not be used with zeroing. – ttnphns Nov 28 '13 at 07:24
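
To make the zeroing trick from these comments concrete, here is a minimal sketch in R using `stats::dist` and `stats::hclust`; the data matrix `X` and label vector `labels` are made-up toy inputs, not from the original post:

```r
set.seed(1)
X      <- matrix(rnorm(40), ncol = 2)      # 20 toy points, 2 features
labels <- rep(1:4, each = 5)               # hypothetical predefined classes

D <- as.matrix(dist(X))                    # Euclidean distance matrix
same_class <- outer(labels, labels, "==")  # TRUE where labels match
D[same_class] <- 0                         # force within-class distances to 0

# Per the comment above, stick to single, complete, or average linkage,
# since the zeroed matrix is no longer Euclidean.
hc <- hclust(as.dist(D), method = "average")
plot(hc, labels = labels)                  # same-label points merge at height 0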

1 Answer


This should be easily possible, as long as you have software that is flexible enough.

I'd try ELKI, and implement my own distance function:

$$ d(x,y)=\begin{cases} 0 & \text{iff }x\text{ and }y\text{ are in the same class} \\ \varepsilon+\text{Euclidean}(x,y) & \text{otherwise} \end{cases} $$

Then hierarchical clustering should first merge all objects of the same class, and then construct a hierarchy on top of the existing classes.
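
ELKI distance functions are written in Java, but the same distance can be sketched in R for illustration, reusing the toy `X` and `labels` from the earlier sketch; `eps` is an arbitrary small constant:

```r
eps <- 1e-6                          # small offset so cross-class distances > 0
D   <- as.matrix(dist(X)) + eps      # epsilon + Euclidean everywhere
D[outer(labels, labels, "==")] <- 0  # 0 iff x and y are in the same class

hc <- hclust(as.dist(D), method = "single")
plot(hc, labels = labels)  # classes merge at height 0, then form a hierarchy
```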

Note that hierarchical clustering scales quite badly: it is in $\mathcal{O}(n^3)$ when implemented naively (ELKI has an $\mathcal{O}(n^2)$ implementation of single linkage). You may actually get a substantial speedup by re-implementing hierarchical clustering yourself for your particular use case, starting at the point where the same-label clusters have already been formed.
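
As a rough sketch of that idea (not ELKI code, and again assuming the toy `X` and `labels`), the following R loop runs naive average-linkage agglomeration that starts from the label-defined clusters rather than from singletons, recording the merge order and heights:

```r
D <- as.matrix(dist(X))
clusters <- split(seq_along(labels), labels)   # start: one cluster per label

avg_link <- function(a, b) mean(D[a, b])       # average pairwise distance

merges <- list()
while (length(clusters) > 1) {
  best <- c(NA, NA); best_d <- Inf
  for (i in 1:(length(clusters) - 1)) {        # find the closest cluster pair
    for (j in (i + 1):length(clusters)) {
      d <- avg_link(clusters[[i]], clusters[[j]])
      if (d < best_d) { best_d <- d; best <- c(i, j) }
    }
  }
  merges[[length(merges) + 1]] <- list(pair = names(clusters)[best],
                                       height = best_d)
  clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
  names(clusters)[best[1]] <- paste(names(clusters)[best], collapse = "+")
  clusters <- clusters[-best[2]]
}
merges   # agglomeration order and merge heights over the predefined clusters
```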