9

This is a followup to this question. I am currently trying to implement the C-Index in order to find a near-optimal number of clusters from a hierarchy of clusters. I do this by calculating the C-Index for every step of the (agglomerative) hierarchical clustering. The problem is that the C-Index is minimal (0 to be exact) for very degenerated clusterings. Consider this:

$c = \frac{S-S_{min}}{S_{max}-S_{min}}$

In this case $S$ is the sum of all distances between pairs of observations in the same cluster over all clusters. Let $n$ be the number of these pairs. $S_{min}$ and $S_{max}$ are the sums of $n$ lowest/highest distances across all pairs of observations. In the first step of the hierarchical clustering, the two closest observations (minimal distance) are merged into a cluster. Let $d$ be the distance between these observations. Now there is one pair of observations in the same cluster, so $n=1$ (all other clusters are singletons). Consequently $S=d$. The problem is that $S_{min}$ also equals $d$, because $d$ is the smallest distance (that is why the observations where merged first). So for this case, the C-Index is always 0. It stays 0 as long as only singleton clusters are merged. This means the optimal clustering according the C-Index would always consist of a bunch of clusters containing two observations, and the rest singletons. Does this mean that the C-Index is not applicable to hierarchical clustering? Am I doing something wrong? I have searched a lot, but could not find any suitable explanation. Can someone refer me to some resource that is freely available on the internet? Or, if not, at least a book I may try to get at my universities library?

Thanks in advance!

  • Your observation is correct, but it is all fine with C-index. C-index is 0 when the observed clustering solution does not differ from the theoretically "ideal" best one under the given (observed) number of within-cluster distances. Consider a dataset which all consists of tight pairs of objects, and the pairs are quite far apart. Hierarchical clustering under virtually any linkage method will first - on initial steps - "collect" the objects into these pairs. And all that time C-index will remain 0. Later, the clustering will come to merge between the apart pairs: C-index will abrubtly worsen. – ttnphns May 08 '18 at 09:49
  • Algorithm to compute C-index is shown here https://stats.stackexchange.com/q/343878/3277. – ttnphns May 08 '18 at 09:50
  • P.S. Don't forget that C-Index is the lower (closer to 0) is the better! – ttnphns May 08 '18 at 10:39

1 Answers1

2

This may be one of the cases where there's more art than science to clustering. I would suggest that you let your clustering algorithm run for a short time before letting the C-Index calculations kick in. "Short time" may be after processing a few pairs, just when it starts to exceed 0, or some other heuristic. (After all you don't expect to stop at 1 or 2 clusters, otherwise a different separation algorithm may have been deployed.)

For a book recommendation, I can suggest:

You can scan/search the available contents on google books to see if it might meet your needs. It's worked as a reference for me in the past.

ars
  • 12,878
  • Oops, you're using agglomerative methods, so the "1 or 2 clusters" part doesn't make sense -- the "inverse" applies, you don't want n-1 or n-2 singletons, etc, i.e. letting clustering work for a bit before applying validity criteria shouldn't be problematic. – ars Sep 14 '10 at 01:29