How to determine the number of clusters when using correlation as the distance?

Question

How does using 1 - correlation as the distance influence the determination of the number of clusters when doing k-means? Is it still valid to use the classical indices (Dunn, Davies-Bouldin, ...)?

ttnphns · Accepted Answer · 2012-04-05T17:23:27.010

First. It is odd to use the $1-r$ distance with K-means clustering, which internally operates on Euclidean distance. You can easily turn $r$ into a true Euclidean distance $d$ with the formula that follows from the law of cosines: $d = \sqrt{2(1-r)}$.
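
As a small numerical check of the conversion above, the following sketch compares $\sqrt{2(1-r)}$ with the Euclidean distance computed directly between two profiles that have been centred and scaled to unit length. The profiles `x` and `y` are made-up numbers, and the helper names are hypothetical, not taken from any package mentioned in this thread.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corr_to_euclid(r):
    """Euclidean (chord) distance implied by correlation r: sqrt(2 * (1 - r))."""
    # Clamp at zero to absorb tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - r)))

def center_and_unit_scale(x):
    """Centre a profile and scale it to unit Euclidean length."""
    m = sum(x) / len(x)
    centred = [a - m for a in x]
    norm = math.sqrt(sum(a * a for a in centred))
    return [a / norm for a in centred]

# Hypothetical profiles (illustrative numbers only)
x = [1.0, 2.0, 4.0, 3.0, 5.0]
y = [2.0, 1.0, 3.0, 5.0, 4.0]

r = pearson_r(x, y)
d_formula = corr_to_euclid(r)
d_direct = math.dist(center_and_unit_scale(x), center_and_unit_scale(y))
print(r, d_formula, d_direct)
```

The two distances should agree up to floating-point error, which is why running K-means on centred, unit-length profiles is one way to cluster "by correlation" while staying within Euclidean geometry.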

Second. I wonder how you managed to feed a distance matrix into a K-means procedure. Does R allow it? (I don't use R, and the K-means programs I know require casewise data as input.) Note: it is possible to reconstruct raw casewise data from a Euclidean distance matrix, for example by classical multidimensional scaling.

Third. There is a great number of internal "clustering criteria" (over 100, I believe) that help decide which cluster solution is "better". They differ in their assumptions. Some (like the cophenetic correlation or the silhouette statistic) are very general and can be used with any distance or similarity measure. Some (like Calinski-Harabasz or Davies-Bouldin) assume Euclidean distance (or at least a metric one) in order to have a geometrically sensible meaning. I haven't heard of the Dunn index you mention.
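
To illustrate why a criterion such as the silhouette is "general", here is a minimal sketch that computes the average silhouette width from nothing but a precomputed dissimilarity matrix and a vector of cluster labels. The function name `mean_silhouette` and the toy $1-r$ matrix are assumptions made for illustration; in practice one would compare this value across the partitions obtained for different numbers of clusters.

```python
def mean_silhouette(dist, labels):
    """Average silhouette width from a square dissimilarity matrix and labels."""
    n = len(labels)
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s(i) = 0
        # a: mean dissimilarity to the other members of i's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

# Toy 1 - r dissimilarities between four objects (made-up values)
toy_dist = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]
good = mean_silhouette(toy_dist, [0, 0, 1, 1])  # groups match the structure
bad = mean_silhouette(toy_dist, [0, 1, 0, 1])   # groups cut across it
print(good, bad)
```

Nothing in the computation uses coordinates or centroids, only pairwise dissimilarities, which is what allows it to be paired with $1-r$ or any other measure.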

P.S. The Wikipedia page on the Dunn index suggests that this index is of the general type.

(+1) I can confirm that R's kmeans() expects raw data. Also, using a Euclidean metric with k-means clustering based on standardized values (z-scores) is equivalent to relying on Pearson's correlation distance. – chl, Apr 05 '12 at 16:48
Seems reasonable. When the variables are z-scores, the squared Euclidean distance is exactly $2(N-1)(1-r)$. – ttnphns, Apr 05 '12 at 16:58
Thanks for this! It really cleared things up for me. I used Kmeans from the amap package, which supports correlation as the distance parameter. – user680111, Apr 06 '12 at 00:31
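
The identity quoted in the comments can be derived in a few lines. The notation is an assumption made here for illustration: $z_{xi}$ and $z_{yi}$ are the two profiles after standardization with the sample standard deviation (denominator $N-1$), so that

$$\sum_{i=1}^{N} z_{xi}^2 = \sum_{i=1}^{N} z_{yi}^2 = N-1, \qquad r = \frac{1}{N-1}\sum_{i=1}^{N} z_{xi} z_{yi}.$$

Expanding the squared Euclidean distance then gives

$$d^2 = \sum_{i=1}^{N} (z_{xi} - z_{yi})^2 = \sum_{i=1}^{N} z_{xi}^2 + \sum_{i=1}^{N} z_{yi}^2 - 2\sum_{i=1}^{N} z_{xi} z_{yi} = 2(N-1) - 2(N-1)\,r = 2(N-1)(1-r).$$

If the standardization uses the population standard deviation (denominator $N$) instead, the same steps give $2N(1-r)$. Either way, for fixed $N$ the squared distance is proportional to $1-r$.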

score 0 · Answer 2 · answered Apr 05 '12 at 23:34

Note that k-means is designed for Euclidean distance. The mean may or may not be an appropriate estimator for the cluster center with other distances. So be careful when using other distance functions with k-means. Consider using a more modern clustering algorithm!

Has QUIT--Anony-Mousse

Could you please point me to a reference for more modern algorithms? So far I have tried SOM and k-means, but the problem is that I do not know the true number of clusters, so I resorted to k-means just because it is faster. (The data are time-series gene expression.) – user680111 Apr 06 '12 at 00:39
DBSCAN isn't the newest, but it clearly doesn't have any requirements on the distance function. Ideally, you should use an index for acceleration, but I'm not sure whether correlation distances can be indexed. Other than that, read up on the literature. There must be 100 clustering algorithms by now. – Has QUIT--Anony-Mousse Apr 06 '12 at 03:38
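
As a rough illustration of the point about DBSCAN, the sketch below runs a minimal, unaccelerated version of the algorithm directly on a precomputed dissimilarity matrix (here a toy $1-r$ matrix with made-up values). The function name and the parameter values `eps=0.3` and `min_pts=2` are assumptions chosen for the example; the number of clusters is not specified in advance, but the result does depend on those two parameters.

```python
def dbscan_precomputed(dist, eps, min_pts):
    """Minimal DBSCAN on a square dissimilarity matrix.

    Returns one label per object: 0, 1, ... for clusters and -1 for noise.
    Runs in O(n^2) time because no index structure is used.
    """
    n = len(dist)
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * n

    def neighbours(i):
        # eps-neighbourhood of i, including i itself
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:
                queue.extend(nb)  # j is a core point, so keep expanding
    return labels

# Toy 1 - r dissimilarities: two tight pairs and one isolated object
toy_corr_dist = [
    [0.00, 0.10, 0.90, 0.80, 0.70],
    [0.10, 0.00, 0.85, 0.90, 0.75],
    [0.90, 0.85, 0.00, 0.20, 0.80],
    [0.80, 0.90, 0.20, 0.00, 0.70],
    [0.70, 0.75, 0.80, 0.70, 0.00],
]
found = dbscan_precomputed(toy_corr_dist, eps=0.3, min_pts=2)
print(found)
```

Only pairwise dissimilarities are consulted, so no mean or centroid has to be meaningful under the chosen distance.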
