
I'm trying to find the optimal K-means clustering for a set of elements. For a particular $k$, repeated runs of K-means do not always converge to the same clustering because of the randomness in the initialization. This means that whatever internal performance statistic (a clustering criterion, such as Dunn's index) I'm using to choose $k$ will depend somewhat on when I run the algorithm, unless I fix a seed. To decide whether the clustering is "stable" for a particular $k$ (making it plausible that the performance statistic is representative), I calculate how much the resulting clusters for that $k$ overlap from run to run:

$$ \frac{1}{n(r-1)} \sum_{j=2}^r \sum_{i=1}^n \frac{|C_{j-1}^{(i)} \cap C_{j}^{(i)}|}{|C_{j-1}^{(i)}|}$$

where $n$ is the number of elements I'm clustering, $r$ is the number of K-means runs (with fixed $k$), and $C_j^{(i)}$ is the cluster from the $j$th run that contains the $i$th element.

However, this statistic is not from the literature, and I'd be curious to know how others have addressed this problem and if there are issues with my approach that haven't occurred to me. Thanks.
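For concreteness, here is a minimal sketch of the statistic above, assuming scikit-learn and NumPy (the function name and test data are my own, not from the question):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def overlap_stability(X, k, r=10, seed=None):
    """Average, over consecutive run pairs j-1, j and all n points i, the
    fraction of point i's cluster in run j-1 that is recovered in run j."""
    rng = np.random.RandomState(seed)
    labels = [KMeans(n_clusters=k, n_init=1,
                     random_state=rng.randint(2**31 - 1)).fit_predict(X)
              for _ in range(r)]
    n = X.shape[0]
    total = 0.0
    for j in range(1, r):
        prev, curr = labels[j - 1], labels[j]
        for i in range(n):
            c_prev = prev == prev[i]   # membership mask of C_{j-1}^{(i)}
            c_curr = curr == curr[i]   # membership mask of C_j^{(i)}
            total += np.sum(c_prev & c_curr) / np.sum(c_prev)
    return total / (n * (r - 1))

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
s = overlap_stability(X, k=3, r=5, seed=1)
print(s)  # close to 1 for well-separated blobs
```

The statistic lies in $[0, 1]$, with 1 meaning every run reproduced the same partition.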

  • I mean metrics such as the Dunn index, Beta CV, etc., that you can use to characterize the performance of K-means for some particular k. If a K-means run is heavily affected by the initial conditions, then I can't be sure that whatever statistic I get for a particular k is representative of all of the clusterings I would get for that k, were I to repeat the K-means runs with different initial conditions. So I want to answer the question: how dependent is the clustering on the initial conditions? – Matthew Plourde Mar 03 '16 at 19:22
  • @ttnphns I changed the word "metric" to "statistic". – Matthew Plourde Mar 03 '16 at 19:26
  • Well, there are two different topics here, better not to mix them: (1) choice of the best initial k centres so that K-means comes to the global optimum or very close to it; (2) choice of the true or best k so that the clusters are most valid from the point of view of a particular clustering criterion. – ttnphns Mar 03 '16 at 19:34
  • The only method I have for choosing the initial k centers is randomly selecting k of the elements. The resulting clusterings for my particular data set appear to be rather sensitive to this choice of initial centers. I want to quantify how sensitive they are to the initial centers with the statistic above, following my intuition that for the "right" set of element features and k, I should see K-means converge to nearly the same clusters regardless of the initial k centers. – Matthew Plourde Mar 03 '16 at 19:43
  • Note that K-means seeks to optimize one specific criterion: pooled within-cluster SS, not Dunn's index or similar. For every value of k in some range, we usually run K-means several times with different initial seeds for the k centres, then average the "corresponding", most similar final centres and use those k averaged centres as the initial input for one last run. (to cont.) – ttnphns Mar 03 '16 at 19:47
  • (cont.) This almost guarantees you a solution close to the optimum for that specific k. In the end, you compare the solutions for different k by whatever criterion you consider reasonable. Select the solution (that k and its final clustering) which is the best or most interpretable. – ttnphns Mar 03 '16 at 19:47
  • That really sounds like a much better approach. Thanks. – Matthew Plourde Mar 03 '16 at 19:50
  • I see some sense in your reasoning in the previous comment. However, instead of quantifying how sensitive it is, it is simpler and more effective just to do your best to get each solution close to optimal. Then, with those good results at hand, select the best k among them, as I described. – ttnphns Mar 03 '16 at 19:55
  • @ttnphns: My understanding is that there is not generally a natural one-to-one relationship between elements in two sets of points in $\mathbb{R}^n$, so how do you find the "corresponding" centres? – naught101 Mar 30 '17 at 03:39
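One way to find the "corresponding" centres ttnphns mentions — this is my own interpretation, not necessarily his procedure — is to solve an assignment problem on the pairwise centre distances (the Hungarian algorithm, via SciPy), then average the matched centres and use the averages as the initialization for one last run:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
k = 3

# Centres from several independently seeded runs.
centre_sets = [KMeans(n_clusters=k, n_init=1, random_state=s).fit(X).cluster_centers_
               for s in range(5)]

# Align every run's centres to the first run's centres by minimum total distance.
ref = centre_sets[0]
aligned = [ref]
for C in centre_sets[1:]:
    cost = np.linalg.norm(ref[:, None, :] - C[None, :, :], axis=2)
    _, cols = linear_sum_assignment(cost)   # cols[i] = centre in C matched to ref[i]
    aligned.append(C[cols])

# Average the matched centres and use them as an explicit initialization.
init = np.mean(aligned, axis=0)
final = KMeans(n_clusters=k, init=init, n_init=1).fit(X)
print(final.inertia_)
```

Passing an array as `init` (with `n_init=1`) makes scikit-learn start from exactly those centres.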

1 Answer


There are many statistics suitable for comparing the similarity of clusterings. Probably the most popular is the adjusted Rand index (ARI).

If you have multiple runs of k-means and the results have a high ARI (close to 1), then they are very similar. An ARI close to 0 would indicate very unstable results; but this is essentially impossible with k-means, because it only produces convex clusterings, which are bound to be rather similar (an ARI of 0 arises for random cluster labels that do not take the data into account).
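Comparing two differently seeded runs with the ARI is a one-liner with scikit-learn (the toy data here is my own):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
a = KMeans(n_clusters=3, n_init=1, random_state=1).fit_predict(X)
b = KMeans(n_clusters=3, n_init=1, random_state=2).fit_predict(X)
ari = adjusted_rand_score(a, b)
print(ari)  # near 1.0 when the runs agree
```

The ARI is invariant to the arbitrary cluster labels, so there is no need to match label IDs across runs first.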

You may also want to look at the WCSS (pooled within-cluster sum of squares) that k-means optimizes. Results with a low ARI but highly similar WCSS indicate that there are multiple equally good local minima; then k may be too small. A high ARI but very different WCSS must be caused by only a few points, probably outliers that disturb the algorithm?
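The WCSS-versus-ARI diagnostic above can be sketched as follows, using scikit-learn's `inertia_` attribute (which holds the final WCSS of a fitted run); the data and run count are my own illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
runs = [KMeans(n_clusters=3, n_init=1, random_state=s).fit(X) for s in range(5)]

wcss = [km.inertia_ for km in runs]                       # final WCSS per run
aris = [adjusted_rand_score(runs[0].labels_, km.labels_)  # agreement with run 0
        for km in runs[1:]]

print("WCSS spread:", np.ptp(wcss))   # small spread: equally good optima
print("min ARI vs run 0:", min(aris))  # low ARI + small spread: many minima
```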