4

A common rule of thumb for choosing the number of clusters $k$ in k-means clustering suggests

$$ k \sim \sqrt{n/2} $$

$n$ being the number of points to cluster. I'd like to know where this comes from and what the (heuristic) justification is. I cannot find any good sources on it.
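For concreteness, here is a minimal Python sketch of what the rule gives for a few sample sizes. The function name `rule_of_thumb_k` and the rounding to the nearest integer are illustrative assumptions, not part of the rule as usually stated.

```python
import math


def rule_of_thumb_k(n: int) -> int:
    """Rule-of-thumb number of clusters, k ~ sqrt(n / 2).

    Rounding to the nearest integer (and never going below 1) is an
    assumption made here for illustration; the rule itself only gives
    an order of magnitude.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return max(1, round(math.sqrt(n / 2)))


# Print the suggested k for a few data set sizes.
for n in (50, 200, 5_000, 1_000_000):
    print(n, rule_of_thumb_k(n))
```

Note how slowly $k$ grows: a hundredfold increase in $n$ only multiplies the suggested $k$ by ten.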

The only references I can find about this are a comment on ResearchGate and this review, which does not explain it either.

3 Answers

4

For the purpose of data approximation, this value can give you desirable properties. Computing all pairwise distances among $\sqrt{n/2}$ points takes about $n/4$ distance evaluations, which is linear in $n$, so if you want to reduce the size of your data set, this can be a value of interest.
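The cost argument can be checked with a short Python sketch. It assumes the claim is about the number of unordered pairs among $k \approx \sqrt{n/2}$ representative points, which is $k(k-1)/2$; the function name `pairwise_distance_count` is hypothetical.

```python
import math


def pairwise_distance_count(n: int) -> int:
    """Number of unordered pairs among k = round(sqrt(n / 2)) points.

    Assumption: the cost of interest is one distance evaluation per
    unordered pair of the k representatives, i.e. k * (k - 1) / 2.
    """
    k = round(math.sqrt(n / 2))
    return k * (k - 1) // 2


# The ratio pairs / n stays below 1/4, so the cost grows linearly in n.
for n in (200, 20_000, 2_000_000):
    pairs = pairwise_distance_count(n)
    print(n, pairs, pairs / n)
```

For comparison, computing all pairwise distances on the original $n$ points would take $n(n-1)/2$ evaluations, which is quadratic in $n$.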

For actual clustering, the value usually is unreasonably large.

1

I'm not sure there is a "best" answer to this. I could only find a few references to your rule and no underlying theory. I went through some of the Springer texts (ISLR and ELSL) here on my laptop, and the chapters that mention k-means note that there are ways to choose $k$, but there is no consensus on the matter.

There is just a single reference to additional material on the subject (Hastie et al. (2009)) in ISLR. It appears that this method might begin with assigning p-values to your clusters, but the details are a bit thin and I have yet to look into that part. However, that might be a place to start!

  • I went through the ELSL and found no reference to this specifically either. The point here is not about the best answer; I'd just like to understand the gist of where that estimate comes from. – martina.physics May 01 '17 at 21:42
  • I'm sort of unfamiliar with this estimation myself! Do you recall where you have seen it? Also, I made a typo in my reply. Sorry about that. I'll fix it and try to look up the reference I mentioned earlier. – user7351362 May 01 '17 at 21:44
  • I've added it to the question! – martina.physics May 01 '17 at 21:53
  • The answer is useful, but as the OP suggests, it does not answer the question as to the rationale for the rule that he/she gave. – Michael R. Chernick May 01 '17 at 22:01
1

The references ("good sources") for the "rule of thumb":

luart