3

In K-means algorithm, it is recommender to pick the optimal K, according to the Elbow Method. However all the tutorials explain the elbow method in these 4 steps:

  • Run K-means for a range of K's
  • Calculate the Sum of Squares of the distances from the cluster mean
  • Plot a curve of the SSD over K's
  • Visually pick the K at the elbow

Can I code it as an algorithm? What is the mathematical definition of the elbow?

For example - Where is the elbow of the following curve? and Why?

enter image description here

Dimgold
  • 328
  • 1
    It seems that the elbow might be defined differently according to different loss functions – Gideon Kogan May 07 '20 at 19:31
  • What is the elbow in case of SSD/MSE? – Dimgold May 07 '20 at 19:35
  • you might find something similar to "Bayesian information criterion" – Gideon Kogan May 08 '20 at 11:15
  • 1
    These subjective measures exist because it is often an explorative method and one doesn't care about some exact hypothesis test. But also, a hypothesis test would not (easily) exist because there is no clear distribution for a statistic derived from the plot. However, you could compute an estimate for a statistic by using a Monte Carlo method. If you randomly reshuffle the category labels and redo the k-means algorithm then you can see whether the sum of squares distance is special (indicating the clustering in groups is right) or just a random statistical effect. – Sextus Empiricus Feb 10 '21 at 00:49

2 Answers2

7

Elbow method is a heuristic. There's no "mathematical" definition and you cannot create algorithm for it, because the point of the method is about visually finding the "breaking point" on the plot. This is subjective criteria and it often happens that different people could end up with different conclusions given same plots.

Tim
  • 138,066
  • 2
    Thanks! Heuristic, doesn't mean that there is no algorithm, and I believe that if there are 'guidelines' for humans, there is some algorithmic definition to it – Dimgold May 08 '20 at 09:48
  • 1
    @Dimgold Yes, there is: you look for a point where the line "breaks", the steepness changes, where this is judged on subjective manner. – Tim May 08 '20 at 09:57
  • 1
    can you pick (and explain) such a point in the curve above? – Dimgold May 08 '20 at 19:20
  • 2
    @Dimgold the curve you showed does not have clear breaking point, in my opinion. Based only on the plot above it'd be hard for me to decide for the choice. – Tim May 08 '20 at 19:33
3

It is possible to make a mathematical model for the elbow point, at least if you have a smooth decreasing curve. In fact, here are three definitions of the elbow point, where we draw a line segment $A$ that connects the endpoints of the curve:

(1) Find the tangent line to the curve that is parallel to the line segment $A.$ Define the elbow point as the point where the tangent line intersects the curve.

(2) Find the point on the curve that has the greatest vertical distance to the line segment $A.$ Define the point as the elbow point.

(3) Find the point on the curve that has the greatest perpendicular distance to $A.$ Define this point as the elbow point.

Interestingly, under some surprisingly general conditions these definitions will all determine the same point.

Update: I eyeballed the OP's data and fit a power curve of the form $y=ax^{-b},$ which gives a great fit $(R^2 > 0.99)$. Using method (1) as described above, the elbow point is around 7.7 (so the user would likely select $k=8.$ Here is a visual: Elbow Point Plot

I also fit a logarithmic curve, (not shown here, but again $R^2 > 0.99$), which gave an elbow point of 7.99, in good agreement with the first fit.

soakley
  • 4,516
  • 2
    The reason why the elbow method is a heuristic is that it has no general bearing on the problem of selecting a number of PCs. It does, though, give the analyst an opportunity to study the eiegnvalues and thereby serves as one more tool for seeing and reasoning about the data. Although it's nice to have a mathematical algorithm, doesn't that make the procedure even more arbitrary? By eliminating the need for a graphic (as well as the reasoning that goes into its contemplation), doesn't that make your approach less useful than the original heuristic? – whuber Jul 16 '21 at 21:41