4

This page explains very good how k-means works: http://mnemstudio.org/clustering-k-means-example-1.htm

Somewhere I heard that there is some algorithms, which use triangle inequality to speed up the process. In which step, and how exactly can use triangle inequality?

1 Answers1

2

Using triangle inequality to find the bounds to avoid redundant distance calculations.

Let x be a point and let b and c be centers.

Lemma 1: If d(b,c)>=2d(x,b), then d(x,c)>=d(x,b)

If d(b,c)/2 >= d(x,b), then d(x,b) <= d(x,c), thus no need to calculate d(x,c).

If we know an upper bound u>=d(x,c), but u >d(b,c)/2, then we need compute d(x,b) and d(x,c)

Lemma 2: d(x,c)>=max{0, d(x,b)-d(b,c)} >=max{0, l'-d(b,c)} 

where l' is a lower bound of d(x,b) in the previous iteration's calculation

The optimized algorithm is based on the fact that most distance calculations in standard K-means are redundant. If a point is far away from a center, it is not necessary to calculate the exact distance between the point and the center in order to know that the point should not be assigned to this center. Conversely, if a point is much closer to one center than to any other, calculating exact distances is not necessary to know that the point should be assigned to the first center.

Details are in the original paper about acceleration: Charles Elkan. Using the triangle inequality to accelerate k-means. In Tom Fawcett and Nina Mishra, editors, ICML, pages 147–153. AAAI Press, 2003.

Other references: Greg Hamerly, Making k-means even faster In proceedings of the 2010 SIAM international conference on data mining (SDM 2010), April 2010. pdf!