Using triangle inequality to find the bounds to avoid redundant distance calculations.
Let x be a point and let b and c be centers.
Lemma 1: If d(b,c)>=2d(x,b), then d(x,c)>=d(x,b)
If d(b,c)/2 >= d(x,b), then d(x,b) <= d(x,c), thus no need to calculate d(x,c).
If we know an upper bound u>=d(x,c), but u >d(b,c)/2, then we need compute d(x,b) and d(x,c)
Lemma 2: d(x,c)>=max{0, d(x,b)-d(b,c)} >=max{0, l'-d(b,c)}
where l' is a lower bound of d(x,b) in the previous iteration's calculation
The optimized algorithm is based on the fact that most distance
calculations in standard K-means are redundant. If a point is far
away from a center, it is not necessary to calculate the exact
distance between the point and the center in order to know that the
point should not be assigned to this center. Conversely, if a point is
much closer to one center than to any other, calculating exact
distances is not necessary to know that the point should be assigned
to the first center.
Details are in the original paper about acceleration:
Charles Elkan. Using the triangle inequality to accelerate k-means. In Tom Fawcett and Nina Mishra, editors,
ICML, pages 147–153. AAAI Press, 2003.
Other references:
Greg Hamerly, Making k-means even faster In proceedings of the 2010 SIAM international conference on data mining (SDM 2010), April 2010. pdf!