2

How to assign a cluster to a point which is equidistant from more than one cluster?

Nick Cox
  • 56,404
  • 8
  • 127
  • 185

1 Answers1

2

Theoretically the cluster means and cluster assignments should be chosen so that the k-means objective function is minimised. Note that this implies that the cluster assignments and the means are chosen at the same time. The question seems to imply that the clusters are there before a point is assigned, but in fact the assignment of all points defines the clustering. It may theoretically happen that more than one choice of assignments gives the same optimal value of the objective function, in which case there is just no unique optimum. For example, if your data are 1,2,3,4,5, clustering with $k=2$ as $\{1,2,3\}, \{4,5\}$ is just as good as $\{1,2\}, \{3,4,5\}$, and there is no objective rule that says that one of these should be preferred to the other. It can be reported that these solutions have the same quality and the data are ambiguous.

As the k-means solution is usually computed by an iterative algorithm (and is not normally guaranteed to be globally optimal), it can also happen during the algorithm that a point is equidistant from the cluster means produced in the previous iteration (in this situation it makes sense to think of the clusters to be there first, and then an observation is assigned to them). I'd think that this happens quite a bit more often than that two overall solutions really have the same quality. In that case it depends on the implementation of the algorithm what is done. One could assign the point randomly, or one could assign it to the smallest cluster, or do something else. There is no theory, as far as I know, that prefers anything in particular. One could hope that, when running the algorithm lots of times from random starting points, ultimately the best solution is found, if there's only one (which can well happen even if there is an ambiguity regarding one or more points at some stage during the algorithm).

As far as I know, a standard implementation of k-means unfortunately won't tell you that there are several equally good solutions if this is the case (although there may be software that does it, assuming that these solutions are all found). Normally it will just assign such an observation somehow and report the resulting clustering.