2

In k-means clustering we initially pick $k$ random centroids and assign each point in the dataset to whichever of these $k$ centroids is nearest. After this we create new centroids by taking the mean of the assigned points.

However, it may happen that one of the initially selected random centroids is not the nearest centroid to any point in the dataset, so that no points are assigned to it. In such a case, what should be done in the step of creating new centroids?
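(To make this concrete, here is a minimal sketch in R; the data and centroid values are made up purely to illustrate the situation:)

# Toy data: two well-separated groups of 1-D points
set.seed(1)
x <- c(rnorm(5, 0), rnorm(5, 10))

# Three centroids; the third is far from every point
centroids <- c(0, 10, 100)

# Step 1: assign each point to its nearest centroid
assignment <- apply(abs(outer(x, centroids, "-")), 1, which.min)

# Step 2: new centroids are the means of the assigned points;
# the empty third centroid produces NaN here
sapply(1:3, function(j) mean(x[assignment == j]))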

Glen_b
  • 282,281
Abhilash
  • 153
  • 2
    What exactly do you mean by "not nearest to any point"? It would seem that by definition, if there are centroids living in a real or other metric space with points, a given point must necessarily be nearest to some centroid. – Louis Cialdella Oct 12 '13 at 06:21
  • For example, if we pick 3 centroids and all the data points are nearest to either centroid 1 or centroid 2, then all the points would be assigned to centroids 1 or 2 and centroid 3 would not have any points assigned. – Abhilash Oct 12 '13 at 06:30
  • Aha. In that case, the points can be assigned to one of the clusters arbitrarily without any real issue.

    Additionally, you might find some of the methods for choosing the initial points interesting, such as K-Means++.

    – Louis Cialdella Oct 12 '13 at 06:39
  • 2
    @LCialdella: I think what he means is the situation where one of the centroids has no points assigned to it. – nico Oct 12 '13 at 06:55
  • @LCialdella You got me wrong; I might not have explained it clearly. I meant what 'nico' has mentioned. – Abhilash Oct 12 '13 at 07:13
  • @nico Could you please let me know what needs to be done in such scenarios? – Abhilash Oct 12 '13 at 07:14
  • I see no problem. A cluster may stay empty, after all. I checked your situation - with one initial center far away - in SPSS, which uses the Hartigan (1975) algorithm. It returns an empty cluster without any error message. – ttnphns Oct 12 '13 at 10:38

3 Answers

3

I am not sure there is a "standard" thing to do in the case where one of the initial centroids is completely off.

You can easily test this by specifying the initial centroids and seeing how things evolve!

For instance, R will just give you an error.

Say you do:

# Set the RNG seed to ensure reproducibility
set.seed(12345)

# Let's create 3 visually distinct clusters
n <- c(1000, 500, 850)
classifier.1 <- c(rnorm(n[1], 10, 0.9), 
                  rnorm(n[2], 25, 2),
                  rnorm(n[3], 35, 2))
classifier.2 <- c(rnorm(n[1], 5, 1),
                  rnorm(n[2], 10, 0.4),
                  rnorm(n[3], 2, .9))

# Colours for plotting the three clusters
col <- c("blue", "darkgreen", "darkred")
# Run k-means with 3 clusters and random initial centroids 
# to check the clusters are correctly recognized
km <- kmeans(cbind(classifier.1, classifier.2), 3)
# Plot the data, colored by cluster
plot(classifier.1, classifier.2, pch=20, col=col[km$cluster])

# Mark the final centroids
points(km$centers, pch=20, cex=2, col="orange")

# Now impose some obviously "wrong" starting centroids
start.x <- c(10, 25, 3000)
start.y <- c(10, 10, -10000)
km.2 <- kmeans(cbind(classifier.1, classifier.2), 
               centers=cbind(start.x, start.y))

Now, R obviously has no issue discriminating the 3 clusters when you let it choose the initial centroids, but when you run it the second time it will just say:

Error: empty cluster: try a better set of initial centers

I guess that if you are implementing your own algorithm you may choose to reproduce this behaviour, or instead give the user a warning and let the algorithm choose the centroids by itself.

Obviously, as others pointed out, there are algorithms such as k-means++ that help in choosing a good set of starting centroids.
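(For reference, here is a minimal sketch of k-means++ seeding in R; kmeanspp.centers is just an illustrative helper name, not a base R function:)

# k-means++ seeding (Arthur & Vassilvitskii, 2007): pick each new
# centre with probability proportional to the squared distance
# from the nearest centre chosen so far
kmeanspp.centers <- function(X, k) {
  n <- nrow(X)
  centers <- X[sample(n, 1), , drop=FALSE]  # first centre: uniform
  while (nrow(centers) < k) {
    # squared distance of each point to its nearest chosen centre
    d2 <- apply(X, 1, function(p) min(colSums((t(centers) - p)^2)))
    centers <- rbind(centers, X[sample(n, 1, prob=d2), , drop=FALSE])
  }
  centers
}

# Use the seeds as the initial centres for kmeans()
X <- cbind(classifier.1, classifier.2)
km.pp <- kmeans(X, centers=kmeanspp.centers(X, 3))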

Also, in R you can use the nstart parameter of the kmeans function to run the algorithm several times with different random starting centroids and keep the best solution; this will improve clustering in certain situations.
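For instance:

# nstart=25 runs k-means from 25 random starts and keeps the
# solution with the lowest total within-cluster sum of squares
km.best <- kmeans(cbind(classifier.1, classifier.2), 3, nstart=25)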

EDIT: also, note from the R kmeans help page:

The algorithm of Hartigan and Wong (1979) is used by default. Note that some authors use k-means to refer to a specific algorithm rather than the general method: most commonly the algorithm given by MacQueen (1967) but sometimes that given by Lloyd (1957) and Forgy (1965). The Hartigan–Wong algorithm generally does a better job than either of those, but trying several random starts (nstart > 1) is often recommended. For ease of programmatic exploration, k = 1 is allowed, notably returning the center and withinss.

Except for the Lloyd–Forgy method, k clusters will always be returned if a number is specified. If an initial matrix of centres is supplied, it is possible that no point will be closest to one or more centres, which is currently an error for the Hartigan–Wong method.

nico
  • 4,581
  • 3
  • 32
  • 45
1

I think that I understand what he means.

In this case you have to assign a new random centroid in place of the one with no nearby points, and apply the regular k-means update to the others. So you run a random re-initialization for the empty centroid and the usual k-means rule for the rest; a sketch of this is below.

I hope that was clear.
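(A minimal sketch of one such update step in R; update.centers is an illustrative name, and re-seeding from a random data point is just one possible choice:)

# One k-means update step that re-seeds any empty centroid
# (X is an n x d data matrix, centers a k x d centroid matrix)
update.centers <- function(X, centers) {
  # assign each point to its nearest centroid
  assignment <- apply(X, 1, function(p)
    which.min(colSums((t(centers) - p)^2)))
  for (j in 1:nrow(centers)) {
    members <- X[assignment == j, , drop=FALSE]
    if (nrow(members) == 0) {
      # empty cluster: draw a new random centroid from the data
      centers[j, ] <- X[sample(nrow(X), 1), ]
    } else {
      # regular k-means rule: mean of the assigned points
      centers[j, ] <- colMeans(members)
    }
  }
  centers
}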

hassan
  • 11
0

If your problem absolutely requires that you do centroid-based clustering, and at the end no points are assigned to one or more of the centroids, then I think it should not be an issue.

What it shows, however, is that the data does not have points where you expected them based on the provided centroids.

I ran into a similar situation, where the system generated the following error:

Error: empty cluster: try a better set of initial centers

Here, the issue was that my data was scaled but the centroids were unscaled.
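(A small sketch of how this can happen, reusing the simulated data from the accepted answer; the centre values here are made up for illustration:)

X <- cbind(classifier.1, classifier.2)    # data on the raw scale
ctr <- cbind(c(10, 25, 35), c(5, 10, 2))  # centres on the raw scale

# Scaled data with unscaled centres: every point is nearest the
# first centre, the other clusters stay empty, and kmeans stops with
# "Error: empty cluster: try a better set of initial centers"
Xs <- scale(X)
try(kmeans(Xs, centers=ctr))

# Fix: transform the centres with the same centering and scaling
ctr.s <- scale(ctr, center=attr(Xs, "scaled:center"),
               scale=attr(Xs, "scaled:scale"))
km.ok <- kmeans(Xs, centers=ctr.s)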

Sadiaz
  • 121