Questions tagged [k-means]

k-means is a method to partition data into clusters by finding a specified number of means, k, s.t. when data are assigned to clusters w/ the nearest mean, the w/i cluster sum of squares is minimized

K-means clustering attempts to achieve the following objective:

Given an integer $k$ and a set of $n$ data points in $\mathbb{R}^{d}$, the goal is to choose $k$ centers so as to minimize the total squared distance between each point and its closest center, also known as the within-group sum of squares.

To solve this problem exactly is in fact NP-hard, so instead an approximation algorithm is used:

  1. Choose $k$ initial centroids. The most basic method is to choose $k$ samples from the dataset $X$, although other variations exist.
  2. Assign each data point to its nearest centroid.
  3. Create new centroids by taking the mean value of all of the data points assigned to each previous centroid. Find the difference. This is the within-group sum of squares.
  4. Repeats steps 2. and 3. until the difference is less than a threshold.

Mathematically, k-means attempts to choose centroids that minimize the following objective function:

$$ \sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2)$$

where $x_j$ are data points and $\mu_i$ is the $i^{th}$ centroid.

1046 questions
14
votes
3 answers

k-means vs k-means++

As far as I know k-means picks the initial centers randomly. Since they're based on pure luck, they can be selected really badly. The K-means++ algorithm tries to solve this problem, by spreading the initial centers evenly. Do the two algorithms…
12
votes
3 answers

Assigning class labels to k-means clusters

I have a very basic question on clustering. After I have found k clusters with their centroids, how do I go about interpreting the classes of the data points that I have clustered (assigning meaningful class labels to each cluster). I am not talking…
Riyaz
  • 230
7
votes
1 answer

What does cluster size mean (in context of k-means)?

http://en.wikipedia.org/wiki/Cluster_analysis It states that: K-means separates data into Voronoi-cells, which assumes equal-sized clusters (not adequate here) and shows the image: Question: What do they mean by size? Is it the size of the spread…
5
votes
2 answers

K-means++ algorithm

I try to implement k-means++, but I'm not sure how it works. I have the following dataset: (7,1), (3,4), (1,5), (5,8), (1,3), (7,8), (8,2), (5,9), (8,0) From the wikipedia: Step 1: Choose one center uniformly at random from among the data points.…
4
votes
1 answer

k-means triangle inequality

This page explains very good how k-means works: http://mnemstudio.org/clustering-k-means-example-1.htm Somewhere I heard that there is some algorithms, which use triangle inequality to speed up the process. In which step, and how exactly can use…
3
votes
2 answers

Termination conditions for K-means and their interconnection

As far as I know, there are two termination criteria for K-means clustering algorithm: assignments of data points do not change centroids do not change I wonder if there is any kind of relation between these criteria. Do they imply each other?
3
votes
1 answer

Should one randomize the order in the data, for a k-means cluster analysis?

I have read somewhere that it is better to randomize the order of your data several times, and perform each time the corresponding ulterior kmeans analysis, to be sure that your clustering results are consistent (reproducible). In this way, you…
Faustino
  • 145
3
votes
2 answers

k-means with several repetitions

In matlab and python, when running k-means, it is possible to set several repetitions (with random init) so that all of them in the end are combined to have stable global result? I am wondering how these several outputs are combined? If they have…
fractile
  • 911
3
votes
2 answers

K-means for non-spherical (non-globular) clusters

It is said that K-means clustering "does not work well with non-globular clusters." However, is this a hard-and-fast rule - or is it that it does not often work? I have a 2-d data set (specifically depth of coverage and breadth of coverage of…
meld24
  • 75
2
votes
3 answers

Modified K-means with unequal cluster variances

I wonder how I can modify the K-means algorithm so that the cluster volumes are not equal to each other. The K-means objective is to minimize within cluster sum of squares $\sum_{i=1}^{p} {\parallel \mathit{X}_i-\mathit{L}_{\mathit{Z}_i}…
user5054
  • 1,549
2
votes
3 answers

K-Means clustering after first iteration

In k-means clustering we initially pick $k$ random centroids and assign the given data to one of these $k$ centroids (which ever is nearest). After this we create new centroids by taking the mean of the assigned points. However there might be case…
Abhilash
  • 153
2
votes
1 answer

In k-means, why does the sum of distances to the cluster not account for the size of the cluster?

Wikipedia gives one formulation for the k-means problem as: where we intend to find a set of clusters $S = \{S_1, \ldots, S_k\}$ to minimize this value. However, the equivalent formulation divides the sum of pairwise distances by the size of the…
z611
  • 255
2
votes
1 answer

K means: How to assign a cluster to a point which is equidistant from more than one cluster

How to assign a cluster to a point which is equidistant from more than one cluster?
2
votes
1 answer

K-means and reproducibility

I'm trying to find the optimal K-means clustering for a set of elements. For a particular K, K-means repeated several times does not always converge to the same clustering due to the randomness in the initialization. This means that whatever…
2
votes
1 answer

Correlated variables in kmeans clustering

I have a dataset with 3 variables: A, B and C. Now, A and B are ordinal variables (i.e.; the result of two questions measured using a 5-point Likert), whereas B is continuous. A and B are also correlated, Spearman rho = .50, p-value = 0.0046 I want…
dfucci
  • 121
1
2 3