11

I am trying to use the AIC and BIC to select the number of clusters for k-means. I found a post on Stack Overflow where some R code is posted, but I could not find any valid explanation of why it is correct.

The code:

kmeansAIC = function(fit){
  m = ncol(fit$centers)     # m = dimensionality of the data (the d in the formula below)
  n = length(fit$cluster)   # n = number of observations
  k = nrow(fit$centers)     # k = number of clusters
  D = fit$tot.withinss      # D = total within-cluster sum of squares
  return(D + 2*m*k)         # "deviance" term plus 2 * (number of estimated parameters)
}

This basically states

$$ AIC = -2 \mathcal{L}(\hat{\theta}) + 2p = \left(\sum_{l=1}^k\sum_{i=1}^{n_l}(\vec{x_{i,l}}-\vec{\mu_l})^2\right)+2 d k $$ where $k$ is the number of clusters, $n_l$ is the number of points in cluster $l$, $\vec{\mu_l}$ is the center of cluster $l$ and $d$ is the dimensionality of each point.
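For concreteness, the kind of selection I have in mind looks roughly like this (a minimal sketch; the dataset, the range of $k$ and `nstart` are just placeholder choices):

X   = scale(iris[, 1:4])     # any numeric data matrix, standardized here
ks  = 1:10
aic = sapply(ks, function(k) kmeansAIC(kmeans(X, centers = k, nstart = 25)))
ks[which.min(aic)]           # the k minimizing the criterion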

Now my problem is with the derivation of the log-likelihood function. If I consider k-means as a form of EM algorithm with spherical clusters, I get a Gaussian cluster $cl_i$ with density $f_{cl_i}$

$$ f_{cl_i}(\vec{x}, \vec{\mu_i}, \mathbf{\Sigma}) = \frac{1}{\sqrt{(2 \pi)^d \det(\mathbf{\Sigma})}} \exp\left(-\frac{1}{2} (\vec{x}-\vec{\mu})^T \mathbf{\Sigma}^{-1} (\vec{x}-\vec{\mu})\right) $$ with $\mathbf{\Sigma}$ as the covariance of cluster $cl_i$.

As the cluster is spherical, $\mathbf{\Sigma}$ can be written as $\mathbf{\Sigma}=\sigma^2 \mathbf{I}$ where $\mathbf{I}$ is the identity matrix. This gives $f_{cl_i}$ as $$ f_{cl_i}(\vec{x}, \vec{\mu_i}, \Sigma) = \frac{1}{\sqrt{(2 \pi \sigma^2)^d}} \exp\left(-\frac{(\vec{x}-\vec{\mu})^2}{2 \sigma^2}\right) $$ and therefore a log-likelihood function $\mathcal{L}$ for a cluster $$ \begin{aligned} \mathcal{L}(\Theta;\vec{x}) = \sum_{i=1}^{n} \log\left( \frac{1}{\sqrt{(2 \pi \sigma^2)^d}} \exp\left(-\frac{(\vec{x_i}-\vec{\mu})^2}{2 \sigma^2}\right) \right) = \\ - n \log\left( \sqrt{(2 \pi \sigma^2)^d}\right) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (\vec{x_i}-\vec{\mu})^2 \end{aligned} $$
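Summing this per-cluster log-likelihood over all $k$ clusters (writing $n_l$ for the size of cluster $l$, so the $n$ above becomes $n_l$, with $n=\sum_l n_l$ the total number of points and a common $\sigma^2$) gives the total log-likelihood that enters the AIC: $$ \mathcal{L}(\Theta;X) = \sum_{l=1}^{k} \left( - n_l \log\left( \sqrt{(2 \pi \sigma^2)^d}\right) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n_l} (\vec{x_{i,l}}-\vec{\mu_l})^2 \right) = - \frac{nd}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{l=1}^{k}\sum_{i=1}^{n_l} (\vec{x_{i,l}}-\vec{\mu_l})^2 $$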

I understand that the first term can be omitted, as I am only interested in differences of the AIC between different $k$, and that term is constant in $k$.

This leads me to the following formula for the AIC: $$ AIC = \frac{1}{\sigma^2} (\sum_{l=1}^k\sum_{i=1}^{n_l}(\vec{x_{i,l}}-\vec{\mu_l})^2) + 2 k d $$

which differs from the Stack Overflow formula by the factor $\frac{1}{\sigma^2}$.

I know that $\sigma^2$ does not matter for k-means itself, but it is obvious that a larger $\sigma^2$ penalizes model complexity more heavily in the AIC, since the fit term is downweighted relative to $2kd$. Thus it seems strange to me to just assume $\sigma^2 = 1$.

Is there an error in my line of argumentation or am I missing something?

I also used the same concept to calculate the Bayesian Information Criterion and observed that I end up with a suspiciously large number of clusters if I select the minimum of the BIC. The BIC drops very fast in the beginning and then decreases very slowly.

Is it common to select the AIC/BIC value at which the scoring function stops showing big decreases (similar to the elbow technique), or should it always be the minimum of the function?
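For reference, by "the same concept" for the BIC I mean a sketch along these lines, with the AIC penalty $2p$ replaced by $\log(n)\,p$ (details of my actual code may differ):

kmeansBIC = function(fit){
  m = ncol(fit$centers)     # dimensionality d
  n = length(fit$cluster)   # number of observations
  k = nrow(fit$centers)     # number of clusters
  D = fit$tot.withinss      # total within-cluster sum of squares
  return(D + log(n)*m*k)    # same fit term, BIC penalty log(n) * (number of parameters)
}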

Tobi

6 Answers

4

The model for which k-means is maximum likelihood is not a mixture model (there are no mixture proportions, and the classification of observations to clusters is crisp, whereas in mixture models there are posterior probabilities for this), but a model in which every observation has its own parameter indicating the cluster it belongs to. This is nonstandard because the number of parameters goes to infinity with the number of observations, and standard maximum likelihood theory does not hold (it is well known that the estimated k-means are not consistent for the means of the corresponding model, which is contrary to standard ML theory). This means that the theory behind AIC and BIC does not hold either, and these methods are theoretically invalid. (Obviously they may still do a good job in some situations; note also that it does not matter whether these strange discrete observation parameters are counted in the overall number of parameters or not, because for a given dataset their number is constant. One would need to deal with them, though, in any attempt to derive asymptotic theory.)

The model does not assume $\sigma^2=1$, and it is invalid as well to assume $\sigma^2$ to be dependent on the data. In fact it makes some sense to have a penalty that depends on $\sigma^2$: if the data have a certain variation, more clusters with a low variance are needed to cover them in a Voronoi-tessellation fashion (which is what k-means does) than if the clusters have a larger variance. In fact, in this way the AIC could also be used to choose a variance estimator, which is connected to the number of clusters in exactly that fashion: smaller variance, more clusters needed. This may look surprising because the variance is usually ignored in k-means, where people are only interested in the clustering for a given $k$.

Despite this, AIC and BIC are invalid anyway for the reason stated above; or rather, nobody has yet demonstrated how they could be valid, and the standard arguments do not apply.

3

I think that the answer you refer to on Stack Overflow is in turn based on another link: http://sherrytowers.com/2013/10/24/k-means-clustering/.

In that post the data are standardized, so the variance is equal to 1 and the formula makes sense. In any case I think your line of reasoning is absolutely correct, and one should refrain from using that expression if the underlying assumption does not hold.
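In other words, the formula is only reasonable after a preprocessing step along these lines (a sketch; `X` stands for your data matrix and the number of centers is arbitrary):

X_std = scale(X)                        # center each column and scale it to unit variance
fit   = kmeans(X_std, centers = 3, nstart = 25)
kmeansAIC(fit)                          # the sigma^2 = 1 assumption is now at least plausible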

2

I know this is an old (and answered) question, but I wanted to jump in because I ran into this same question and was surprised by how little information was available.

Before diving into my answer, I wanted to echo what was said by FinanceGuyThatCantCode: you have a small error in your proposal. The $\sigma$ in your question really should be $\sigma_i$, since it can differ between clusters. So what you're asking is not why the total variance is assumed equal to 1, but rather why each cluster variance is assumed equal to 1.

The answer is that the $\sigma_i=1$ assumption is actually an assumption of the k-means algorithm itself. Let's extend your analogy that k-means is really a latent Gaussian problem and look at the k-means loss function: $$ \min_{\{\mu_i\}} \sum_i \sum_{x \in S_i} \|x - \mu_i\|^2 $$
where $S_i$ is the set of points assigned to center $\mu_i$. If we imagine that this loss function is the negative log-likelihood, and that what we're really doing is a likelihood maximization over the latent Gaussian means, we get the following equivalent maximization: $$ \max_{\{\mu_i\}} \prod_i \prod_{x \in S_i} \exp \left[ -\tfrac{1}{2}\|x - \mu_i\|^2 \right] $$
Playing a little fast and loose with scaling constants, this is the same as: $$ \max_{\{\mu_i\}} \prod_i \prod_{x \in S_i} N(x;\mu_i,\sigma_i^2 = 1) $$
where $N(x;\mu_i,\sigma_i^2 = 1)$ is the normal PDF.

So we see that k-means implicitly assumes that the latent Gaussians have unit variance. This is because k-means isn't really meant to be a probabilistic model of the data; it's not a "generative" model which is intended to predict future datapoints. It's just an algorithm which uses a reasonable heuristic to produce label assignments.

Speculation incoming: I believe that if you wanted to relax the assumption that $\sigma_i=1$, then your best estimate of $\sigma_i^2$ would probably be something like $\frac{1}{|S_i|-1} \sum_{x \in S_i} \|x-\mu_i\|^2$, and so the AIC you propose (with the correction of $\sigma$ to $\sigma_i$) would collapse into something like: $$ \sum_i |S_i| + 2 k d = n + 2 k d, $$
which is just a direct penalty on $k$.
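Spelling that collapse out (ignoring the $-1$ in the variance estimator, and dropping the $\log\hat\sigma_i^2$ normalizing terms as the speculation above implicitly does): plugging $\hat\sigma_i^2 = \frac{1}{|S_i|}\sum_{x \in S_i}\|x-\mu_i\|^2$ into the fit term gives $$ \sum_i \frac{1}{\hat\sigma_i^2}\sum_{x \in S_i}\|x-\mu_i\|^2 = \sum_i \frac{\sum_{x \in S_i}\|x-\mu_i\|^2}{\frac{1}{|S_i|}\sum_{x \in S_i}\|x-\mu_i\|^2} = \sum_i |S_i| = n, $$ so the fit term contributes just $n$, independently of where the centroids are.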

0

An alternative to assuming $\sigma^2=1$ is to assume $$ \begin{aligned} \sigma^2 = \text{average residual sum of squares} = \\ \frac{RSS}{\text{number of cells}} = \\ \frac{\sum_{l=1}^k\sum_{i=1}^{n_l}(\vec{x_{i,l}}-\vec{\mu_l})^2}{nd} \end{aligned} $$

The resulting formula seems to agree with the BIC in the Gaussian special case, $$ BIC = n \log\left(\frac{RSS}{n}\right) + p \log(n), $$ where $p$ is the number of free parameters.

The modified code:

kmeansAIC = function(fit){
  d = ncol(fit$centers)     # dimensionality of the data
  n = length(fit$cluster)   # number of rows (observations)
  k = nrow(fit$centers)     # number of clusters
  D = fit$tot.withinss      # total within-cluster sum of squares (RSS)

  # return(D + 2*d*k)                  # original
  # return(n*d*log(D/(n*d)) + 2*d*k)   # sigma^2 estimated per individual cell
  return(n*log(D/n) + 2*k)             # sigma^2 estimated per individual row
}

Following your log-likelihood function $\mathcal{L}$ for a cluster $$ \begin{aligned} \mathcal{L}(\Theta;\vec{x}) = \sum_{i=1}^{n} \log\left( \frac{1}{\sqrt{(2 \pi \sigma^2)^d}} \exp\left(-\frac{(\vec{x_i}-\vec{\mu})^2}{2 \sigma^2}\right) \right) = \\ - n \log\left( \sqrt{(2 \pi \sigma^2)^d}\right) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (\vec{x_i}-\vec{\mu})^2 \end{aligned} $$ and plugging in this estimate instead of assuming $\sigma^2=1$, we get $$ \begin{aligned} =-\frac{nd}{2}\left(\log(2 \pi) + \log( \sigma^2)\right) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (\vec{x_i}-\vec{\mu})^2 = \\ -\frac{nd}{2}\log(2 \pi) -\frac{nd}{2} \log\left( \frac{RSS}{nd}\right) - \frac{1}{2 \frac{RSS}{nd}}RSS = \\ -\frac{nd}{2} \log\left( \frac{RSS}{nd}\right) + \text{Constant}(n,d), \end{aligned} $$ which brings us back to the special case BIC above, except that $nd$ refers to the number of cells and $n$ to the number of rows.
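A hypothetical side-by-side comparison of the original and modified criteria over a range of $k$ (sketch only; the data and `nstart` are placeholders):

X    = scale(iris[, 1:4])
ks   = 1:10
fits = lapply(ks, function(k) kmeans(X, centers = k, nstart = 25))
orig = sapply(fits, function(f) f$tot.withinss + 2 * ncol(f$centers) * nrow(f$centers))
mod  = sapply(fits, kmeansAIC)   # the per-row variant defined above
cbind(k = ks, original = orig, modified = mod)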

0

It doesn't matter anyway, because you typically only compare AIC values to see which model is better, and a constant positive factor doesn't change that comparison (if $A > B$ then $cA > cB$ for $c > 0$).

    The problem with this is that one often is interested in how different the two values might be. It helps to use a universal unit of measure for AIC. – whuber Sep 08 '22 at 15:49
0

It has been previously stated that the k-means objective is equivalent to the log-likelihood with spherical clusters under hard assignment, which gives rise to a valid AIC, e.g. here: http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/slides/EM_annotatedonclass.pdf, with the corresponding AIC here: https://nlp.stanford.edu/IR-book/html/htmledition/cluster-cardinality-in-k-means-1.html

Given Christian Hennig's comments I was trying to figure out where/when this argument breaks down.

Let $\Theta = \text{vec}(\{ \vec \mu_j, \sigma \})$, and let $\vec\mu_{C(i)}$ denote the centroid to which $\vec x_i$ is assigned, i.e. the nearest $\vec \mu_j$.

$$ \begin{aligned} L(\Theta;\vec{x}) & = \prod_{i=1}^{n} \sum_{j=1}^{k} \Pr( \vec x_0 = \vec x_i | \vec x_0 \in C_j ) \cdot \Pr(\vec x_0 \in C_j) \\ & = \prod_{i=1}^{n} \sum_{j=1}^{k} (2\pi\sigma^2)^{-d/2} \exp \left( -||\vec{x_i}-\vec{\mu_j}||^2 / 2\sigma^2 \right) \cdot \pi_j( \sigma, \{ \vec \mu_{\ell} \}_{\ell=1}^k) \\ & \ne \prod_{i=1}^{n} \sum_{j=1}^{k} (2\pi\sigma^2)^{-d/2} \exp \left( -||\vec{x_i}-\vec{\mu_j}||^2 / 2\sigma^2 \right) \cdot 1(\vec{x_i} \in C_j| \{ \vec \mu_j \}, \sigma) \end{aligned} $$ The right-hand side of the last line is equivalent in arg max to the k-means objective: $$ \arg\min_\Theta \sum_{i=1}^{n} \frac{||\vec{x_i}-\vec{\mu_{C(i)}}||^2}{2 \sigma^2} $$

The problem is that, even under the hard-assignment rule, you can't just replace the probability of being in cluster $C_j$ with an indicator function. Even though the assignment of $\vec x_i$ to $C_j$ is deterministic given $\mu$ and $\sigma$, to retain a legitimate likelihood you still need to re-normalize the now-truncated normal distributions, which only have density integrated over $C_j$, while the 'catchment area' of $C_j$ is a polygonal/polyhedral region (a tessellation cell) determined by the distances to all the $\vec \mu_j$'s.
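Concretely (my notation, not something taken from the linked slides), a legitimate hard-assignment likelihood would have to use the renormalized truncated densities $$ f_j(\vec x) = \frac{(2\pi\sigma^2)^{-d/2} \exp \left( -||\vec x-\vec{\mu_j}||^2 / 2\sigma^2 \right)}{\int_{C_j} (2\pi\sigma^2)^{-d/2} \exp \left( -||\vec u-\vec{\mu_j}||^2 / 2\sigma^2 \right) d\vec u}, \qquad \vec x \in C_j, $$ and the denominator depends on all of the $\vec \mu_\ell$'s through the boundaries of $C_j$, so it cannot be treated as a constant and simply dropped from the likelihood.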

So even under hard assignment, the k-means objective is equivalent to maximizing a likelihood only if the probability weights $\pi_j$ can be replaced with indicator functions without affecting the validity of the normal densities.

So basically this is to say: the AIC applies to k-means when the data-generating mechanism is a tessellation of the space by spherical multivariate normals, truncated by the decision boundaries but requiring no renormalization of the truncated densities. In that case the k-means centroids are the same as the $\mu_j$'s, and the AIC based on this likelihood would be a valid approach to comparing k-means fits across different $k$. Otherwise your mileage may vary.

The one case I can think of where the conditions are satisfied is if your clusters are spherical with equal variance and widely separated. This is exactly what is suggested by the $\sigma \rightarrow 0$ results here: K-means as a limit case of EM algorithm for Gaussian mixtures with covariances $\epsilon^2 I$ going to $0$ and here: https://yuhangzhou88.github.io/ESL_Solution/ESL-Solution/_13-Prototypes-and-Nearest-Neighbors/ex13-1/. However, the AIC will not be especially helpful in determining the number of clusters in that case, as it is unlikely that you are in doubt about $k$ when the clusters are obvious and distinct.
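For reference, the $\sigma \rightarrow 0$ argument referred to above works through the EM responsibilities: with equal mixing weights, $$ \gamma_{ij} = \frac{\exp \left( -||\vec{x_i} - \vec{\mu_j}||^2 / 2\sigma^2 \right)}{\sum_{\ell=1}^{k} \exp \left( -||\vec{x_i} - \vec{\mu_\ell}||^2 / 2\sigma^2 \right)} \rightarrow \begin{cases} 1 & \text{if } \vec{\mu_j} \text{ is the centroid closest to } \vec{x_i} \\ 0 & \text{otherwise} \end{cases} \quad \text{as } \sigma \rightarrow 0, $$ which recovers the hard k-means assignment.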
