I am trying to use the AIC and BIC for selecting the number of clusters for k-means. I found a post on Stack Overflow where some R code is posted, but I could not find any valid explanation of its correctness.
The code:
    kmeansAIC = function(fit){
      m = ncol(fit$centers)     # dimensionality of the points
      n = length(fit$cluster)   # number of points (not actually used)
      k = nrow(fit$centers)     # number of clusters
      D = fit$tot.withinss      # total within-cluster sum of squares
      return(D + 2*m*k)
    }
which basically states
$$ AIC = -2 \mathcal{L}(\hat{\theta}) + 2p = \left(\sum_{l=1}^k\sum_{i=1}^{n_l}(\vec{x_{i,l}}-\vec{\mu_l})^2\right) + 2 d k $$ where $k$ is the number of clusters, $n_l$ is the number of points in cluster $l$, $\vec{\mu_l}$ is the cluster center of cluster $l$, and $d$ is the dimensionality of each point (so $m$ in the code corresponds to $d$).
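To convince myself that `fit$tot.withinss` really is this double sum, I checked it on toy data (the data here is arbitrary and only for illustration):

    set.seed(42)
    X   = matrix(rnorm(200), ncol = 2)         # 100 points in 2 dimensions
    fit = kmeans(X, centers = 3, nstart = 25)
    # squared distance of every point to its assigned cluster center
    D   = sum((X - fit$centers[fit$cluster, ])^2)
    all.equal(D, fit$tot.withinss)             # TRUE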
Now my problem is with the derivation of the log-likelihood function. If I consider k-means as a form of an EM algorithm with spherical clusters, I get a Gaussian cluster $cl_i$ with density $f_{cl_i}$ like
$$ f_{cl_i}(\vec{x}, \vec{\mu_i}, \mathbf{\Sigma}) = \frac{1}{\sqrt{(2 \pi)^d \det(\mathbf{\Sigma})}} \exp\left(-\frac{1}{2} (\vec{x}-\vec{\mu_i})^T \mathbf{\Sigma}^{-1} (\vec{x}-\vec{\mu_i})\right) $$ with $\mathbf{\Sigma}$ as the covariance matrix of cluster $cl_i$.
As the cluster is spherical, $\mathbf{\Sigma}$ can be written as $\mathbf{\Sigma}=\sigma^2 \mathbf{I}$, where $\mathbf{I}$ is the identity matrix. This gives $f_{cl_i}$ as $$ f_{cl_i}(\vec{x}, \vec{\mu_i}, \mathbf{\Sigma}) = \frac{1}{\sqrt{(2 \pi \sigma^2)^d}} \exp\left(-\frac{(\vec{x}-\vec{\mu_i})^2}{2 \sigma^2}\right) $$ and therefore a log-likelihood function $\mathcal{L}$ for a single cluster of $n$ points: $$ \begin{aligned} \mathcal{L}(\Theta;\vec{x}) &= \sum_{i=1}^{n} \log\left( \frac{1}{\sqrt{(2 \pi \sigma^2)^d}} \exp\left(-\frac{(\vec{x_i}-\vec{\mu})^2}{2 \sigma^2}\right) \right) \\ &= - n \log\left( \sqrt{(2 \pi \sigma^2)^d}\right) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (\vec{x_i}-\vec{\mu})^2 \end{aligned} $$
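As a sanity check of this closed form, one can compare it with summing coordinate-wise `dnorm` log-densities (a spherical Gaussian factorizes into $d$ univariate normals; the data and $\sigma$ below are arbitrary, for illustration only):

    set.seed(1)
    d  = 2; n = 50; sigma = 1.5
    x  = matrix(rnorm(n * d), n, d)
    mu = colMeans(x)
    cx = sweep(x, 2, mu)                       # center the points at mu
    ll_dnorm  = sum(dnorm(cx, mean = 0, sd = sigma, log = TRUE))
    ll_closed = -n * log(sqrt((2 * pi * sigma^2)^d)) - sum(cx^2) / (2 * sigma^2)
    all.equal(ll_dnorm, ll_closed)             # TRUE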
I understand that the first term can be omitted, as I am only interested in differences of the AIC for different $k$, and that term is constant over $k$ (for a fixed $\sigma^2$).
This leads me to the following formula for the AIC: $$ AIC = \frac{1}{\sigma^2} \left(\sum_{l=1}^k\sum_{i=1}^{n_l}(\vec{x_{i,l}}-\vec{\mu_l})^2\right) + 2 k d $$
which differs by the factor $\frac{1}{\sigma^2}$ from the formula on Stack Overflow.
I know that $\sigma^2$ does not matter for k-means itself, but it clearly matters here: a larger $\sigma^2$ shrinks the fit term, so the complexity penalty $2kd$ weighs relatively more heavily. Thus it seems strange to me to just assume $\sigma^2 = 1$.
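To make the discrepancy concrete, the function I would write keeps $\sigma^2$ as a parameter (a hypothetical variant of the function above, not from the Stack Overflow post); `sigma2 = 1` recovers the version above up to the dropped constant term:

    kmeansAIC2 = function(fit, sigma2 = 1){
      d = ncol(fit$centers)                    # dimensionality
      k = nrow(fit$centers)                    # number of clusters
      fit$tot.withinss / sigma2 + 2 * k * d
    }

Depending on the chosen `sigma2`, the minimizing $k$ can change, which is exactly what bothers me.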
Is there an error in my line of argumentation or am I missing something?
I also used the same concept for calculating the Bayesian Information Criterion and observed that I end up with a suspiciously large number of clusters if I select the minimum of the BIC: the BIC drops very fast in the beginning and then decreases very slowly.
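For completeness, my BIC version follows essentially the same pattern, with the $2$ replaced by $\log(n)$, scanned over a range of $k$ (`X` stands in for my data; `nstart` just guards against poor local minima):

    kmeansBIC = function(fit){
      d = ncol(fit$centers)
      n = length(fit$cluster)
      k = nrow(fit$centers)
      fit$tot.withinss + log(n) * d * k
    }

    ks   = 1:20
    fits = lapply(ks, function(k) kmeans(X, centers = k, nstart = 25))
    bics = sapply(fits, kmeansBIC)
    plot(ks, bics, type = "b")                 # drops fast, then flattens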
Is it common to select the AIC/BIC value at which the scoring function no longer shows big decreases (similar to the elbow technique), or should it always be the minimum of the function?
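In other words, should I pick the minimum, or rather the first $k$ after which the relative improvement becomes negligible? Something like:

    # elbow-style rule: first k where the relative BIC improvement
    # drops below 1% (the threshold is an arbitrary choice)
    rel_gain = -diff(bics) / abs(bics[-length(bics)])
    ks[which(rel_gain < 0.01)[1]]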
