
I have clustered a matrix of 30 random variables, each with 13,000 observations, and obtained 10 clusters.

Now I need to test how good the clustering is by calculating the variance within each cluster. Does anyone know how I can calculate this variance?

I can easily calculate the variance of each column in my matrix (i.e., the variance of each random variable), but I want to calculate the variance of the whole cluster.

Does anyone know how this can be done?

For example:

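library(plyr)  # provides dlply() and laply()

data <- data.frame(x=c(2,2,2,3,7),
               y=c(30,40,40,30,10),
               z=c(1,2,3,4,5),
               cluster=c('a','a','c','a','c'))

# For each cluster label, compute the variance of every numeric column
candidates <- dlply(data,.(cluster),function(d){
 laply(d[,-4],var)  # column 4 is the cluster label, so drop it
})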

This gives the variance of each column within each cluster label (a, c), but I don't think it's the right approach.

Antoine
  • "Variance" is a function of one variable, so I'm not clear what you mean by "variance of a cluster". You could find various measures of the distance between points, calculated in various ways. Is that what you had in mind? – Peter Flom Feb 14 '14 at 23:27
  • By the term "multivariate variance within a cluster" we usually mean the sum of diagonal elements (the trace) of the covariance matrix computed for that cluster. – ttnphns Feb 15 '14 at 09:29

1 Answer


According to Hastie et al., equation 14.31 (see also Halkidi et al. 2001), the within-cluster variance $W(C_{k})$ of a cluster $C_{k}$ is defined (for the Euclidean distance) as $\sum_{x_{i}\in C_{k}}\|x_{i}-\bar{x}_{k}\|^2$, where $\bar{x}_{k}$ is the mean of cluster $C_{k}$ (also called the cluster centroid; its values are the coordinate-wise averages of the data points in $C_{k}$) and $\{x_{1}, \dots, x_{N}\}$ is the set of observations (each is a vector, with one coordinate per dimension). In plain English, the cluster variance is the sum, over all observations belonging to that cluster, of the squared coordinate-wise deviations from the cluster mean.

The total within-cluster scatter (for the entire set of observations) is simply $W=\sum\limits_{k=1}^K\sum_{x_{i}\in C_{k}}\|x_{i}-\bar{x}_{k}\|^2$ for $K$ clusters and $N$ observations with $K<N$. The goal of a clustering algorithm such as K-means is to minimize this quantity (or, equivalently, to maximize the between-cluster variance $B$). The total point scatter in the data set (the information) $T$ is equal to $W+B$. Clustering simply reorganizes the information originally present, decreasing $W$ and increasing $B$ as much as possible. $T$ remains constant, so there is no information loss (as opposed to dimensionality reduction techniques such as PCA, for instance).
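As a rough illustration of the formulas above, here is a minimal base-R sketch on the toy data from the question. The helper `scatter()` and the names `W_k`, `T_total` and `grand_mean` are made up for this example, and the cluster labels are taken as given rather than produced by any particular algorithm.

```r
# Toy data from the question: three numeric variables and a cluster label.
data <- data.frame(x = c(2, 2, 2, 3, 7),
                   y = c(30, 40, 40, 30, 10),
                   z = c(1, 2, 3, 4, 5),
                   cluster = c("a", "a", "c", "a", "c"))

vars <- c("x", "y", "z")

# Hypothetical helper: sum of squared Euclidean distances from each row
# of X to the vector of column means (the centroid).
scatter <- function(X) {
  X <- as.matrix(X)
  sum(sweep(X, 2, colMeans(X))^2)
}

# Within-cluster scatter W(C_k), one value per cluster.
by_cluster <- split(data[, vars], data$cluster)
W_k <- sapply(by_cluster, scatter)

# Total within-cluster scatter W and total point scatter T.
W <- sum(W_k)
T_total <- scatter(data[, vars])

# Between-cluster scatter B: squared distance from each cluster centroid
# to the overall mean, weighted by cluster size.
grand_mean <- colMeans(data[, vars])
B <- sum(sapply(by_cluster, function(X) {
  nrow(X) * sum((colMeans(X) - grand_mean)^2)
}))

print(W_k)
print(c(W = W, B = B, T = T_total))
```

On this data, `W + B` should match `T_total` up to floating-point error, which is the decomposition $T = W + B$ in action. With the real 13,000 x 30 matrix, the same code should apply after replacing `vars` with the 30 column names.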

Antoine
  • This is picky, but in the previous answer I would not call W(c_k) the "within-cluster variance" since it does not involve dividing by the degrees of freedom. HTF call it the scatter, but do not call it a variance. – Edward Malthouse Jan 01 '20 at 22:15
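
To make the distinction between scatter and variance concrete, here is a small base-R sketch, again on the toy data from the question. It assumes the usual sample covariance with $n_k - 1$ degrees of freedom (R's default in `cov()`); the names `scatter_k`, `trace_cov_k` and `n_k` are made up for this example. Under that assumption, the trace of a cluster's covariance matrix equals the cluster's scatter divided by $n_k - 1$.

```r
# Toy data from the question: three numeric variables and a cluster label.
data <- data.frame(x = c(2, 2, 2, 3, 7),
                   y = c(30, 40, 40, 30, 10),
                   z = c(1, 2, 3, 4, 5),
                   cluster = c("a", "a", "c", "a", "c"))

by_cluster <- split(data[, c("x", "y", "z")], data$cluster)

# Scatter: sum of squared distances to the cluster centroid (no division).
scatter_k <- sapply(by_cluster, function(X) {
  X <- as.matrix(X)
  sum(sweep(X, 2, colMeans(X))^2)
})

# "Multivariate variance": trace of the sample covariance matrix.
trace_cov_k <- sapply(by_cluster, function(X) sum(diag(cov(X))))

# Cluster sizes, needed for the degrees of freedom n_k - 1.
n_k <- sapply(by_cluster, nrow)

print(cbind(scatter = scatter_k,
            trace_cov = trace_cov_k,
            scatter_over_df = scatter_k / (n_k - 1)))
```

The last two columns should agree. Note that a cluster containing a single observation has zero degrees of freedom, so `cov()` returns `NA` for it.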