
Let $\textbf{X}$ be an $n \times p$ matrix whose rows contain the observations and whose columns contain the features, and assume that the features are centered at $0$. Let $C_k \subset \{1, \dots, n\}$ contain the indices of the observations that belong to class $k$. Why is $$ \widehat{\Sigma}_w = \frac{1}{n}\sum_{k=1}^{K} \sum_{i \in C_k} (\textbf{x}_i- \hat{\mu}_k)(\textbf{x}_i-\hat{\mu}_{k})^{T}$$ a reasonable estimate of the within-class covariance matrix?
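To make sure I am reading the formula correctly, here is how I would compute it in NumPy (just a sketch; the function name and layout are my own):

```python
import numpy as np

def within_class_cov(X, y):
    # For each class: deviations of its observations from the class mean,
    # summed as outer products, then divided by the total sample size n.
    n, p = X.shape
    Sigma_w = np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]                 # observations in class k
        dev = Xk - Xk.mean(axis=0)     # x_i - mu_hat_k
        Sigma_w += dev.T @ dev         # sum of (x_i - mu_hat_k)(x_i - mu_hat_k)^T
    return Sigma_w / n
```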

For a particular class $k$, we are interested in the covariance between observations. So we would have $\binom{|C_{k}|}{2}$ covariances for each class $k$, and the matrix should be of size $\binom{|C_{k}|}{2} \times K$.

Added: $K$ is the number of classes, and $w$ is just an abbreviation for within-class.

Damien
  • I don't see "?" - so where is the question itself? What is K? - number of classes? If yes then how does it differ from t? Are you speaking of the pooled within class covariance matrix? – ttnphns Jul 12 '12 at 20:50
  • @ttnphns, it appears the question is: why is $\widehat{\Sigma}_w$ a reasonable estimate? What I'm wondering is: what is $w$? It only appears in the subscript for $\Sigma$ and nowhere else. As you pointed out, $t$ is also mysterious. – Macro Jul 13 '12 at 12:33
  • just a note - that's the total within-class covariance matrix. – Ran Jul 13 '12 at 16:40

1 Answer


If your question is not so much "why is this the within-class covariance?" and more "why use this covariance?", I would recommend checking out this derivation (pdf) of the maximum likelihood solution for Linear Discriminant Analysis.

The parameters of the model, $\theta = (\mu_{0}, \mu_{1}, \Sigma, \pi)$, are chosen to maximise the joint likelihood of the training set, $\prod_{i} p(x_{i}, y_{i} \mid \theta)$.
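In case the link dies: the punchline of that derivation (a standard result, stated here without the algebra) is that maximising the log-likelihood separately in each parameter gives

$$\hat{\pi} = \frac{n_1}{n}, \qquad \hat{\mu}_k = \frac{1}{n_k} \sum_{i \in C_k} x_i, \qquad \hat{\Sigma} = \frac{1}{n} \sum_{k} \sum_{i \in C_k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{T},$$

with $n_k = |C_k|$. The maximum likelihood estimate of the shared covariance $\Sigma$ is exactly the $\widehat{\Sigma}_w$ from the question, which is one answer to "why use this covariance?".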

That derivation covers only the binary case $K = 2$, but the conclusion carries over. Note also that the covariance here is between features, not between observations: regardless of the number of classes $K$ and the number of examples $n$, the dimension of the covariance matrix is $p \times p$.
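A quick shape check with made-up data, using scikit-learn's `LinearDiscriminantAnalysis` (if I recall the API correctly, passing `store_covariance=True` makes it keep its pooled within-class covariance estimate as `covariance_`):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n, p, K = 30, 4, 3                     # arbitrary sizes for illustration
X = rng.normal(size=(n, p))            # n observations, p features
y = rng.integers(0, K, size=n)         # labels for K classes

lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(lda.covariance_.shape)           # (4, 4): p x p, whatever K and n are
```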