Here is my understanding of how dimension reduction in LDA works:
- We have $n$ samples each with $p$ features assigned to $k$ classes.
- We use the sample mean $\mu_j$ of each class and the pooled covariance matrix $$ \Sigma = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{x_i})(x_i - \mu_{x_i})^\top $$ as the parameters for $k$ multivariate normal distributions which model $P(x \mid y = j)$. These are the MLE estimates of the parameters. Note: I am using $\mu_{x_i}$ to denote the sample mean of the class associated with the observation $x_i$.
- Changing coordinates using $z = \Sigma^{-\frac{1}{2}}x$ "spheres" the data by turning the pooled (within-class) covariance matrix into the identity matrix.
- The sphered means $\Sigma^{-\frac{1}{2}} \mu_j$ lie on an affine subspace of dimension at most $k-1$. Classification only depends on the projection of the sphered data onto this subspace, so this is already a dimension reduction to $\min(p, k-1)$ dimensions (see the sketch after this list).
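In case it helps, here is a minimal numpy sketch of my understanding of the steps above. The helper name and the eigendecomposition route to $\Sigma^{-\frac{1}{2}}$ are my own choices, and it assumes $\Sigma$ is nonsingular:

```python
import numpy as np

def lda_sphere_and_project(X, y):
    """Sketch of the steps above: class means, pooled (MLE) covariance,
    sphering, and projection onto the subspace spanned by the sphered means.
    Hypothetical helper, not from any particular library."""
    classes = np.unique(y)
    n, p = X.shape

    # Per-class sample means mu_j, stacked into a (k, p) array
    means = np.array([X[y == c].mean(axis=0) for c in classes])

    # Pooled covariance (MLE, 1/n): deviations of each observation from
    # its own class mean
    centered = X - means[np.searchsorted(classes, y)]
    Sigma = centered.T @ centered / n

    # Sphering transform Sigma^{-1/2} via the symmetric eigendecomposition
    # (assumes all eigenvalues are strictly positive)
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

    Z = X @ Sigma_inv_sqrt        # sphered data
    M = means @ Sigma_inv_sqrt    # sphered class means, which lie in an
                                  # affine subspace of dimension <= k - 1

    # Orthonormal basis for the span of the centered sphered means;
    # its dimension is at most min(p, k - 1)
    M_centered = M - M.mean(axis=0)
    _, s, Vt = np.linalg.svd(M_centered, full_matrices=False)
    rank = int(np.sum(s > 1e-10))
    basis = Vt[:rank]             # (rank, p)

    # Coordinates of the sphered data in that subspace; nearest-sphered-mean
    # classification depends on the data only through these coordinates
    Z_proj = Z @ basis.T
    return Z_proj, M, basis
```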
Here are my questions, since other sources have been somewhat vague on these points:
Sometimes we want to reduce the dimension further.
Question 1: At this point do we just do PCA on the sphered means?
Question 2: If so, when centering the sphered means before the PCA, do we use the unweighted mean of the $k$ class means, or do we weight each class mean by the number of samples in its class?
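To make Questions 1 and 2 concrete, here is a sketch of the two centerings I have in mind, continuing from the sphered class means `M` in the code above (the function name and arguments are hypothetical):

```python
import numpy as np

def pca_on_sphered_means(M, counts, d, weighted_center=True):
    """Sketch of the further reduction I am asking about: PCA on the sphered
    class means M (shape (k, p)), keeping the top d directions.
    `weighted_center` switches between the two centerings in Question 2;
    `counts` holds the number of samples n_j in each class."""
    if weighted_center:
        # center at the grand mean: class means weighted by class sizes n_j
        center = counts @ M / counts.sum()
    else:
        # center at the unweighted average of the k class means
        center = M.mean(axis=0)

    # PCA via the SVD of the centered means
    _, _, Vt = np.linalg.svd(M - center, full_matrices=False)
    return Vt[:d]    # (d, p): directions to project the sphered data onto
```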
If the answer to Question 1 is "yes", I am perplexed by the common assertion that "LDA can be used for dimension reduction". I do agree that LDA can reduce the dimension to $\min(p, k-1)$ (although this is not always a reduction!), but if we are just using PCA after that, I wouldn't really understand the view that "LDA is doing dimension reduction". It seems more like PCA is doing the dimension reduction!