
Here is my understanding of how dimension reduction in LDA works:

  1. We have $n$ samples each with $p$ features assigned to $k$ classes.
  2. We use the sample mean $\mu_j$ of each class and the pooled covariance matrix $$ \Sigma = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{x_i})(x_i - \mu_{x_i})^\top $$ as the parameters for $k$ multivariate normal distributions which model $P(x \mid y = j)$. These are the MLE estimates of the parameters. Note: I am using $\mu_{x_i}$ to denote the sample mean of the class associated with the observation $x_i$.
  3. Changing coordinates using $z = \Sigma^{-\frac{1}{2}}x$ "spheres" the data by turning the common within-class covariance matrix into the identity matrix.
  4. The sphered means $\Sigma^{-\frac{1}{2}} \mu_j$ lie on an affine subspace of dimension at most $k-1$. Classification only depends on the projection of the sphered data onto this subspace, so this already reduces the dimension to $\min(p,k-1)$ (see the sketch just below this list).
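To make steps 1-4 concrete, here is a minimal NumPy sketch under the assumptions above. The function and variable names are mine, purely for illustration, and $\Sigma$ is assumed positive definite so that $\Sigma^{-\frac{1}{2}}$ exists:

```python
import numpy as np

def lda_sphere_and_project(X, y):
    """Steps 1-4: class means, pooled MLE covariance, sphering, and
    projection onto the affine span of the sphered means."""
    n, p = X.shape
    classes = np.unique(y)

    # Step 2: per-class sample means and the pooled (MLE, denominator n) covariance.
    means = np.stack([X[y == c].mean(axis=0) for c in classes])      # shape (k, p)
    centered = X - means[np.searchsorted(classes, y)]                # x_i - mu_{x_i}
    Sigma = centered.T @ centered / n

    # Step 3: sphere the data, z = Sigma^{-1/2} x (Sigma assumed positive definite).
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = X @ Sigma_inv_sqrt                                           # sphered data
    sphered_means = means @ Sigma_inv_sqrt

    # Step 4: the sphered means lie on an affine subspace of dimension <= k-1.
    # An orthonormal basis for it comes from the SVD of the centered sphered means;
    # projecting onto that basis keeps everything classification needs.
    M0 = sphered_means - sphered_means.mean(axis=0)
    _, s, Vt = np.linalg.svd(M0, full_matrices=False)
    basis = Vt[s > 1e-10]                                            # at most k-1 rows
    return Z @ basis.T, sphered_means @ basis.T
```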

Here are my questions, since other sources have been somewhat vague on these points:

Sometimes we want to reduce the dimension further.

Question 1: At this point do we just do PCA on the sphered means?

Question 2: If so, when centering the means before doing PCA, do we use the plain (unweighted) "mean of the means", or do we weight each class mean by the number of samples in its class?

If the answer to Question 1 is "yes", I am perplexed by the common assertion that "LDA can be used for dimension reduction". I agree that LDA can reduce the dimension to $\min(p,k-1)$ (although this is not always a reduction!), but if we are just running PCA after that, I don't really understand the view that "LDA is doing dimension reduction". It seems more like PCA is doing the dimension reduction!
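For concreteness, here is a hedged sketch of the procedure the two questions describe: PCA on the sphered class means, with the centering either unweighted or weighted by class size, so the two alternatives of Question 2 appear as a flag. The helper and its names are illustrative only, not a claim about how any particular textbook or library implements this:

```python
import numpy as np

def discriminant_directions(sphered_means, class_counts, n_components, weighted=True):
    """Return n_components directions (rows) in the sphered space, ordered by
    how much between-class spread of the means they capture."""
    counts = np.asarray(class_counts, dtype=float)
    k = len(counts)
    w = counts / counts.sum() if weighted else np.full(k, 1.0 / k)

    # "Mean of the means": weighted by class size, or the plain average (Question 2).
    center = w @ sphered_means
    D = sphered_means - center

    # Between-class scatter of the (sphered) means; its top eigenvectors are the
    # candidate directions, ordered by eigenvalue.
    B = (D * w[:, None]).T @ D
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:n_components]].T      # each row is one direction
```

With `weighted=False` the directions are the plain principal components of the sphered means; with `weighted=True` each class mean counts in proportion to its class size, which is exactly the distinction Question 2 is asking about.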

  • LDA is itself a means of reducing dimensionality. LDA is a supervised reduction and has little to do with PCA; nor does it need PCA as a second step. – ttnphns Nov 08 '23 at 20:23
  • The site has a great number of Q/A about LDA. Did you attempt to read any threads on it here? – ttnphns Nov 08 '23 at 20:25
  • Start maybe with https://stats.stackexchange.com/q/169436/3277; https://stats.stackexchange.com/a/190821/3277; https://stats.stackexchange.com/a/83114/3277 and many other places. – ttnphns Nov 08 '23 at 20:32
  • If the min(p,k-1) discriminants are too many for you, you may keep the first few of them. – ttnphns Nov 08 '23 at 20:37
  • @ttnphns Yes, I looked at as many of these as I could, and I had already read your answers to many other such questions. Your comment "If the min(p,k-1) discriminants are too many for you, you may keep the first few of them" doesn't make much sense to me. The subspace containing the means doesn't appear to have any "first discriminant", "second discriminant", etc. My question is exactly to clarify what these might be. – Steven Gubkin Nov 08 '23 at 23:34
  • Perhaps you could formulate an answer explaining the case $p = 2, k = 5$ with $\Sigma$ the identity matrix. What would the "first discriminant" be in this case? – Steven Gubkin Nov 08 '23 at 23:37
  • Can you provide a data snippet yourself, a numerical example illustrating your concerns? – ttnphns Nov 09 '23 at 03:28
  • On this pic, there exist 2 discriminants, and the 1st one is moderately stronger (larger eigenvalue, i.e. wider class spread). One might decide to keep only the 1st of the two discriminants present, saying "that will practically suffice". Do you feel uneasy about this perspective? – ttnphns Nov 09 '23 at 03:57
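A toy numerical sketch in the spirit of these comments, with $p = 2$, $k = 5$ and $\Sigma$ taken to be the identity (so the data are already sphered). The class means and counts below are made up purely for illustration; the point is only that the between-class scatter of the means has ordered eigenvalues, and its top eigenvector is what the comments call the "first discriminant":

```python
import numpy as np

# Five made-up class means in p = 2 dimensions, spread mostly along the x-axis.
means = np.array([[0.0,  0.0],
                  [4.0,  0.5],
                  [8.0, -0.5],
                  [2.0,  1.0],
                  [6.0, -1.0]])
counts = np.array([30, 30, 30, 30, 30])

w = counts / counts.sum()
center = w @ means
D = means - center
B = (D * w[:, None]).T @ D               # between-class scatter (Sigma = I here)

evals, evecs = np.linalg.eigh(B)
order = np.argsort(evals)[::-1]
print("eigenvalues:", evals[order])                       # ordered class-spread sizes
print("1st discriminant direction:", evecs[:, order[0]])  # direction of widest spread
```

Keeping only the first discriminant here means projecting onto the top eigenvector and discarding the second, which is the "keep the first few" step the comments describe.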
