I am currently studying discriminant analysis. Fisher's discriminant $\mathscr{D}$ is defined as follows:
$$\mathscr{D} = \max_{\{ \mathbf{e} \ : \ \vert\vert \mathbf{e} \vert \vert = 1 \}} \mathscr{q} ( \mathbf{e} ) = \max_{\{ \mathbf{e} \ : \ \vert\vert \mathbf{e} \vert \vert = 1 \}} \dfrac{\mathscr{b} ( \mathbf{e} )}{\mathscr{w} ( \mathbf{e} )}$$
where $\mathbf{e}$ is a $d$-dimensional unit vector, $\mathscr{b}$ is the between-class variability, and $\mathscr{w}$ is the within-class variability.
Now, I am told that, if $W$ is invertible, then the following hold:
- the between-class variability $\mathscr{b}$ is related to $B$ by $\mathscr{b} ( \mathbf{e} ) = \mathbf{e}^T B \mathbf{e}$;
- the within-class variability $\mathscr{w}$ is related to $W$ by $\mathscr{w}(\mathbf{e}) = \mathbf{e}^T W \mathbf{e}$;
- Fisher's discriminant $\mathscr{D}$ equals the largest eigenvalue of $W^{-1} B$; and
- the unit vector $\mathbf{\eta}$ which maximises the quotient $\mathscr{q}$ is the eigenvector of $W^{-1}B$ which corresponds to $\mathscr{D}$.
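
To check my reading of the last two claims, here is a minimal NumPy sketch. The source does not define $B$ and $W$, so I am assuming $B$ is the sum of outer products of the class means about their overall mean and $W$ is a symmetric positive definite matrix. The toy numbers and the names `fisher_direction` and `quotient` are hypothetical.

```python
import numpy as np

def fisher_direction(B, W):
    """Return (D, eta): the largest eigenvalue of W^{-1} B and its unit eigenvector."""
    # Solve W M = B instead of forming the inverse explicitly; M equals W^{-1} B.
    M = np.linalg.solve(W, B)
    eigvals, eigvecs = np.linalg.eig(M)
    # W^{-1} B is not symmetric in general, so use eig rather than eigh.
    # For symmetric PSD B and positive definite W the eigenvalues are real.
    eigvals, eigvecs = eigvals.real, eigvecs.real
    k = np.argmax(eigvals)
    eta = eigvecs[:, k]
    eta = eta / np.linalg.norm(eta)  # enforce ||eta|| = 1
    return eigvals[k], eta

def quotient(e, B, W):
    """q(e) = (e^T B e) / (e^T W e)."""
    return (e @ B @ e) / (e @ W @ e)

# Hypothetical toy inputs (assumptions, not taken from the source).
mus = np.array([[0.0, 0.0], [2.0, 1.0], [4.0, 0.5]])  # class means
centred = mus - mus.mean(axis=0)
B = centred.T @ centred                   # assumed form of between-class matrix
W = np.array([[2.0, 0.5], [0.5, 1.0]])    # assumed invertible within-class matrix

D, eta = fisher_direction(B, W)
print("D =", D, "q(eta) =", quotient(eta, B, W))
```

If the claims hold, `quotient(eta, B, W)` should agree with `D` up to rounding, and no other unit vector should give a larger quotient.
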
I am told that Fisher's rule $\mathcal{R}_F$ is defined as follows:
$$\mathcal{R}_F = \mathscr{l} \quad \text{if} \quad \vert \mathbf{\eta}^T \mathbf{X} - \mathbf{\eta}^T \mathbf{\mu_{\mathscr{l}}} \vert < \vert \mathbf{\eta}^T \mathbf{X} - \mathbf{\eta}^T \mathbf{\mu_\nu} \vert \quad \text{for all $\nu \neq \mathscr{l}$}$$
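
As I understand the rule, it can be sketched in a few lines of plain Python. The unit vector `eta`, the class means, and the function name `fisher_rule` are hypothetical values chosen only for illustration. They are not derived from any $B$ or $W$.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def fisher_rule(x, eta, class_means):
    """Return the label l whose projected mean eta^T mu_l is nearest to eta^T x.

    The rule as stated uses a strict inequality; in case of an exact tie
    this sketch simply returns the first label encountered.
    """
    z = dot(eta, x)  # scalar projection eta^T x
    return min(class_means, key=lambda label: abs(z - dot(eta, class_means[label])))

# Hypothetical inputs: a unit vector and three class means in d = 2.
eta = (0.6, 0.8)
class_means = {1: (0.0, 0.0), 2: (3.0, 1.0), 3: (1.0, 4.0)}

x = (2.5, 0.5)
print(fisher_rule(x, eta, class_means))  # projected means 0, 2.6, 3.8; eta^T x = 1.9
```

Here only the scalars $\mathbf{\eta}^T \mathbf{\mu_\nu}$ and $\mathbf{\eta}^T \mathbf{X}$ are compared, so each comparison is between two numbers rather than two $d$-dimensional vectors.
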
The following is then said:
> Fisher's rule assigns $\mathbf{X}$ the number $\mathscr{l}$ if the scalar $\mathbf{\eta}^T \mathbf{X}$ is closest to the scalar mean $\mathbf{\eta}^T \mathbf{\mu_{\mathscr{l}}}$. Thus instead of looking for the true mean $\mathbf{\mu_{\mathscr{l}}}$ which is closest to $\mathbf{X}$, we pick the simpler scalar quantity $\mathbf{\eta}^T \mathbf{\mu_{\mathscr{l}}}$ which is closest to $\mathbf{\eta}^T \mathbf{X}$.
I am interested in this part:
> Thus instead of looking for the true mean $\mathbf{\mu_{\mathscr{l}}}$ which is closest to $\mathbf{X}$, we pick the simpler scalar quantity $\mathbf{\eta}^T \mathbf{\mu_{\mathscr{l}}}$ which is closest to $\mathbf{\eta}^T \mathbf{X}$.
Why does using $\mathbf{\eta}^T \mathbf{\mu_{\mathscr{l}}}$ instead of $\mathbf{\mu_{\mathscr{l}}}$ make this simpler? If $\mathbf{\mu_{\mathscr{l}}}$ is difficult to calculate, why would multiplying it by $\mathbf{\eta}^T$ make it any easier to calculate? What is the mathematical reasoning behind this?
