
I am currently studying discriminant analysis. Fisher's discriminant $\mathscr{D}$ is defined as follows:

$$\mathscr{D} = \max_{\{ \mathbf{e} \ : \ \| \mathbf{e} \| = 1 \}} \mathscr{q} ( \mathbf{e} ) = \max_{\{ \mathbf{e} \ : \ \| \mathbf{e} \| = 1 \}} \dfrac{\mathscr{b} ( \mathbf{e} )}{\mathscr{w} ( \mathbf{e} )}$$

where $\mathbf{e}$ is a $d$-dimensional unit vector, $\mathscr{b}$ is the between-class variability, and $\mathscr{w}$ is the within-class variability.

Now, I am told that, if $B$ and $W$ are the between-class and within-class covariance matrices and $W$ is invertible, then the following hold:

  1. the between-class variability $\mathscr{b}$ is related to $B$ by $\mathscr{b} ( \mathbf{e} ) = \mathbf{e}^T B \mathbf{e}$;
  2. the within-class variability $\mathscr{w}$ is related to $W$ by $\mathscr{w}(\mathbf{e}) = \mathbf{e}^T W \mathbf{e}$;
  3. Fisher's discriminant $\mathscr{D}$ equals the largest eigenvalue of $W^{-1} B$; and
  4. the unit vector $\mathbf{\eta}$ which maximises the quotient $\mathscr{q}$ is the eigenvector of $W^{-1}B$ which corresponds to $\mathscr{D}$.
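
To make properties 3 and 4 concrete for myself, here is a minimal numerical sketch (mine, not from the book) that computes $\mathscr{D}$ and $\mathbf{\eta}$ as the largest generalized eigenvalue/eigenvector pair of $B\mathbf{v} = \lambda W\mathbf{v}$, assuming $B$ and $W$ are available as NumPy arrays:

```python
import numpy as np
from scipy.linalg import eigh

def fisher_discriminant(B, W):
    """Return (D, eta): the largest eigenvalue of W^{-1} B and the
    corresponding unit eigenvector, assuming W is invertible (and, as
    usual, symmetric positive definite)."""
    # Solve the generalized eigenproblem B v = lambda W v, whose
    # eigenvalues coincide with those of W^{-1} B.
    eigvals, eigvecs = eigh(B, W)       # eigenvalues in ascending order
    eta = eigvecs[:, -1]
    eta = eta / np.linalg.norm(eta)     # eigh normalises v^T W v = 1, not ||v|| = 1
    return eigvals[-1], eta
```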

I am told that Fisher's rule $\mathcal{R}_F$ is defined as follows:

$$\mathcal{R}_F = \mathscr{l} \ \ \ \ \text{if} \ \ \ \ \vert \mathbf{\eta}^T\mathbf{X} - \mathbf{\eta}^T \mathbf{\mu_{\mathscr{l}}} \vert < \vert \mathbf{\eta}^T \mathbf{X} - \mathbf{\eta}^T \mathbf{\mu_\nu} \vert \ \ \ \ \text{for all $\nu \not= \mathscr{l}$}$$
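
To check my understanding of the rule, here is a minimal sketch of it in code (the array `mus` holding the class means is my own notation, not the book's):

```python
import numpy as np

def fisher_rule(X, mus, eta):
    """Return the class index l for which |eta^T X - eta^T mu_l| is smallest.

    X   : (d,) observation
    mus : (k, d) array whose rows are the class means mu_1, ..., mu_k
    eta : (d,) Fisher direction (unit vector)
    """
    projections = mus @ eta                       # the k scalars eta^T mu_l
    return int(np.argmin(np.abs(eta @ X - projections)))
```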

The following is then said:

Fisher's rule assigns $\mathbf{X}$ the number $\mathscr{l}$ if the scalar $\mathbf{\eta}^T \mathbf{X}$ is closest to the scalar mean $\mathbf{\eta}^T \mathbf{\mu_\mathscr{l}}$. Thus instead of looking for the true mean $\mathbf{\mu_\mathscr{l}}$ which is closest to $\mathbf{X}$, we pick the simpler scalar quantity $\mathbf{\eta}^T \mathbf{\mu_\mathscr{l}}$ which is closest to $\mathbf{\eta^T} \mathbf{X}$.

I am interested in this part:

Thus instead of looking for the true mean $\mathbf{\mu_\mathscr{l}}$ which is closest to $\mathbf{X}$, we pick the simpler scalar quantity $\mathbf{\eta}^T \mathbf{\mu_\mathscr{l}}$ which is closest to $\mathbf{\eta^T} \mathbf{X}$.

Why does using $\mathbf{\eta}^T \mathbf{\mu_\mathscr{l}}$ instead of $\mathbf{\mu_\mathscr{l}}$ make this easier? If $\mathbf{\mu_\mathscr{l}}$ is difficult to calculate, then why would simply multiplying it by $\mathbf{\eta}^T$ suddenly make it easier to calculate? What is the mathematical reasoning behind this?

The Pointer

1 Answer


This material on discriminant analysis is, I think, taken from the book Analysis of Multivariate and High-Dimensional Data by Inge Koch.

I think you are absolutely right that $\eta^T \mu_l$ is no easier to calculate than $\mu_l$ itself. In fact, $\eta^T \mu_l$ requires more calculations to obtain than $\mu_l$. However, I think that the word 'simpler' in the sentence you highlight is just emphasising that $\eta^T \mu_l$ is a scalar rather than a vector.

The author then goes on to explain two advantages of using the scalar $\eta^T \mathbf{X}$ to classify rather than the original vector $\mathbf{X}$:

  1. It reduces/simplifies the multivariate comparisons to univariate comparisons. To classify a vector $\mathbf{X}$ using $\eta^T \mathbf{X}$, we need on the order of $d+\kappa$ arithmetic operations, where $d$ is the dimensionality and $\kappa$ is the number of classes: roughly $d$ operations to compute the projection $\eta^T \mathbf{X}$, then a further $\kappa$ operations for the univariate comparisons. To classify using the original vector $\mathbf{X}$, we would need on the order of $d\kappa$ operations, since we'd have to compute $\|\mathbf{X}-\mu_l\|$ for every class. So using $\eta^T \mathbf{X}$ can be much more efficient.

  2. It gives more weight to the important variables. This is probably the bigger issue. In the two-class scenario, classifying based on computing $\|\mathbf{X}-\mu_1\|$ and $\|\mathbf{X}-\mu_2\|$ is equivalent to classifying with $\eta^T \mathbf{X}$ where $\eta=(\mu_1-\mu_2)/\|\mu_1-\mu_2\|$, but this $\eta$ is in general suboptimal; Fig. 4.6 of Pattern Recognition and Machine Learning by Bishop illustrates this, and the sketch just after this list gives a simulated example. Finding the optimal $\eta$ can give a much better classifier.
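
Here is a small simulated sketch of the effect in point 2 (a toy example of my own, in the spirit of Bishop's figure): two Gaussian classes with strongly correlated within-class scatter, classified once with the 'naive' direction $(\mu_1-\mu_2)/\|\mu_1-\mu_2\|$ and once with the Fisher direction, which for two classes is proportional to $W^{-1}(\mu_1-\mu_2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes: the means are fairly close, and the within-class
# scatter is strongly correlated, so the mean-difference direction is not
# the best direction to project onto.
cov = np.array([[3.0, 2.7],
                [2.7, 3.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
X2 = rng.multivariate_normal([2.0, 0.5], cov, size=2000)
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# 'Naive' direction: normalised difference of the means.
eta_naive = (mu1 - mu2) / np.linalg.norm(mu1 - mu2)

# Fisher direction for two classes: proportional to W^{-1}(mu1 - mu2),
# where W is the within-class scatter.
W = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
eta_fisher = np.linalg.solve(W, mu1 - mu2)
eta_fisher /= np.linalg.norm(eta_fisher)

def error_rate(eta):
    """Fraction misclassified by the projected nearest-mean rule."""
    def classify(X):
        p = X @ eta
        return np.where(np.abs(p - mu1 @ eta) < np.abs(p - mu2 @ eta), 1, 2)
    wrong = np.sum(classify(X1) != 1) + np.sum(classify(X2) != 2)
    return wrong / (len(X1) + len(X2))

# The Fisher direction typically gives a noticeably lower error rate here.
print("naive  direction error rate:", error_rate(eta_naive))
print("Fisher direction error rate:", error_rate(eta_fisher))
```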

Additional details for 1: the univariate comparisons involve computing $|\eta^T \mathbf{X}-\eta^T \mu_l|$ for each class $l$ in turn, making a note of which class $l$ minimises this quantity. The procedure will take longer if there are more classes, but the computation time is not affected by the dimensionality $d$.

Additional details for 2: In the two class scenario, if we classify using $\mathbf{X}$ then we assign $\mathbf{X}$ to class $1$ if $\|\mathbf{X}-\mu_1\|<\|\mathbf{X}-\mu_2\|$. Define $\eta=(\mu_1-\mu_2)/\|\mu_1-\mu_2\|$. Then $\{\eta\}$ is an orthonormal subset of the underlying vector space, so it can be extended to an orthonormal basis for the entire vector space $\{\eta,e_2,\cdots,e_d\}$. So, for $i=1,2$, $$\|\mathbf{X}-\mu_i\|^2=|\eta^T(\mathbf{X}-\mu_i)|^2+\sum_{k=2}^d |e_k^T(\mathbf{X}-\mu_i)|^2$$ Each $e_k$ is orthogonal to $\eta$, so $e_k^T(\mathbf{X}-\mu_1)=e_k^T(\mathbf{X}-\mu_2)$. Therefore $\|\mathbf{X}-\mu_1\|<\|\mathbf{X}-\mu_2\|$ precisely when $|\eta^T(\mathbf{X}-\mu_1)|<|\eta^T(\mathbf{X}-\mu_2)|$. So we're really just classifying using $\eta^T \mathbf{X}$ for this specific suboptimal $\eta$.
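
If you want to convince yourself of that equivalence numerically, here is a quick sketch with random data (not from the book): for any $\mathbf{X}$, the nearest-mean rule and the rule based on $\eta^T\mathbf{X}$ with this particular $\eta$ pick the same class.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
eta = (mu1 - mu2) / np.linalg.norm(mu1 - mu2)

# For random observations X, the nearest-mean rule and the projected rule
# with this specific eta always agree (up to floating-point ties).
for _ in range(10_000):
    X = rng.normal(size=d)
    nearest_mean = np.linalg.norm(X - mu1) < np.linalg.norm(X - mu2)
    projected = abs(eta @ (X - mu1)) < abs(eta @ (X - mu2))
    assert nearest_mean == projected
print("the two rules agree on every sample")
```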

S. Catterall
  • Thanks for the answer. With regards to 1., can you please clarify what the "univariate comparisons" are? With regards to 2., can you please elaborate on why classifying based on computing $|\mathbf{X}-\mu_1|$ and $|\mathbf{X}-\mu_2|$ is equivalent to classifying with $|\mathbf{X}-\mu_2|$? – The Pointer Feb 11 '21 at 21:04
  • Sorry, I meant equivalent to classifying with $\eta^T \mathbf{X}$ in my comment above with regards to 2. Thank you for the additional details edit. – The Pointer Feb 11 '21 at 23:15
  • Why does each $e_k$ being orthogonal to $\eta$ imply that $e_k^T(\mathbf{X}-\mu_1)=e_k^T(\mathbf{X}-\mu_2)$? I don't understand anything after this part. – The Pointer Feb 11 '21 at 23:28
  • Subtract one side of this equality from the other and you're left with $e_k^T(\mu_1-\mu_2)=0$, which is just saying that $e_k$ is orthogonal to $\mu_1-\mu_2$ and hence to $\eta$. – S. Catterall Feb 11 '21 at 23:32
  • I don't understand the point of 2. You say "it gives more weight to the important variables", as if that's supposed to improve performance, so why are we using $\eta^T \mathbf{X}$ if it's still suboptimal? Also, note that, in my post, I use $\mathbf{\eta}^T \mathbf{\mu_\mathscr{l}}$ and $\mathbf{\mu_\mathscr{l}}$, which is different than $\eta^T \mathbf{X}$, which is what you're addressing in your answer. – The Pointer Feb 12 '21 at 00:55
  • Also, you say that $e_k$ is orthogonal to $\eta$, and so $e_k^T(\mathbf{X}-\mu_1)=e_k^T(\mathbf{X}-\mu_2)$. But how does this then imply that "$|\mathbf{X}-\mu_1|<|\mathbf{X}-\mu_2|$ precisely when $|\eta^T(\mathbf{X}-\mu_1)|<|\eta^T(\mathbf{X}-\mu_2)|$"? – The Pointer Feb 12 '21 at 01:01
  • The point of 2 is that classifying using $|\mathbf{X}-\mu_i|$, which naively seems sensible, is implicitly classifying using $\eta^T\mathbf{X}$ for a specific $\eta$ (at least in the two class scenario). Fisher's rule optimises over all possible $\eta$ so you generally get a better classifier i.e. a better $\eta$. – S. Catterall Feb 12 '21 at 10:00
  • Regarding your other comment, look at the displayed equation for $|\mathbf{X}-\mu_i|^2$. Each component of the sum has the same value for $i=1,2$, so the only way we could have $|\mathbf{X}-\mu_1|<|\mathbf{X}-\mu_2|$ is if $|\eta^T(\mathbf{X}-\mu_i)|^2$ differed for $i=1,2$. – S. Catterall Feb 12 '21 at 10:04
  • How can it be over all possible $\eta$ when we're using $\eta=(\mu_1-\mu_2)/|\mu_1-\mu_2|$, which is a single $\eta$? I don't see any iteration over many possible $\eta$ going on here. – The Pointer Feb 12 '21 at 15:29
  • The 'naive' classification based on $|\mathbf{X}-\mu_i|$ uses a single suboptimal $\eta=(\mu_1-\mu_2)/|\mu_1-\mu_2|$. Fisher's rule optimises over all $\eta$: but this is not done by iteration, it's done by calculation of the eigenvectors of $W^{-1}B$. – S. Catterall Feb 12 '21 at 16:40