
Here is the reference for NIPALS for PLSR:

https://learnche.org/pid/latent-variable-modelling/projection-to-latent-structures/how-the-pls-model-is-calculated


I cannot understand the last step, to find a new loading vector $p_a$ for $X_{a}:$

$$p_a = \dfrac{1}{t'_at_a}X'_at_a.$$

To my understanding, the pairs $(w,t)$ and $(c,u)$ are already the loading/score vectors of $X$ and $Y$; in other words, NIPALS (PLS) has finished and the next steps are purely for the regression. The reference seems to say that $t$ is not the score vector corresponding to the loading vector $w$, and that the correct score vector for $w$ should be $u$. However, $t_a$ is calculated as $X_aw_a$, so why is it not the score vector of $w$?

To my understanding, the regression step should be (considering only the first component):

  1. Use NIPALS to obtain $(w,t), (c,u).$

  2. Then we have $X \approx tw',\quad Y \approx uc'.$

  3. Build the regression between $u$ and $t:$ $u \approx t\theta.$

  4. Then the relation between $X$ and $Y:$ $Y\approx Xw\theta c' = XA.$

So the relation between $x$ and $y$ should be $y = xA.$

user6703592

1 Answer


The NIPALS algorithm


Given two matrices $X\in\mathbb{R}^{n\times p}$ and $Y\in\mathbb{R}^{n\times m}$, pick $u=Y_j$, some column of $Y$, and iterate:

  1. $w = \dfrac{X^\prime u}{u^\prime u}$; $w = \dfrac{w}{\|w\|}$. Or, equivalently, in one step: $w = \dfrac{X^\prime u}{\|X^\prime u\|}$

  2. $t =Xw$

  3. $c = \dfrac{Y^\prime t}{t^\prime t}$; $c = \dfrac{c}{\|c\|}$. Or, equivalently, in one step: $c = \dfrac{Y^\prime t}{\|Y^\prime t\|}$

  4. $u = Yc$

Repeat the previous steps until the difference in $w$ from one iteration to the next is small. Once the algorithm has converged, obtain the loadings as:

$p = \dfrac{X^\prime t}{t^\prime t}$

$q = \dfrac{Y^\prime t}{t^\prime t}$

Use the loadings to deflate the original matrices:

$X^{(2)} = X - tp^\prime$

$Y^{(2)} = Y - tq^\prime$


Once you have obtained $X^{(2)}$ and $Y^{(2)}$, you repeat the whole process again to compute the next PLS component, using these new matrices instead of $X$ and $Y$.
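For reference, here is a minimal NumPy sketch of one NIPALS component following the steps above; the symbols (`w`, `t`, `c`, `u`, `p`, `q`) match the notation of this answer, and the random, column-centred data at the end is only for illustration:

```python
import numpy as np

def nipals_pls_component(X, Y, tol=1e-10, max_iter=500):
    """One NIPALS PLS component, following the steps listed above.

    X is n x p, Y is n x m, both assumed column-centred.
    Returns the weights (w, c), scores (t, u) and loadings (p, q).
    """
    u = Y[:, [0]]                       # start from some column of Y
    w_old = np.zeros((X.shape[1], 1))
    for _ in range(max_iter):
        w = X.T @ u / (u.T @ u)
        w = w / np.linalg.norm(w)       # X weights, normalised to unit length
        t = X @ w                       # X scores
        c = Y.T @ t / (t.T @ t)
        c = c / np.linalg.norm(c)       # Y weights, normalised to unit length
        u = Y @ c                       # Y scores
        if np.linalg.norm(w - w_old) < tol:
            break
        w_old = w
    p = X.T @ t / (t.T @ t)             # X loadings: regress columns of X on t
    q = Y.T @ t / (t.T @ t)             # Y loadings (regression PLS): regress Y on t
    return w, t, c, u, p, q

# Illustrative data, then deflation before computing the next component
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5)); X -= X.mean(axis=0)
Y = rng.standard_normal((20, 2)); Y -= Y.mean(axis=0)
w, t, c, u, p, q = nipals_pls_component(X, Y)
X2 = X - t @ p.T                        # deflated X for the next component
Y2 = Y - t @ q.T                        # deflated Y for the next component
```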

Matrices of the NIPALS algorithm

We then have, for matrix $X$:

  • $w$ the weights vector of matrix $X$.
  • $p$ the loadings vector of matrix $X$.
  • $t$ the scores vector of matrix $X$.

For matrix $Y$:

  • $c$ the weights vector of matrix $Y$.
  • $q$ the loadings vector of matrix $Y$.
  • $u$ the scores vector of matrix $Y$.

As it was already explained here, the NIPALS algorithm solves a covariance maximization problem between $X$ and $Y$. So we use the weight vectors $w$ and $c$ to compute the scores $t$ and $u$.

Deflation of $X$

Now let's focus on the matrices related to $X$. We can see that the score of $Y$ is involved in the computation of $w$. This makes sense, as we are interested in maximizing the covariance between $X$ and $Y$. But now comes the deflation step. The deflation step seeks to remove from $X$ all the information already contained in the score vector $t$. In order to do that, we cannot use $w$, as it is a mixture of information from $X$ and $Y$. We solve this problem by regressing every column of $X$ onto $t$. This is done by computing the loadings

$$p = \dfrac{X^\prime t}{t^\prime t}$$

Using the loadings we can obtain a sort of "prediction of $X$" that has all the information contained in $t$:

$$\hat{X} = tp^\prime$$

We then obtain

$$X^{(2)}=X - tp^\prime $$

So we make sure that every bit of information that $t$ was explaining about $X$ has been removed from $X^{(2)}$ and is not going to be used again in the next iteration.
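A small numeric check of this point (with made-up data, just as a sketch): after deflating with $p$, every column of $X^{(2)}$ is orthogonal to $t$, so no information along $t$ remains:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))
t = rng.standard_normal((20, 1))    # stand-in for the X score vector

p = X.T @ t / (t.T @ t)             # regress every column of X onto t
X2 = X - t @ p.T                    # deflated matrix

print(np.allclose(X2.T @ t, 0))     # True: X2 retains nothing along t
```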

Deflation of $Y$

One could then intuitively think that the loadings of $Y$ should be computed as

$$q = \dfrac{Y^\prime u}{u^\prime u}$$

That is, based on the scores of $Y$. But this is not what happens here: as stated above, they are not computed that way; they are computed as

$$q = \dfrac{Y^\prime t}{t^\prime t}$$

That is, based on the scores of $X$. Why is this? The answer is that, depending on the objective you pursue with PLS, the loadings of $Y$ can be computed one way or the other. There are many different PLS algorithms, depending on which objective we have in mind. For a full review I recommend this paper, but here we are interested in PLS for regression.

Imagine that $Y$ is one-dimensional, and we compute the loadings of $Y$ as

$$q = \dfrac{Y^\prime u}{u^\prime u}$$

and then perform the deflation process as

$$Y^{(2)} = Y - uq^\prime$$

If $Y$ is one-dimensional, it has rank 1. Then we perform this deflation step and remove from $Y$ all the information contained in $u$, which effectively reduces the rank of $Y$ by one. This would yield a matrix of rank 0, and the PLS algorithm would stop here regardless of the number of variables in $X$.

For this reason, when we are interested in PLS for regression, we compute the loadings of $Y$ based on the scores of $X$, so that when doing the deflation step the rank of $Y$ is not reduced and we can compute the necessary components based on the number of variables in $X$.
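A quick numerical illustration of this rank argument (random data, only a sketch): with a one-column $Y$, deflating with its own score $u$ annihilates $Y$ completely, while deflating with the $X$ score $t$ leaves a usable residual:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4)); X -= X.mean(axis=0)
y = rng.standard_normal((30, 1)); y -= y.mean(axis=0)   # one-dimensional Y

# One NIPALS pass (with a single response the Y weight is just a sign, so u = +/- y)
u = y.copy()
w = X.T @ u; w /= np.linalg.norm(w)
t = X @ w

# Deflating Y with its own scores u wipes it out (rank drops to 0)
q_u = y.T @ u / (u.T @ u)
print(np.allclose(y - u @ q_u.T, 0))    # True

# Deflating Y with the X scores t leaves a nonzero residual to keep working with
q_t = y.T @ t / (t.T @ t)
print(np.allclose(y - t @ q_t.T, 0))    # False (in general)
```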

EDIT:

I answer the comments in the following edit.

  1. Yes, in PLS for regression the $Y$ weights and loadings are the same. But this happens only in PLS for regression; in other versions, like canonical PLS, it does not have to be this way.

  2. Yes, just as in PCA, the rank of the matrix is really reduced by one as each component is computed. For this reason the maximum number of components that can be computed, either in PCA or in PLS regression, is equal to the rank of matrix $X$, usually computed as $r=\min(n, p)$ where $n$ is the number of observations and $p$ the number of variables.

  3. If $X\in\mathbb{R}^{n\times p}$, where $n>p$, and $Y\in\mathbb{R}^{n\times m}$, and you are using PLS for regression, you will be able to compute at most $p$ PLS components. If you computed that many components, you would end up with a matrix $T\in\mathbb{R}^{n\times p}$ of $X$ scores and a matrix $U\in\mathbb{R}^{n\times p}$ of $Y$ scores, regardless of the original number of dimensions of $Y$.

  4. In order to obtain the regression coefficients you could simply run a linear regression model between $T$ and $Y$ (if $Y$ is one-dimensional) or between $T$ and each column of $Y$ (if $Y$ is multidimensional); a minimal sketch follows right after this list.
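The following is only a sketch of point 4, assuming the score matrix $T$ has already been assembled column by column from the PLS components; the data here are random placeholders:

```python
import numpy as np

# Placeholder shapes: T (n x k) holds the X scores of k PLS components
# stacked column-wise, Y (n x m) holds the centred responses.
rng = np.random.default_rng(3)
T = rng.standard_normal((30, 3))
Y = rng.standard_normal((30, 2))

# Ordinary least squares of Y on T; with a multidimensional Y this is
# equivalent to regressing each column of Y on T separately.
B, *_ = np.linalg.lstsq(T, Y, rcond=None)   # B is k x m
Y_hat = T @ B                               # fitted responses
```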

  • I have several questions: 1. Is $q = \dfrac{Y't}{t't}$ exactly $c$? 2. Does $X - tp'$ or $Y - uq'$ really reduce the rank, since it is no longer a spectral (eigenvalue) decomposition? 3. In PLS, we assume the number of components does not exceed the dimension of the variable, right? So when $\dim y = 1,$ we never expect a second component of $y$. – user6703592 Jun 29 '21 at 04:40
  • Regarding the fact that $y$ uses the score $t$ rather than $u$: I guess $y$ first uses the score $u$, and then taking the regression from $u$ to $t$ is equivalent to directly using the score $t$. To do the regression between $x$ and $y$, we have to build the relation from their score vectors. – user6703592 Jun 29 '21 at 04:44
  • I answer your comments in the edit – Álvaro Méndez Civieta Jun 29 '21 at 07:07
  • If we ignore the regression and purely consider the PLS, the loading vectors of $X/Y$ should be $w/c$ and the corresponding scores should be $Xw/Yc,$ and the deflation will be $X - (Xw)w'/Y - (Yc)c',$ right? And the number of components cannot be larger than $\min(\operatorname{rank} X, \operatorname{rank} Y),$ right? – user6703592 Jun 29 '21 at 08:00
  • For 2: is there any proof? Or is the proof actually not as obvious as the spectral decomposition in PCA? Since you said $Y-tq',\ q = \dfrac{Y't}{t't},$ will not reduce the rank but $X-tp',\ p = \dfrac{X't}{t't},$ does reduce the rank, I am confused about how to determine which case reduces the rank. To be more general, given a matrix $M$ and an arbitrary vector $a,$ when will $M-aq',\ q = \dfrac{M'a}{a'a},$ reduce the rank of $M$? – user6703592 Jun 29 '21 at 08:56
  • Regarding your first comment, no, if you purely consider PLS and not regression PLS, the loading vectors would still be computed as $p = \frac{X^\prime t}{t^\prime t}$, $q = \frac{Y^\prime u}{u^\prime u}$. It is important that the loading of X (Y) is computed based on the score vector of X (Y). Then the deflation would be made using the loading vectors. And yes, the number of components would be equal to $\min(\operatorname{rank}(X), \operatorname{rank}(Y))$ – Álvaro Méndez Civieta Jun 29 '21 at 09:40
  • Regarding your second comment, I am not aware of any proof. However, the key idea of why this rank reduction happens (in the case of PLS regression) lies in the fact that $p$, the $X$ loadings, are computed by regressing every column of $X$ onto the scores of $X$, while the loadings of $Y$ are computed by regressing every column of $Y$ onto the scores of $X$. So if you were to compute the loadings of $Y$ based on the scores of $Y$, that would produce the rank reduction in matrix $Y$. – Álvaro Méndez Civieta Jun 29 '21 at 09:44
  • As a summary, can we understand it as: here everything should be based on the scores, which are something like the real scores in PCA, while the loadings $w/c$ are no longer the real loadings as in PCA? If my conclusion is right, could I roughly understand the scores as the information of $X$ restricted onto $w,$ which means the scores purely belong to $X$? By comparison, a part of $w$ comes from $Y,$ so it is not purely from $X.$ – user6703592 Jun 29 '21 at 11:17
  • $w$ and $c$ are the weight vectors. The equivalent of the loading vectors in PCA are the loading vectors in PLS. The scores of $X$ are computed to maximize the covariance between them and $Y$. The loadings of $X$ are computed to retrieve all the information from the scores that is being explained in $X$. When we deflate, we remove this information. – Álvaro Méndez Civieta Jun 29 '21 at 15:27
  • Sorry, one more question :) https://stats.stackexchange.com/questions/530374/numerically-pca-implements-svd-or-svd-implements-pca I am not sure whether we numerically implement SVD via NIPALS on its corresponding PCA, or conversely implement SVD by some other numerical method and then use SVD to implement PCA (as one of the numerical methods)? – user6703592 Jul 02 '21 at 03:21
  • Can you please point me to a simple Python or Julia implementation of this algorithm? I checked the SKLearn library but it has too many bells and whistles to understand the essence in terms of code. – learningMachine Jun 01 '22 at 03:21