I was trying to understand what the score variable was in MATLAB. The PCA documentation says:
Principal component scores are the representations of X in the principal component space. Rows of score correspond to observations, and columns correspond to components.
What I find confusing is the following:
scores are the representations of X in the principal component space.
since I am not sure what that means precisely. For me (at least from an auto-encoding perspective) the representation of the data $X_N \in \mathbb{R}^{D \times N}$ in the principal component space would be the projection of all the data set points $X_N$ (where the data set points are the columns) onto the column space of $U$, the eigenvectors of the covariance matrix $C_N = \frac{1}{N} \sum^{N}_{n=1} (x^{(n)} - \bar{x}) ({x^{(n)}} - \bar{x})^T = \frac{1}{N} (X-\bar{X})(X - \bar{X})^{T}$. Therefore, score should be the best linear combination of the principal components $U$.
For a single data vector $x^{(i)}$ one can notice the following:
$$ a^{(i)} = \left( \begin{array}{c} u^T_1 x^{(i)}\\ \vdots \\ u^T_k x^{(i)}\\ \vdots \\ u^T_K x^{(i)} \end{array} \right)=U^Tx^{(i)}$$
produces the coefficients of the projections onto each principal component. Thus, each component $a^{(i)}_k$ tells you how much the data point $x^{(i)}$ projects onto the direction of eigenvector $u_k$. One can then reconstruct a single data point as follows:
$$\tilde{x}^{(i)} = \sum^{K}_{k=1} a^{(i)}_k u_k = U a^{(i)} = U U^T x^{(i)} $$
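As a quick sanity check of the two formulas above, here is a small NumPy sketch (not MATLAB, but the linear algebra is identical); with all $K = D$ eigenvectors kept, $U$ is square and orthogonal, so the reconstruction is exact:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 5
X = rng.random((D, N))           # columns are the data points x^(i)

# eigenvectors U of the covariance matrix, via SVD of the centered data
x_mean = X.mean(axis=1, keepdims=True)
U, _, _ = np.linalg.svd(X - x_mean)

x = X[:, 0]                      # one data vector x^(i)
a = U.T @ x                      # coefficients a^(i)_k = u_k^T x^(i)
x_rec = U @ a                    # reconstruction U U^T x^(i)

print(np.allclose(x_rec, x))     # True: U U^T = I when U is square
```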
From the above it's not too hard to see that the following equation reconstructs the whole data matrix $X_N$:
$$ \tilde{X}_N = U U^T X_N$$
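The same identity restricted to the top $K < D$ eigenvectors gives the usual lossy PCA reconstruction; a NumPy sketch (here `U_K`, a name of my choosing, denotes the first $K$ columns of $U$) shows that $U_K U_K^T$ really is a projector:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, K = 3, 5, 2
X = rng.random((D, N))
x_mean = X.mean(axis=1, keepdims=True)
Xc = X - x_mean                      # centered data

U, _, _ = np.linalg.svd(Xc)
U_K = U[:, :K]                       # top-K principal directions

# rank-K reconstruction of the data
X_tilde = U_K @ (U_K.T @ Xc) + x_mean

# P = U_K U_K^T is a genuine projector: P @ P == P
P = U_K @ U_K.T
print(np.allclose(P @ P, P))         # True
```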
Therefore, to understand what the variable score actually represents, it occurred to me to compare it with the above equation. Thus, I wrote the following script that does exactly that:
D = 3;
N = 5;
X = rand(D, N);
%% process data
x_mean = mean(X, 2); % mean of the data: x_mean = (1/N) * sum_i x^(i)
X_centered = X - repmat(x_mean, [1, N]);
%% PCA
[coeff, score, latent, ~, ~, mu] = pca(X'); % coeff should correspond to U
[U, S, V] = svd(X_centered);                % U = eigenvectors of the covariance (up to sign)
%% Reconstruct data
X_tilde_U = U * U' * X
X_tilde_coeff = coeff * coeff' * X
score % unfortunately not the same as either matrix above
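For anyone without the Statistics Toolbox, here is a NumPy translation of the script above (I emulate `pca` by an SVD of the centered data, which is what it does internally; component signs may differ from MATLAB's). It already hints at the mismatch: score does not even have the same shape as the reconstruction.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 3, 5
X = rng.random((D, N))                 # columns are observations, as in the script

# emulate [coeff, score, latent, ~, ~, mu] = pca(X')
mu = X.mean(axis=1)                    # pca centers the data first
Xc = X - mu[:, None]
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coeff = U                              # principal components (up to sign)
score = Xc.T @ coeff                   # N x D, rows correspond to observations

X_tilde = U @ U.T @ X                  # the "reconstruction" from the question

print(score.shape)                     # (5, 3): not comparable to the D x N X_tilde
```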
Unfortunately, I discovered that score was not the same as $\tilde{X}_N$. What is it though? Thus, the points that I wanted to address were:
- What does score actually represent? What is a mathematical and intuitive explanation of what it is?
- If I want to use PCA as the tool to reconstruct vectors (or say images) as in a linear auto-encoder (aka PCA), should I use the variable score, or should I use what I understand as a reconstruction $ \tilde{X}_N = U U^T X_N$?
After doing some more digging in that documentation I found that one can make what I call a reconstruction with the following code:
X_tilde_score = ( score * coeff' + repmat(mu, [N,1]) )';
Which translates in equations to:
$$ \tilde{X} = (\mathrm{score} \, U^T + \bar{X}^T)^T$$
where $\bar{X} \in \mathbb{R}^{D \times N}$ is the horizontal concatenation of $N$ copies of the mean vector $\bar{x} = \frac{1}{N} \sum^N_{i=1} x^{(i)}$.
After some rearranging one can get:
$$ \mathrm{score}^T = U^T (\tilde{X} - \bar{X}) = U^T(X - \bar{X})$$
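This rearranged identity, and the reconstruction formula it came from, are easy to confirm numerically. A NumPy sketch (score is emulated the way `pca(X')` computes it, up to component signs):

```python
import numpy as np

rng = np.random.default_rng(3)
D, N = 3, 5
X = rng.random((D, N))

mu = X.mean(axis=1, keepdims=True)     # the mean vector x_bar (D x 1)
Xc = X - mu                            # X - X_bar
U, _, _ = np.linalg.svd(Xc)            # principal directions as columns of U

score = Xc.T @ U                       # emulates pca(X')'s score (up to signs)

# score^T = U^T (X - X_bar): coordinates of the centered data in the PC basis
print(np.allclose(score.T, U.T @ Xc))  # True

# and ( score * coeff' + repmat(mu,[N,1]) )' recovers X exactly
X_rec = (score @ U.T + mu.T).T
print(np.allclose(X_rec, X))           # True
```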
which seems a little weird to me, because that is not what I would have called "representations of X in the principal component space". It doesn't even seem to be a projection, since it does not obey $P^2 = P$ ($U^T U^T$ doesn't make sense, as it's rectangular). So I was wondering: what were the developers thinking when they defined score? Why would returning such a thing be better than $\tilde{X}$? Is there something about PCA I don't know or don't understand, and hence why I miss the purpose of score? Why is it meaningful to define score that way? (I don't think it's "wrong" or a bad definition; I genuinely want to understand the motivation behind it.)
If it helps to understand my perspective (and why I might be asking something that seems obvious to others): I mostly come from a Machine Learning, Linear Algebra and Computer Science background. In particular, I find auto-encoders interesting right now.
Comments:

…US, or equivalently as XV (actually, if PCA is performed on the covariance matrix rather than the scatter matrix, then sqrt(n)*U*S will stand in place of U*S, where n is the number of rows). What you are computing in your example is not this. – ttnphns Mar 20 '16 at 08:17

sqrt(n)*U is what is usually called standardized PC scores. – ttnphns Mar 20 '16 at 08:29
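The comment's claim can be verified directly: with the SVD of the centered data matrix with observations as rows, $X_c = U S V^T$, the scores are $US = X_c V$. A NumPy sketch (variable names like `scores_US` are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 5, 3
Xr = rng.random((N, D))               # rows are observations (pca's convention)

Xc = Xr - Xr.mean(axis=0)             # center each column
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores_US = U * S                     # U @ diag(S), i.e. U S
scores_XV = Xc @ Vt.T                 # equivalently X V

print(np.allclose(scores_US, scores_XV))   # True
```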