
Today, after learning how to perform $PCA$ using $SVD$, I know that $PCA$ keeps the $K$ components with the highest eigenvalues. My question is: which feature corresponds to which eigenvalue?

I mean, I have a data matrix $\hat{\mathbf{X}}$ and a list of column names, each name corresponding to one column of $\hat{\mathbf{X}}$. After applying $SVD$ to $\hat{\mathbf{X}}$, I get a sorted list of eigenvalues, and I don't know which column corresponds to which eigenvalue.

# list columns in data
selected_col=['temperatureMax', 'dewPoint', 'cloudCover', 'windSpeed', 'pressure',
            'visibility', 'humidity', 'uvIndex', 'temperatureMin']

X = df[selected_col]

Standardize:

X_std = standardized(X)

Apply SVD to the standardized data:

U, S, VT = np.linalg.svd(X_std, full_matrices=False)

Now I have S, which contains the eigenvalues sorted in descending order, but I don't know which column corresponds to the highest eigenvalue: is it 'temperatureMax', 'cloudCover' or another?

Any help is highly appreciated. Thank you.

Edit:

Thanks to @utobi for the answer. From it I found out how to find which variables are most correlated with each PC here; I hope it's useful.

manh3

1 Answer


Let $\mathbb{X}$ be the $n\times p$ matrix of observations, centred so that each of its columns has mean 0. Now consider the singular value decomposition (SVD)

$$ \mathbb{X} = U D V^\top, $$

where $U$ has orthonormal columns, $V$ is an orthogonal matrix and $D$ is the diagonal matrix of singular values. Assuming $\mathbb{X}$ has full rank $p$ and taking the thin SVD, $U$ is $n\times p$, $D$ is a diagonal $p\times p$ matrix and $V$ is also $p\times p$.

The principal components (PC) of $\mathbb{X}$ are the columns of

$$\mathbb{Y} = UD.$$

If we denote by $S = (n-1)^{-1}\mathbb{X}^\top \mathbb{X}$ the sample covariance matrix of $\mathbb{X}$, then the entries of $\mbox{diag}(D)^2/(n-1)$ coincide with the eigenvalues of $S$. Thus, once you have performed the SVD, you already have both the PCs and their variances.
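A quick numerical check of these identities, as a minimal sketch on synthetic data (the random matrix, its column scales and all variable names are assumptions for illustration, not the asker's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4

# Synthetic data with unequal column scales (illustrative assumption)
X = rng.normal(size=(n, p)) * np.array([3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)                    # centre each column

# Thin SVD: U is n x p, d holds the p singular values, Vt is p x p
U, d, Vt = np.linalg.svd(X, full_matrices=False)

Y = U * d                                 # principal components, i.e. U @ np.diag(d)

S = X.T @ X / (n - 1)                     # sample covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]     # eigvalsh sorts ascending, so reverse

svd_eigvals = d**2 / (n - 1)              # eigenvalues recovered from the SVD

print(np.allclose(svd_eigvals, eigvals))  # squared singular values vs eigenvalues
print(np.allclose(Y, X @ Vt.T))           # UD equals X V
```

Note that `np.linalg.svd` returns the singular values, not the eigenvalues, so the array called `S` in the question has to be squared and divided by $n-1$ before it is compared with the eigenvalues of the covariance matrix.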

Now I have S, which contains the eigenvalues sorted in descending order, but I don't know which column corresponds to the highest eigenvalue: is it 'temperatureMax', 'cloudCover' or another?

To answer this question it is best to look at PCA from a different perspective. It turns out that the $i$th PC, that is, the $i$th column $y_{\bullet i}$ of $\mathbb{Y}$, can be seen as a linear combination of the columns of $\mathbb{X}$

$$ y_{\bullet i} = a_{i1} x_{\bullet 1}+\cdots+ a_{ip} x_{\bullet p}, $$ where $(a_{i1},\ldots,a_{ip})$ is a unit vector chosen such that the sample variance of $y_{\bullet i}$ is maximised under the constraint that $y_{\bullet i}$ is orthogonal to $y_{\bullet j}$ for all $1\leq j< i$. The PCs are sorted by their variances (i.e. the eigenvalues of $S$). The first PC is the most important, i.e. it has the highest variance, the second is more important than the third, and so on.
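The link between this formulation and the SVD above can be made explicit. A short derivation, using only the definitions already introduced (the symbols $v_{\bullet i}$ and $u_{\bullet i}$ for the $i$th columns of $V$ and $U$, and $d_i$ for the $i$th singular value, are notation added here):

$$ \mathbb{Y} = UD = UDV^\top V = \mathbb{X}V \quad\Longrightarrow\quad y_{\bullet i} = \mathbb{X}v_{\bullet i} = \sum_{j=1}^{p} V_{ji}\, x_{\bullet j}, \qquad \frac{1}{n-1}\, y_{\bullet i}^\top y_{\bullet i} = \frac{d_i^2}{n-1}\, u_{\bullet i}^\top u_{\bullet i} = \frac{d_i^2}{n-1}. $$

So $a_{ij} = V_{ji}$: the coefficients of the $i$th PC are the $i$th column of $V$, which is the $i$th row of the `VT` array returned by `np.linalg.svd`, and the variance of that PC is the $i$th eigenvalue of $S$.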

In simple words, this means that a PC is a linear combination of all the original variables, so no single column corresponds to a given eigenvalue and your question, as posed, has no answer. What you can tell from the output of a PCA is, for instance:

Which variable contributes the most to a given PC?

The answer to this question is

the variable $x_{\bullet j}$ with the largest $|a_{ij}|$, or the variable having the highest correlation (in absolute value) with that PC.
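As a minimal sketch of how this could be read off the SVD output, the snippet below uses synthetic data with four of the column names from the question. The data, the induced dependence between two columns and the variable names are assumptions for illustration. It relies on $\mathbb{Y} = UD = \mathbb{X}V$, so row $i$ of `Vt` holds the coefficients $a_{i1},\ldots,a_{ip}$ of the $i$th PC.

```python
import numpy as np

# Four of the column names from the question; the data itself is synthetic
cols = ['temperatureMax', 'dewPoint', 'cloudCover', 'windSpeed']
rng = np.random.default_rng(1)
n, p = 300, len(cols)

X = rng.normal(size=(n, p))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]   # make dewPoint track temperatureMax

# Standardise: zero mean and unit sample variance per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

U, d, Vt = np.linalg.svd(X_std, full_matrices=False)
Y = U * d                                  # principal components

top_by_loading, top_by_corr = [], []
for i in range(p):
    loadings = Vt[i]                       # coefficients a_{i1}, ..., a_{ip}
    corr = np.array([np.corrcoef(X_std[:, k], Y[:, i])[0, 1] for k in range(p)])
    top_by_loading.append(cols[np.argmax(np.abs(loadings))])
    top_by_corr.append(cols[np.argmax(np.abs(corr))])
    print(f"PC{i + 1}: variance={d[i]**2 / (n - 1):.3f}, "
          f"largest |loading|: {top_by_loading[-1]}, "
          f"largest |correlation|: {top_by_corr[-1]}")
```

For standardised data the two criteria pick the same variable, because the correlation between $x_{\bullet j}$ and the $i$th PC equals $a_{ij}$ times the standard deviation of that PC. On unstandardised data they can differ.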

utobi
  • Thanks, you saved me time. I understand now that a single column of the original data cannot determine an entire PC. It's clear. – manh3 Jul 12 '23 at 07:06