
If I had a feature vector, $X$, and applied PCA or EFA to reduce it to a single variable, should we expect that variable to have strong correlations with each of its high-dimensional constituents?

LogCapy

2 Answers


Often it will, but not always. It depends on (a) the number of variables and (b) how strongly each of them relates to the component or factor.

For one example, suppose there are 10 mutually orthogonal variables that each contribute a small amount to one component. Then we would not expect high correlations between the individual variables and the factor score.

Or, suppose we have 10 variables, all highly correlated. Then we would expect high correlations between all the variables and the factor.

Or, suppose we had one variable that was a very good representation of the latent variable (or component) and 9 that added little information. Then only the first should have a high correlation.

If there are a lot of variables that have low correlations with the factor, it may be a sign that you need to extract more factors or drop some variables.
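To illustrate the first two scenarios, here is a minimal R sketch (the sample size, correlation strength, and variable names are my own illustrative choices, not from the answer):

set.seed(1)
n = 500

# Scenario 1: 10 roughly orthogonal (independent) variables
X_orth = matrix(rnorm(n*10), n, 10)
round(cor(X_orth, prcomp(X_orth, rank.=1)$x), 2)     # all correlations with PC1 are modest

# Scenario 2: 10 variables driven by one strong common factor
f = rnorm(n)
X_common = sapply(1:10, function(i) 0.9*f + sqrt(1-0.9^2)*rnorm(n))
round(cor(X_common, prcomp(X_common, rank.=1)$x), 2) # all correlations with PC1 are high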

Peter Flom

Yes, but only if you have a good case for PCA, i.e. all of the original features are well represented in the first PC and PC1 explains most of the variance (correlation).

Consider this representation of PCA of the feature matrix $X_{ki}$, where the variables $i$ are in columns and the observations $k$ are in rows: $$X'X=VDV'$$ Here $V$ holds the PCA loadings in its columns, i.e. $V_{ij}$ is the coefficient of variable $i$ in PC $j$; and $D$ is a diagonal matrix where $D_{jj}$ is the explained sum of squares (proportional to the variance) of PC $j$. Let's rearrange: $$X'XV=VD$$

Since the scores, or principal components, are the columns of $XV$, the left-hand side represents (up to a constant factor) the covariance matrix between the original features and the principal components. If your features are normalized, it likewise captures the correlations.

So, if we look at the first column of the $VD$ matrix, its entries $V_{i1}D_{11}$ represent the covariance (correlation) of each feature $i$ with the first component score, i.e. the first column of $XV$. You can see that this covariance (correlation) is high when:

  • variable $i$'s coefficient $V_{i1}$ in the first component is high
  • the explained variance $D_{11}$ of the first component is high

As you may notice, these two conditions are exactly the ones under which PCA works best, i.e. when the first component explains almost all of the variance (correlation) and all of the original features are well represented in it.
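As a quick numerical check of the identity above (a sketch on simulated data; the mixing matrix and sample size are made up for illustration):

set.seed(123)
n = 200
X = scale(matrix(rnorm(n*3), n, 3) %*% matrix(runif(9), 3, 3)) # centered, unit-variance features
p = prcomp(X)
V = p$rotation                        # loadings V
D = diag(p$sdev^2 * (n-1))            # eigenvalues of X'X on the diagonal
max(abs(t(X) %*% X %*% V - V %*% D))  # ~0, verifying X'X V = V D
cor(X, p$x)[,1]                       # correlation of each feature with PC1 ...
(V %*% D)[,1] / ((n-1) * p$sdev[1])   # ... matches the first column of V D after rescaling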

Simulation Examples

Here's an example of a nearly ideal condition for PCA, where the two original features are highly correlated:

[scatter plot of the two highly correlated features]

The correlation of PC1 with the original features is high:

> cor(X1,pZ1$x)
         PC1
x  0.9397134
y1 0.9412734

The full R simulation code:

x = runif(100) - 0.5            # first feature, uniform on [-0.5, 0.5]
y = runif(100) - 0.5            # independent noise with the same distribution
rho = 0.8                       # target correlation between the two features
y1 = rho*x + sqrt(1-rho^2)*y    # second feature, correlated with x at rho
X1 = cbind(x, y1)
plot(X1)                        # scatter plot of the two correlated features
pZ1 = prcomp(X1, rank.=1)       # keep only the first principal component
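To connect this output to the algebra above (my own check, not part of the original answer): the correlation of feature $i$ with PC1 equals $V_{i1}\sigma_1/\mathrm{sd}(x_i)$, where $\sigma_1$ is the standard deviation of PC1:

pZ1$rotation[,1] * pZ1$sdev[1] / apply(X1, 2, sd)  # reproduces cor(X1, pZ1$x) above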

The next example is for the case when PCA doesn't really work: two uncorrelated variables.

[scatter plot of the two uncorrelated features]

Here the correlation of PC1 with one of the variables is high and with the other is low:

> cor(X,pZ$x)
         PC1
x -0.2858234
y  0.9693105

The reason is that PCA cannot reduce the dimensionality of the feature matrix: PC1 is essentially just one of the variables, and PC2 is the other.

The full R simulation code:

X = cbind(x, y)                 # x and y from above are uncorrelated by construction
plot(X)                         # scatter plot of the two uncorrelated features
pZ = prcomp(X, rank.=1)         # again keep only the first principal component
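As an extra check (not in the original output), the loadings and the explained variance make the failure visible directly:

round(pZ$rotation, 2)                                     # PC1 loads mostly on a single variable
summary(prcomp(X))$importance["Proportion of Variance", ] # PC1 explains well under 100% of the variance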
Aksakal