
This question is closely related to "Is it acceptable to reverse a sign of a principal component score?".

I have a matrix A with samples in rows and features in columns. I apply PCA to A and get a matrix of samples in PC space (PCA_1). In this reduced space, I calculate the correlation between samples.

Now I redo the exact same process. As the sign of a PC is arbitrary, I might get a different matrix of samples in PC space (PCA_2, with an inverted sign on PC2). Nothing to worry about, the interpretation is the same. But when I now calculate the correlation between samples, it is quite different.

> PCA_1
        PC1 PC2 PC3
Sample1   1   1  -2
Sample2   2  -2  -4
Sample3   4  -3  -6
> PCA_2
        PC1 PC2 PC3
Sample1   1  -1  -2
Sample2   2   2  -4
Sample3   4   3  -6
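
(As an aside, a minimal sketch of why the sign itself is harmless, assuming prcomp on some arbitrary centered data: flipping a score column together with the matching loading column leaves the reconstruction of the data unchanged.)

set.seed(1)
X <- scale(matrix(rnorm(30), nrow = 10), center = TRUE, scale = FALSE)
p <- prcomp(X, center = FALSE)            # X is already centered

# Flip the sign of PC2 in the scores AND in the loadings:
scores   <- p$x;        scores[, 2]   <- -scores[, 2]
rotation <- p$rotation; rotation[, 2] <- -rotation[, 2]

# Both sign conventions reconstruct exactly the same data:
all.equal(p$x %*% t(p$rotation), scores %*% t(rotation))  # TRUE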

I created a reproducible example in R by building example matrices in PC space with different signs (I didn't manage to reproduce PCA with different signs on my own computer, but I know it happens when running on two different computers):

PCA_1 = matrix(c(1, 1, -2, 2, -2, -4, 4, -3, -6), byrow = TRUE, nrow = 3,
               dimnames = list(c("Sample1", "Sample2", "Sample3"), c("PC1", "PC2", "PC3")))
PCA_2 = matrix(c(1, -1, -2, 2, 2, -4, 4, 3, -6), byrow = TRUE, nrow = 3,
               dimnames = list(c("Sample1", "Sample2", "Sample3"), c("PC1", "PC2", "PC3")))

PCA_1
PCA_2

cor(t(PCA_1))
cor(t(PCA_2))

Which produces:

> cor(t(PCA_1))
          Sample1   Sample2   Sample3
Sample1 1.0000000 0.7559289 0.7313071
Sample2 0.7559289 1.0000000 0.9993217
Sample3 0.7313071 0.9993217 1.0000000
> cor(t(PCA_2))
          Sample1   Sample2   Sample3
Sample1 1.0000000 0.7559289 0.8122396
Sample2 0.7559289 1.0000000 0.9958706
Sample3 0.8122396 0.9958706 1.0000000

The correlation between Sample1 and Sample3 is different (0.7313071 vs 0.8122396). Why? And how can I know which one is closer to the truth?


1 Answer


Here you are basically computing correlations between the rows of a matrix, and it is correct that they come out different. Try doing the correlation computation by hand, following the definition: first the mean of Sample1, then the sum of its deviations from that mean.

>>> s1_mean=(1+1-2)/3
>>> s1_mean
0.0
>>> ((1-s1_mean)+(1-s1_mean)+(-2-s1_mean))
0.0

while for the transformed data:

>>> s1_mean=(1-1-2)/3
>>> s1_mean
-0.6666666666666666
>>> ((1-s1_mean)+(-1-s1_mean)+(-2-s1_mean))
-4.440892098500626e-16
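
Both sums of deviations are of course (numerically) zero, as they must be; the point is that the mean of Sample1 has moved from 0 to -2/3, so Sample1's deviations from its own mean, and hence its correlations with the other rows, change. A quick check on the question's own matrices makes this visible:

rowMeans(PCA_1)
#  Sample1   Sample2   Sample3
# 0.000000 -1.333333 -1.666667
rowMeans(PCA_2)
#    Sample1    Sample2    Sample3
# -0.6666667  0.0000000  0.3333333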

Beware that you are changing the sign of a single variable (PC2), but you are then transposing the matrix before computing correlations. This yields a new set of variables (the samples), each of which has a single element multiplied by -1.

If you remove the transpose, the correlation between PC1 and PC3 stays the same, while the correlations between PC2 and PC1/PC3 simply flip sign:

> cor(PCA_1)
           PC1        PC2        PC3
PC1  1.0000000 -0.8910421 -0.9819805
PC2 -0.8910421  1.0000000  0.9607689
PC3 -0.9819805  0.9607689  1.0000000
> cor(PCA_2)
           PC1        PC2        PC3
PC1  1.0000000  0.8910421 -0.9819805
PC2  0.8910421  1.0000000 -0.9607689
PC3 -0.9819805 -0.9607689  1.0000000
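
The same contrast in miniature, using the PC1 and PC2 columns above as example vectors: flipping a whole variable only flips the sign of its correlation, while flipping a single element also changes the magnitude.

x <- c(1, 2, 4)          # PC1
y <- c(1, -2, -3)        # PC2
cor(x, y)                # -0.8910421
cor(x, -y)               #  0.8910421  (whole column flipped: same magnitude, opposite sign)
cor(x, c(-1, -2, -3))    # -0.9819805  (only the first element flipped: magnitude changes)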

Note that $(X_i - \bar{X})(Y_i - \bar{Y})$ is positive if and only if $X_i$ and $Y_i$ lie on the same side of their respective means. Thus the correlation coefficient is positive if $X_i$ and $Y_i$ tend to be simultaneously greater than, or simultaneously less than, their respective means.

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Interpretation

If you change the sign of one of the two variates, values that previously lay on the same side of their respective means now lie on opposite sides.
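
To tie this back to the question's numbers: Sample1 and Sample3 differ between PCA_1 and PCA_2 only in the sign of their PC2 entries, and that single flip is enough to move their correlation:

# Sample1 vs Sample3, first with the PCA_1 signs, then with the PCA_2 signs:
cor(c(1,  1, -2), c(4, -3, -6))   # 0.7313071
cor(c(1, -1, -2), c(4,  3, -6))   # 0.8122396

Both values follow correctly from their inputs; the difference comes entirely from the arbitrary sign convention.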