I recently ran a PCA on a dataset of self-report data from 226 subjects to zoom in on which specific individual differences might account for participants’ predicted choices in a separate task we have them complete. Here are the resulting loading scores for PCs 1, 2, and 3:

[Image: loading scores for PCs 1, 2, and 3, n = 226]

I added another 67 subjects to the dataset, reran everything exactly the same, and out came these loading scores:

[Image: loading scores for PCs 1, 2, and 3, n = 293]

As you can see, many of the self-report loading scores for PC1 and PC2 have switched sign between the first run (n=226) and the updated run (n=293), yet their magnitudes stay more or less the same. This is puzzling to me, since we only added 67 subjects. How can the addition of just 67 subjects explain the sign switches in the loading scores?

Let me know if I should elaborate on any of this. Any and all guidance would be much appreciated!

Thank you so much in advance.

Mel
  • What about the eigenvalues? Have they changed as well? Be aware, however, that the loadings are determined only up to sign, in the sense that if you multiply all the loadings of a component by -1, they are still valid PCA loadings. – utobi Feb 02 '23 at 21:20
  • I wonder if "adding data" is a red herring. Eigenvectors of the covariance matrix are only identified up to a nonzero multiple. If the eigenvector's norm is constrained to be 1, then the eigenvector is only identified up to a change in sign. This is a property of eigenvectors themselves: $\Sigma v = \lambda v \Leftrightarrow \Sigma(-v) = \lambda (-v)$ for an eigenvector $v$. – Sycorax Feb 02 '23 at 21:26
  • @Sycorax This is actually an answer, isn't it? – Christian Hennig Feb 02 '23 at 21:51
  • There is a question of the extent to which adding data could change the result for other reasons, but I think that this explanation probably accounts for most of the observed difference. – Sycorax Feb 02 '23 at 22:05
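
The sign indeterminacy described in the comments can be checked numerically. The following is only a minimal sketch under stated assumptions: the data are synthetic (random numbers standing in for the self-report measures, with 6 made-up items), and `pca_loadings` and `align_signs` are hypothetical helper names, not part of any library or of the original analysis. It verifies that both $v$ and $-v$ satisfy $\Sigma v = \lambda v$, and shows one possible way to make two runs comparable by flipping each component so that it points the same way as the matching component of a reference run.

```python
import numpy as np


def pca_loadings(X, n_components=3):
    """Eigenvalues and unit-norm eigenvectors of the covariance matrix of X,
    ordered from largest to smallest eigenvalue (hypothetical helper)."""
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]


def align_signs(loadings, reference):
    """Flip each column of `loadings` so that its dot product with the
    matching column of `reference` is non-negative (hypothetical helper)."""
    signs = np.sign(np.sum(loadings * reference, axis=0))
    signs[signs == 0] = 1.0  # leave exactly orthogonal columns untouched
    return loadings * signs


# Synthetic stand-in data: 293 "subjects", 6 correlated "items" (assumption).
rng = np.random.default_rng(0)
n_items = 6
mixing = rng.normal(size=(n_items, n_items))
X_all = rng.normal(size=(293, n_items)) @ mixing
X_first = X_all[:226]  # the first run uses only the first 226 subjects

vals_a, load_a = pca_loadings(X_first)
vals_b, load_b = pca_loadings(X_all)

# Both v and -v are valid eigenvectors for the same eigenvalue.
cov_a = np.cov(X_first, rowvar=False)
v = load_a[:, 0]
assert np.allclose(cov_a @ v, vals_a[0] * v)
assert np.allclose(cov_a @ (-v), vals_a[0] * (-v))

# Align the second run to the first before comparing loadings.
load_b_aligned = align_signs(load_b, load_a)
```

Aligning by the sign of the dot product only makes sense when component k in one run really corresponds to component k in the other, which is why comparing the eigenvalues of the two runs, as suggested in the first comment, is a useful check before comparing loadings.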
