I conducted a PCA on dichotomous variables (0's and 1's). The dataset consists of human subjects and a few thousand genetic variants, where the presence of a genetic variant is indicated with 0's and 1's.
My first PC correlates > .9 with the number of 1's per subject.
Is this expected?
Could this be a consequence of the fact that PCA is not really meant for binary data?
Or does this simply mean that subjects with more 1's resemble other subjects with more 1's (i.e., the more genetic variants a subject carries, the more likely it is that those are the same variants as in another individual with approximately the same number of variants)?
Or could there be an alternative explanation?
I hope the problem is well specified, otherwise please let me know! Many thanks!
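To make the question concrete, here is a minimal simulation sketch in Python/NumPy. It is not my real data: the generative model (each subject has its own overall "load" `subject_load`, each variant its own frequency `variant_freq`) and all names and sizes are assumptions chosen only for illustration. It shows how one could check the correlation between PC1 scores and the number of 1's per subject.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_variants = 200, 1000

# Hypothetical generative model (an assumption, not the real data):
# subject i carries variant j with probability subject_load[i] * variant_freq[j].
subject_load = rng.uniform(0.05, 0.5, size=n_subjects)
variant_freq = rng.uniform(0.2, 1.0, size=n_variants)
prob = np.outer(subject_load, variant_freq)
X = (rng.random((n_subjects, n_variants)) < prob).astype(float)

# Covariance-based PCA: subtract column means, then take the SVD.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = U[:, 0] * S[0]  # subject scores on the first PC

# Number of 1's per subject (row sums), not standardized.
ones_per_subject = X.sum(axis=1)

# The sign of a PC is arbitrary, so look at the absolute correlation.
r = np.corrcoef(pc1_scores, ones_per_subject)[0, 1]
print(f"|corr(PC1, number of 1's)| = {abs(r):.3f}")
```

Under this assumed model, the dominant source of variation between subjects is simply how many variants they carry, so a strong correlation between PC1 and the row sums would be expected. Whether the same mechanism is at work in the real data is exactly what I am asking.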
The number of 1's per subject is also not standardized (it is just a column with the sum of all 1's for each subject/row). – Abdel May 15 '13 at 17:01
either the correlation or the covariance. Linear PCA works (and is being done) on any scalar-product similarity. It may be covariances or correlations or cosines or raw SSCPs. In the latter two cases no mean subtraction occurs, which affects the PCs greatly. – ttnphns May 15 '13 at 20:09
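The point about mean subtraction can be illustrated with a small sketch, again on simulated binary data rather than the data from the question. The helper `first_pc` and the generative model are hypothetical, introduced only for this example. It compares PCA on column-centered data (covariance-type) with PCA on the raw data (SSCP-type, no centering).

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_variants = 200, 1000

# Hypothetical binary data (an assumption for illustration only).
subject_load = rng.uniform(0.05, 0.5, size=n_subjects)
variant_freq = rng.uniform(0.2, 1.0, size=n_variants)
X = (rng.random((n_subjects, n_variants))
     < np.outer(subject_load, variant_freq)).astype(float)


def first_pc(data, center):
    """Return the first PC's loadings and its share of the total sum of squares."""
    A = data - data.mean(axis=0) if center else data
    _, S, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[0], S[0] ** 2 / np.sum(S ** 2)


v_cov, share_cov = first_pc(X, center=True)    # covariance-type PCA
v_raw, share_raw = first_pc(X, center=False)   # raw SSCP-type PCA

# Without centering, the first component tends to point along the
# vector of column means, i.e. it largely describes the average profile.
col_means = X.mean(axis=0)
cos_raw_mean = abs(v_raw @ col_means) / np.linalg.norm(col_means)

print(f"share of PC1, centered:   {share_cov:.3f}")
print(f"share of PC1, uncentered: {share_raw:.3f}")
print(f"|cos(uncentered PC1, column means)| = {cos_raw_mean:.3f}")
```

In this simulated setting the uncentered first component mostly reproduces the mean profile and takes a larger share of the total sum of squares, whereas the centered one describes variation around the mean. Which matrix was actually analysed therefore matters for interpreting the first PC.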