
I just did my first full PCA on the Iris dataset. I understand the math behind PCA, but less so the statistics. I started by plotting the two variables with the highest covariance, just for the fun of comparing their scatter plot to the final PCA (after reducing to two PCs). The result surprised me: the scatter plots had the same general shape, but the slopes appeared to be opposite. This makes me think I may have messed up somewhere.

Here is the graph of the two variables that I calculated as having the highest covariance:

[scatter plot of the highest-covariance variable pair]

And then, here is the graph of my top two principal components:

[scatter plot of the first two principal components]

My full calculations are here, in a Jupyter notebook. I was wondering if there are any obvious errors, or if this sort of behavior is possible:

https://github.com/wcneill/jupyter_practice/blob/master/seaborn-practice.ipynb
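For reference, here is a minimal sketch of the comparison I am describing, using scikit-learn's built-in copy of the Iris data (the variable names here are my own; the notebook may differ in details):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # shape (150, 4)

# Find the pair of distinct variables with the highest absolute covariance.
cov = np.cov(X, rowvar=False)
off_diag = np.abs(cov - np.diag(np.diag(cov)))  # zero out the variances
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print(f"Highest-covariance pair: columns {i} and {j}")

# Project the data onto the top two principal components.
scores = PCA(n_components=2).fit_transform(X)  # shape (150, 2)
```

Plotting `X[:, i]` against `X[:, j]` and `scores[:, 0]` against `scores[:, 1]` gives the two scatter plots I compared.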

rocksNwaves

1 Answer


The points in the PCA plot are computed as $\langle \vec{x},\vec{e}_i\rangle$, where the $\vec{e}_i$ are eigenvectors of the covariance matrix $\Sigma$: $$\Sigma\cdot \vec{e}_i = \lambda_i \vec{e}_i$$ This equation still holds if $\vec{e}_i$ is multiplied on both sides by the same arbitrary factor. Usually, the eigenvectors are normalized to $|\vec{e}_i|=1$, but even this normalization does not uniquely determine $\vec{e}_i$, because it also holds for $-\vec{e}_i$. So the mirrored slope you observe is expected: the sign of each principal component is arbitrary, and flipping it mirrors the scatter plot along that axis without changing its shape.
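The sign ambiguity is easy to verify numerically. This sketch (assuming the Iris data from scikit-learn) checks that both $\vec{e}_1$ and $-\vec{e}_1$ satisfy the eigenvalue equation, and that flipping the sign simply negates the PC scores:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X = X - X.mean(axis=0)  # center the data, as PCA does

cov = np.cov(X, rowvar=False)
vals, vecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
e1 = vecs[:, -1]                  # eigenvector of the largest eigenvalue

# Both signs satisfy Sigma e = lambda e:
assert np.allclose(cov @ e1, vals[-1] * e1)
assert np.allclose(cov @ -e1, vals[-1] * -e1)

# The PC scores just flip sign; the scatter shape is identical, only mirrored.
scores_pos = X @ e1
scores_neg = X @ -e1
assert np.allclose(scores_neg, -scores_pos)
```

Which sign a given routine returns is an implementation detail, so two otherwise correct PCA computations can legitimately produce mirror-image plots.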

cdalitz