
I'm trying to understand the process for statistical testing for principal component analysis or partial least squares.

Step 1. PCA: I feel that I have a not-terrible understanding of PCA: You find the ellipsoid described by the covariance matrix of the data, and then successively take the largest axis of variation (principal component 1), then the second largest (principal component 2), and so on. If the ellipsoid is long and stretched, then the variation is mostly along the first principal component (the eigenvector corresponding to the largest eigenvalue of the ellipsoid). If the ellipsoid is a planar "disc", then the variation in the data is explained well by two principal components, etc.
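The description above can be sketched directly: center the data, eigendecompose the covariance matrix, and sort the eigenvectors by eigenvalue. This is a minimal numpy illustration (the toy data and variable names are my own, not from any particular paper); the ratio of each eigenvalue to their sum is the familiar "variance explained" per component.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 100 samples, 3 features, most variance along the first direction
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.2])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 3 x 3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # re-sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# fraction of total variance along each principal component
explained = eigvals / eigvals.sum()
```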

I also understand that after choosing to use (for example) only the first two principal components, then all of the data points can be plotted on a "Scores" plot that shows, for each data point $D^{(i)}$, the projection of $D^{(i)}$ into the plane spanned by the first two principal components. Likewise, for the "Loadings" plot (I think) you write the first and second principal components as linear combinations of the input variables and then for each variable, plot the coefficients that it contributes to the first and second principal components.
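Continuing the sketch, the scores are the centered data projected onto the first two eigenvectors, and the loadings are the eigenvector entries themselves, read row-by-row per variable. Again a hedged numpy illustration with made-up data, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.2])
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]

# scores: each data point projected onto the plane spanned by PC1 and PC2;
# a "Scores" plot is a scatter of these two columns
scores = Xc @ eigvecs[:, :2]            # shape (100, 2)

# loadings: row i gives variable i's coefficients in PC1 and PC2;
# a "Loadings" plot is a scatter of these rows
loadings = eigvecs[:, :2]               # shape (3, 2)
```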

Step 2. PLS or PLS-DA: If there are labels on the data (let's say binary classes), then build a linear regression model that uses the first and second principal components to discriminate class 0 (for data point $i$, that means $Y^{(i)}=0$) from class 1 ($Y^{(i)}=1$): first project all data onto the plane spanned by the first and second principal components, and then regress the projected input data $X_1', X_2'$ against $Y$. This regression can be written as (first step) an affine transformation (i.e. linear transformation + bias) that projects onto $PC_1, PC_2$ (the first and second principal components), and then (second step) a second affine transformation that predicts $Y$ from $PC_1, PC_2$. Together these transformations $Y \approx Affine(Affine(X))$ collapse into a single affine transformation $Y \approx C (A X + B) + D = E X + F$, with $E = CA$ and $F = CB + D$.
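The two-step construction just described (project onto PCs, then regress the scores against $Y$) is, strictly speaking, principal component regression rather than canonical PLS, a distinction the comments below pick up on. Either way, the claim that the two affine maps collapse into one can be checked numerically. A minimal sketch with invented data, where $A, B$ project onto the top two PCs and $C, D$ come from least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # binary labels

# first affine map (A, B): projection onto the top two PCs
mu = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X - mu, rowvar=False))
A = eigvecs[:, np.argsort(eigvals)[::-1][:2]].T   # (2, 4)
B = -A @ mu                                       # centering folded into the bias
T = X @ A.T + B                                   # scores, shape (50, 2)

# second affine map (C, D): least-squares regression of y on the scores
design = np.column_stack([T, np.ones(len(T))])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
C, D = coef[:2], coef[2]

# composed single affine map: y ≈ E x + F with E = C A, F = C B + D
E = C @ A
F = C @ B + D

pred_two_step = T @ C + D
pred_composed = X @ E + F
```

The two prediction vectors agree to machine precision, confirming the composition.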

Step 3. Testing variables from $X$ for significance in predicting the class $Y$: This is where I could use some help (unless I'm way off already, in which case tell me!). How do you take an input variable (i.e. a feature that has not yet been projected onto the principal-component (hyper)plane) and decide whether it has a statistically significant coefficient in the regression $Y \approx E X + F$? Qualitatively, a coefficient in $E$ that is further from zero (i.e. a positive or negative value with large magnitude) indicates a larger contribution from that variable.

I remember seeing t-tests on linear regression coefficients for normally distributed data (to test whether each coefficient is zero). Is this the standard approach? In that case, I would guess that every variable in $X$ has been transformed to have a roughly normal distribution in Step 0 (i.e. before any of these other steps are performed).
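For reference, the t-test I have in mind is the standard one for ordinary least squares: each coefficient divided by its standard error, compared against a $t$ distribution with $n - p$ degrees of freedom. A self-contained sketch on simulated data (whether this carries over to the composed coefficients $E$ after PC projection is exactly what I'm unsure about):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + rng.normal(size=n)    # only variable 0 truly matters

D = np.column_stack([np.ones(n), X])      # design matrix with intercept
beta, *_ = np.linalg.lstsq(D, y, rcond=None)

resid = y - D @ beta
df = n - D.shape[1]                       # residual degrees of freedom
sigma2 = resid @ resid / df               # unbiased noise variance estimate
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(D.T @ D)))

t_stats = beta / se                       # H0: coefficient is zero
p_values = 2 * stats.t.sf(np.abs(t_stats), df)
```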

Otherwise, I could see performing a permutation test (by running this entire procedure thousands of times and each time permuting $Y$ to shuffle the labels, and then comparing each single coefficient in $E$ from the un-shuffled analysis to the distribution of coefficients from shuffled analyses).
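The permutation test idea can be sketched as follows. For brevity this refits a plain least-squares regression at each permutation rather than the full PCA-then-regression pipeline, but in practice the entire procedure would be rerun on each shuffled $Y$, exactly as described above. All data and names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = rng.normal(size=(n, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(float)

def fit_coeffs(X, y):
    """Least-squares slopes of y on X (intercept included, then dropped)."""
    D = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    return beta[1:]

observed = fit_coeffs(X, y)

n_perm = 1000
null = np.empty((n_perm, X.shape[1]))
for i in range(n_perm):
    null[i] = fit_coeffs(X, rng.permutation(y))   # shuffle labels, refit

# two-sided permutation p-value per coefficient; the +1 convention keeps
# the smallest attainable p-value at 1/(n_perm + 1) rather than 0
p_values = (1 + (np.abs(null) >= np.abs(observed)).sum(axis=0)) / (1 + n_perm)
```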

Can you help me see anywhere my intuition is failing? I've been trying to look through papers using similar procedures to see what they did, and as is often the case, they're clear as mud. I'm preparing a tutorial for some other researchers, and I want to do a good job.

    If you are using PCs from a PCA within some other procedure, their origin in PCA is immaterial for significance testing with that other procedure. That is a little contentious, as statistical people don't all agree on whether PCA is a multivariate transformation procedure or model estimation, but I think it is a good first approximation. If that argument is accepted, then your question is just about significance testing in whatever you are doing and covered by any standard account. Do you regard linear regression and PLS as equivalent? You seem unclear which you are using. – Nick Cox Nov 02 '13 at 12:13
  • @NickCox Thanks, good comment. With PCA, do you think the description (as the first part of "significant" feature selection where the features have covariation) is correct? And for LinReg vs. PLS: I was indeed conflating linear regression for a single variate binary $Y$ variable as equivalent to PLS, but come to think of it, I'm not sure why (I guess I thought minimum square error would also be maximum margin discrimination)-- is it not true? – user Nov 02 '13 at 21:01
  • Sorry, but I don't understand what you are seeking here either from me or from the site. As you say, linear regression does not mean PLS, or vice versa. – Nick Cox Nov 03 '13 at 10:16
  • I am trying to get a good understanding of what type of significance testing procedure (from start to finish) is used when doing PCA-PLS (sometimes called PLS-DA). This is commonly used in processing metabolomic data (http://goo.gl/TPM3iV does not really describe the statistical testing). 2) From the Wikipedia article on Partial Least Squares: "it finds a linear regression model by projecting the predicted variables and the observable variables to a new space". I meant performing a linear regression in the transformed space ($PC_1$ and $PC_2$). – user Nov 03 '13 at 14:19
  • I can't say much more. Perhaps you need to ask in some chemometrics forum, so no expert here is biting on this yet. A significance testing procedure for the whole of what you do would need a probability model for the whole of what you do. On the face of it, that would be a lot of work to set up and evaluate. – Nick Cox Nov 03 '13 at 16:08
  • Still waiting on an explanation for why the Wikipedia page says that "Partial Least Squares... finds a linear regression model...". If you don't know, then it's hard to blame you-- I don't know why it says that either if the two are so diametrically opposed as you say! – user Nov 04 '13 at 16:33
  • They are not synonymous; that's all I am saying. You'd better ask a new question. – Nick Cox Nov 04 '13 at 16:55
  • I wonder why Wikipedia would say that? – user Nov 04 '13 at 21:45
  • You may want to check out the recent publication on this subject http://www.researchgate.net/publication/264936800_Interpretation_of_Variable_Importance_in_Partial_Least_Squares_with_Significance_multivariate_correlation_(SMC) Success! – Thanh Tran Oct 04 '14 at 17:21