36

In principal component analysis (PCA), one can choose either the covariance matrix or the correlation matrix to find the components (from their respective eigenvectors). These give different results (PC loadings and scores), because the two matrices do not share the same eigenvectors. My understanding is that this is caused by the fact that a raw data vector $X$ and its standardization $Z$ cannot be related via an orthogonal transformation. Mathematically, similar matrices (e.g. those related by an orthogonal transformation) have the same eigenvalues, but not necessarily the same eigenvectors.
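A small numeric sketch (in Python/NumPy, with made-up data — not part of the original post) illustrates that the two eigenbases generally differ:

```python
import numpy as np

# Made-up correlated data with very different variances per column
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.5], [0.5, 1.0]])

cov = np.cov(X, rowvar=False)        # covariance matrix
corr = np.corrcoef(X, rowvar=False)  # correlation matrix

_, vec_cov = np.linalg.eigh(cov)
_, vec_corr = np.linalg.eigh(corr)

# The leading eigenvectors point in visibly different directions:
# their absolute dot product is well below 1 (it would be 1 if they agreed).
print(abs(vec_cov[:, -1] @ vec_corr[:, -1]))
```

(In two dimensions the effect is especially stark: a 2x2 correlation matrix always has its eigenvectors at 45 degrees to the axes, regardless of the data, while the covariance eigenvectors depend on the variances.)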

This raises some difficulties in my mind:

  1. Does PCA actually make sense, if you can get two different answers for the same starting data set, both trying to achieve the same thing (finding the directions of maximum variance)?

  2. When using the correlation matrix approach, each variable is being standardized (scaled) by its own individual standard deviation before the PCs are calculated. How, then, does it still make sense to find the directions of maximum variance if the data have already been scaled/compressed differently beforehand? I know that correlation-based PCA is very convenient (standardized variables are dimensionless, so their linear combinations can be added; other advantages are also based on pragmatism), but is it correct?

It seems to me that covariance-based PCA is the only truly correct one (even when the variances of the variables differ greatly), and that whenever this version cannot be used, correlation-based PCA should not be used either.

I know that there is this thread: PCA on correlation or covariance? -- but it seems to focus only on finding a pragmatic solution, which may or may not also be an algebraically correct one.

ttnphns
Lucozade
    I'm going to be honest and tell you I quit reading your question at some point. PCA makes sense. Yes, the results may differ depending on whether you use the correlation or the variance/covariance matrix. Correlation-based PCA is preferred when your variables are measured on different scales but you don't want that to dominate the outcome. Imagine you have a series of variables that range from 0 to 1 and then some that have very large values (relatively speaking, like 0 to 1000); the large variance associated with the second group of variables will dominate. – Patrick Jun 27 '13 at 00:30
  • I changed the title, to mark the difference with previous questions on the topic. I hope the new title is OK. – Gala Jun 27 '13 at 07:09
  • 1
    @Patrick: (1) please read the full question before answering, as a courtesy & generally sensible approach. (2) Your example illustrates the point: if I convert the [0,1000] interval to dBA or any log scale, the data now range from $-\infty$ to 30, i.e., the values originally close to zero (say, 0.001) are stretched and end up much further from the new (log) center than the original 1000 does. Scaling (including dividing by the individual s.d.) enables data points -- particularly outliers -- to be moved almost anywhere. This is the case even if all variables are measured on the same scale. – Lucozade Jun 27 '13 at 09:23
  • 4
    But that's the case with many other techniques as well and I think Patrick's point is reasonable. Also it was merely a comment, no need to become aggressive. Generally speaking, why would you assume that there should be one true “algebraically” correct way to approach the problem? – Gala Jun 27 '13 at 09:55
  • 1
    @Gala: Because both approaches claim to solve the same problem (see pt. 1 of my answer to ttnphns). Moreover, in e.g. linear regression, there is a set of specific conditions that must be satisfied to be able to use the method. Between cov-PCA and corr-PCA, I have not yet seen (a) clear rule(s) or division for when each of these should/should not be applied, how the two methods diverge/converge, under which conditions, etc. PS: I did not intend any aggression; on the contrary. Perhaps this rather applies to anyone who writes "I quit reading your question" but still comments nevertheless. – Lucozade Jun 27 '13 at 11:17
  • 6
    Perhaps you're thinking of PCA in the wrong way: it's just a transformation, so there's no question of its being correct or incorrect, or relying on assumptions about the data model - unlike, say, regression or factor analysis. – Scortchi - Reinstate Monica Jun 27 '13 at 11:37
  • 6
    The crux of this matter appears to rest on a misunderstanding of what standardization does and how PCA works. This is understandable, because a good grasp of PCA requires visualization of higher-dimensional shapes. I would maintain that this question, like many other questions based on some sort of misapprehension, is thereby a good one and ought to remain open, because its answer(s) can reveal truths that many people might not have fully appreciated before. – whuber Jun 27 '13 at 14:36
  • 8
    PCA does not “claim” anything. People make claims about PCA and in fact use it very differently depending on the field. Some of these uses might be silly or questionable but it does not seem very enlightening to assume that a single variant of the technique must be the “algebraically correct” one with no reference to the context or goal of the analysis. – Gala Jun 27 '13 at 20:54

3 Answers

36

I hope these responses to your two questions will calm your concern:

  1. A correlation matrix is the covariance matrix of the standardized (i.e. not just centered but also rescaled) data; that is, the covariance matrix of a different dataset, as it were. So it is natural, and it shouldn't bother you, that the results differ.
  2. Yes, it makes sense to find the directions of maximal variance in the standardized data: they are the directions of, so to speak, "correlatedness" rather than "covariatedness"; that is, the directions that remain after the effect of the original variables' unequal variances on the shape of the multivariate data cloud has been removed.
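Point 1 can be verified directly (a Python/NumPy sketch with made-up data, not from the original answer): the correlation matrix of raw data equals the covariance matrix of the z-standardized data.

```python
import numpy as np

# Made-up data whose columns live on wildly different scales
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])

# z-standardize: center each column and divide by its own standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The correlation matrix of X is (up to float error) the covariance matrix of Z
print(np.allclose(np.corrcoef(X, rowvar=False), np.cov(Z, rowvar=False)))  # True
```

So corr-PCA on `X` is literally cov-PCA on the different dataset `Z`; there is no contradiction in the two analyses disagreeing.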

The next text and pictures were added by @whuber (I thank him; also, see my comment below).

Here is a two-dimensional example showing why it still makes sense to locate the principal axes of standardized data (shown on the right). Note that in the right hand plot the cloud still has a "shape" even though the variances along the coordinate axes are now exactly equal (to 1.0). Similarly, in higher dimensions the standardized point cloud will have a non-spherical shape even though the variances along all axes are exactly equal (to 1.0). The principal axes (with their corresponding eigenvalues) describe that shape. Another way to understand this is to note that all the rescaling and shifting that goes on when standardizing the variables occurs only in the directions of the coordinate axes and not in the principal directions themselves.

[Figure: scatterplots of the raw data (left) and the standardized data (right), each shown with its principal axes]

What is happening here is geometrically so intuitive and clear that it would be a stretch to characterize this as a "black-box operation": on the contrary, standardization and PCA are some of the most basic and routine things we do with data in order to understand them.
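The point that the standardized cloud still has a shape can also be checked numerically (a Python/NumPy sketch with made-up data, not part of the original answer): after standardization every axis has variance exactly 1, yet the eigenvalues of the standardized cloud remain far from equal.

```python
import numpy as np

# Two strongly correlated made-up variables on very different scales
rng = np.random.default_rng(2)
n = 500
common = rng.normal(size=n)
X = np.column_stack([5.0 * common + rng.normal(size=n),
                     0.1 * common + 0.03 * rng.normal(size=n)])

# Standardize: now the variance along each coordinate axis is exactly 1
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Yet the principal-axis variances (eigenvalues) are very unequal,
# because the cloud is elongated along the diagonal, not along the axes
evals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))
print(Z.std(axis=0, ddof=1))  # both 1
print(evals)                  # one large, one small eigenvalue
```

Equal variances along the coordinate axes do not mean a spherical cloud; the correlations still give it a direction of maximum extent.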


Continued by @ttnphns

When would one prefer to do PCA (or factor analysis or another similar type of analysis) on correlations (i.e. on z-standardized variables) instead of on covariances (i.e. on centered but unstandardized variables)?

  1. When the variables are in different units of measurement. That's clear.
  2. When one wants the analysis to reflect linear associations only. Pearson r is not merely the covariance between uniscaled (variance = 1) variables; it is a measure of the strength of the linear relationship, whereas the usual covariance coefficient is receptive to both linear and monotonic relationships.
  3. When one wants the associations to reflect relative co-deviatedness (from the mean) rather than raw co-deviatedness. Correlation is based on the distributions and their spreads, while covariance is based on the original measurement scale. If I were to factor-analyze patients' psychopathological profiles as assessed by psychiatrists on some clinical questionnaire consisting of Likert-type items, I'd prefer covariances, because the professionals are not expected to distort the rating scale intrapsychically. If, on the other hand, I were to analyze the patients' self-portraits obtained with that same questionnaire, I'd probably choose correlations, because a layman's assessment is expected to be relative to "other people", "the majority", "permissible deviation", or a similar implicit das Man loupe which "shrinks" or "stretches" the rating scale for each person.
ttnphns
  • 1
    Sorry, but this bothers me a lot. To an external individual, standardization is a black-box operation, part of the PCA pre-conditioning of the data (as in ICA). He wants one answer for his (raw) input data, especially if it relates to physical (dimensioned) data for which the PCA output needs to be interpreted physically (i.e., in terms of the unstandardized variables) as well. – Lucozade Jun 27 '13 at 09:29
  • My understanding is that PCA maximizes variance (Jolliffe, p. 2); covariance and correlation (do they have directions??) are not a primary concern or target; they are removed by diagonalization of the correlation/covariance matrix anyway. If you take away the inequality of variances that defines the shape of the cloud, how can one still claim to find its direction(s) of maximum extent? – Lucozade Jun 27 '13 at 09:37
  • "PCA maximizes variance; covariance and correlation... are not a primary concern": PCA maximizes the overall multivariate variance, i.e. variance + covariance. The shape of a data cloud lying in concrete dimensions (the variables) is described ("defined") by the variance-covariance matrix. If the variances along those dimensions are forced to be all equal, the shape changes, but it can still remain ellipsoid and worth PC-analysing. – ttnphns Jun 27 '13 at 14:34
  • @Lucozade: point taken, and I am sorry for rudely not reading your question fully. I am not sure why standardization of variables is a "black-box operation", as it is simply centering and rescaling. Would you be willing to offer an example of how a correlation-based PCA could be difficult to interpret relative to a covariance-based PCA? It seems to me that if you have issues with the standardization, you have the choice not to standardize. The first step when considering which PCA to use is to decide whether or not it is appropriate/required to standardize. – Patrick Jun 27 '13 at 20:35
  • Not sure if you use R, but this is a decent example, I think, of why/when standardizing is necessary: `data(mtcars); head(mtcars); biplot(prcomp(mtcars)); biplot(prcomp(mtcars, scale = TRUE))` (note the relatively large values for "disp" and "hp"). – Patrick Jun 27 '13 at 20:36
  • `mtcars["PC1"] <- prcomp(mtcars[, c(3, 4, 5, 6, 7)], scale = TRUE)$x[, 1]; with(mtcars, plot(mpg ~ PC1))` ... as an example of how correlation-based PCA does make sense and can be useful. – Patrick Jun 27 '13 at 20:56
  • @Patrick: in short, my application is about finding a set of optimum spatial positions of a machine, whose performance is measured in terms of a voltage that each position produces under different observations (parameter variation). The data have a large dynamic range (noise-like), though the values are not considered outliers. If I run corr_PCA (= on standardized voltages), it recommends different positions from those obtained with cov_PCA (= on centered voltages). I checked both solutions; cov_PCA gives better performance than corr_PCA. If I unstandardize the corr_PCA solution, it gives the cov_PCA one. – Lucozade Jun 28 '13 at 10:13
  • Thank you all for your comments and inputs; very inspiring. I have highlighted a third part, to steer discussion back to the key issue (for me): corr_PCA vs. cov_PCA. @Gael: point taken; when I said focus on variance, indeed it does not mean ignoring covariance. – Lucozade Jun 28 '13 at 10:15
  • 1
    Your latest revision appears to be a re-assertion that "covariance based PCA is the only truly correct one". As the entirety of the responses so far are in essence "No; wrong way to think about it; and here's why" it is difficult to know how you expect to steer discussion against such overwhelming disagreement. – Nick Cox Jun 28 '13 at 10:22
  • @ttnphns: thanks for the picture, which is indeed the classic way of looking at the effects of scaling on PCA (and on the slope coefficient in linear regression). But I respectfully disagree with your text: (i) each individual PC direction should have full meaning on its own, not require combining two or more to get a shape. E.g., if you are only interested in the 1 dominant PC, you are looking for a direction, not an ellipse. (ii) if scaling acts in the directions of the coordinate axes, it also affects the PC directions themselves, because the coordinates influence the PC loadings. – Lucozade Jun 28 '13 at 10:24
  • @Nick: no, I am well prepared to be proven wrong. But I am looking for a (to me) convincing argument or demonstration that corr_PCA really does/can give an optimum solution over and above that obtained with cov_PCA -- or indeed vice versa. I know how to convert between the two solutions. – Lucozade Jun 28 '13 at 10:34
  • 4
    @Lucozade: I was confused about your description of your application:- How is PCA recommending anything? How did you measure performance? Similarly for your last comment:- The optimum for what? – Scortchi - Reinstate Monica Jun 28 '13 at 11:56
  • 5
    @Lucozade: Indeed, listen please what Scortchi said, you seem to continue chasing down spooks. PCA is simply a special form of rotating data in space. It always does optimally what it does with the input data. The cov-corr dilemma is a pragmatic one, rooted in data pre-processing and being solved at that level, not at the PCA level. – ttnphns Jun 28 '13 at 12:25
  • 1
    @Lucozade: It would be my (non-expert) opinion, based on your reply to me, that for your specific need you are right to want cov-based PCA. Again, your variables are all homogeneous in terms of data/measurement type (same machine type, and all data in volts). To me your example is clearly a case where cov-PCA is correct, but please note that this is not always so, and I think that is the important point of this whole thread (the choice of cor vs. cov is case-specific and needs to be determined by the person who understands the data & application best). Good luck with your research! – Patrick Jun 28 '13 at 12:55
  • "How is PCA recommending anything?" I was thinking this same thing, my guess is the machines are ordered along PC1, and this is what Lucozade wants to use to "find an optimum spatial positioning for his machines" (paraphrased) – Patrick Jun 28 '13 at 13:15
  • Why is "unstandardization afterward mandatory"? Why would this be true for descriptive/visualization purposes with heterogeneous data types? – Patrick Jun 28 '13 at 19:08