36

In principal component analysis (PCA), one can choose either the covariance matrix or the correlation matrix to find the components (from their respective eigenvectors). These give different results (PC loadings and scores), because the two matrices do not share the same eigenvectors. My understanding is that this is caused by the fact that a raw data vector $X$ and its standardization $Z$ cannot be related via an orthogonal transformation. Mathematically, similar matrices (e.g. those related by an orthogonal transformation) have the same eigenvalues, but not necessarily the same eigenvectors.
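A small numeric sketch (in Python/NumPy, with made-up data — not part of the original post) illustrates that the two eigenbases generally differ:

```python
import numpy as np

# Made-up correlated data with very different variances per column
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.5], [0.5, 1.0]])

cov = np.cov(X, rowvar=False)        # covariance matrix
corr = np.corrcoef(X, rowvar=False)  # correlation matrix

_, vec_cov = np.linalg.eigh(cov)
_, vec_corr = np.linalg.eigh(corr)

# The leading eigenvectors point in visibly different directions:
# their absolute dot product is well below 1 (it would be 1 if they agreed).
print(abs(vec_cov[:, -1] @ vec_corr[:, -1]))
```

(In two dimensions the effect is especially stark: a 2x2 correlation matrix always has its eigenvectors at 45 degrees to the axes, regardless of the data, while the covariance eigenvectors depend on the variances.)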

This raises some difficulties in my mind:

  1. Does PCA actually make sense, if you can get two different answers for the same starting data set, both trying to achieve the same thing (finding the directions of maximum variance)?

  2. When using the correlation matrix approach, each variable is being standardized (scaled) by its own individual standard deviation before the PCs are calculated. How, then, does it still make sense to find the directions of maximum variance if the data have already been scaled/compressed differently beforehand? I know that correlation-based PCA is very convenient (standardized variables are dimensionless, so their linear combinations can be added; other advantages are also based on pragmatism), but is it correct?

It seems to me that covariance-based PCA is the only truly correct one (even when the variances of the variables differ greatly), and that whenever this version cannot be used, correlation-based PCA should not be used either.

I know that there is this thread: PCA on correlation or covariance? -- but it seems to focus only on finding a pragmatic solution, which may or may not also be an algebraically correct one.

ttnphns
Lucozade
    I'm going to be honest and tell you I quit reading your question at some point. PCA makes sense. Yes, the results may differ depending on whether you use the correlation or the variance/covariance matrix. Correlation-based PCA is preferred when your variables are measured on different scales but you don't want that to dominate the outcome. Imagine you have a series of variables that range from 0 to 1 and then some that have very large values (relatively speaking, like 0 to 1000); the large variance associated with the second group of variables will dominate. – Patrick Jun 27 '13 at 00:30
  • I changed the title, to mark the difference with previous questions on the topic. I hope the new title is OK. – Gala Jun 27 '13 at 07:09
  • 1
    @Patrick: (1) please read the full question before answering, as a courtesy & generally sensible approach. (2) Your example illustrates the point: if I convert the [0,1000] interval to dBA or any log scale, the data now range from $-\infty$ to 30, i.e., the values originally close to zero (say, 0.001) are stretched and end up much further from the new (log) center than the original 1000 does. Scaling (including dividing by the individual s.d.) enables data points -- particularly outliers -- to be moved almost anywhere. This is the case even if all variables are measured on the same scale. – Lucozade Jun 27 '13 at 09:23
  • 4
    But that's the case with many other techniques as well and I think Patrick's point is reasonable. Also it was merely a comment, no need to become aggressive. Generally speaking, why would you assume that there should be one true “algebraically” correct way to approach the problem? – Gala Jun 27 '13 at 09:55
  • 1
    @Gala: Because both approaches claim to solve the same problem (see pt. 1 of my answer to ttnphns). Moreover, in e.g. linear regression, there is a set of specific conditions that must be satisfied to be able to use the method. Between cov-PCA and corr-PCA, I have not yet seen (a) clear rule(s) or division for when each of these should/should not be applied, how the two methods diverge/converge, under which conditions, etc. PS: I did not intend any aggression; on the contrary. Perhaps this rather applies to anyone who writes "I quit reading your question" but still comments nevertheless. – Lucozade Jun 27 '13 at 11:17
  • 6
    Perhaps you're thinking of PCA in the wrong way: it's just a transformation, so there's no question of its being correct or incorrect, or relying on assumptions about the data model - unlike, say, regression or factor analysis. – Scortchi - Reinstate Monica Jun 27 '13 at 11:37
  • 6
    The crux of this matter appears to rest on a misunderstanding of what standardization does and how PCA works. This is understandable, because a good grasp of PCA requires visualization of higher-dimensional shapes. I would maintain that this question, like many other questions based on some sort of misapprehension, is thereby a good one and ought to remain open, because its answer(s) can reveal truths that many people might not have fully appreciated before. – whuber Jun 27 '13 at 14:36
  • 8
    PCA does not “claim” anything. People make claims about PCA and in fact use it very differently depending on the field. Some of these uses might be silly or questionable but it does not seem very enlightening to assume that a single variant of the technique must be the “algebraically correct” one with no reference to the context or goal of the analysis. – Gala Jun 27 '13 at 20:54

3 Answers

36

I hope these responses to your two questions will calm your concern:

  1. A correlation matrix is the covariance matrix of the standardized (i.e. not just centered but also rescaled) data; that is, the covariance matrix of a different dataset, as it were. So it is natural, and it shouldn't bother you, that the results differ.
  2. Yes, it makes sense to find the directions of maximal variance in the standardized data: they are the directions of, so to speak, "correlatedness" rather than "covariatedness"; that is, the directions that remain after the effect of the original variables' unequal variances on the shape of the multivariate data cloud has been removed.
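Point 1 can be verified directly (a Python/NumPy sketch with made-up data, not from the original answer): the correlation matrix of raw data equals the covariance matrix of the z-standardized data.

```python
import numpy as np

# Made-up data whose columns live on wildly different scales
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])

# z-standardize: center each column and divide by its own standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The correlation matrix of X is (up to float error) the covariance matrix of Z
print(np.allclose(np.corrcoef(X, rowvar=False), np.cov(Z, rowvar=False)))  # True
```

So corr-PCA on `X` is literally cov-PCA on the different dataset `Z`; there is no contradiction in the two analyses disagreeing.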

The next text and pictures were added by @whuber (I thank him; also, see my comment below).

Here is a two-dimensional example showing why it still makes sense to locate the principal axes of standardized data (shown on the right). Note that in the right hand plot the cloud still has a "shape" even though the variances along the coordinate axes are now exactly equal (to 1.0). Similarly, in higher dimensions the standardized point cloud will have a non-spherical shape even though the variances along all axes are exactly equal (to 1.0). The principal axes (with their corresponding eigenvalues) describe that shape. Another way to understand this is to note that all the rescaling and shifting that goes on when standardizing the variables occurs only in the directions of the coordinate axes and not in the principal directions themselves.

[Figure: scatterplots of the raw data (left) and the standardized data (right), each shown with its principal axes]

What is happening here is geometrically so intuitive and clear that it would be a stretch to characterize this as a "black-box operation": on the contrary, standardization and PCA are some of the most basic and routine things we do with data in order to understand them.
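The point that the standardized cloud still has a shape can also be checked numerically (a Python/NumPy sketch with made-up data, not part of the original answer): after standardization every axis has variance exactly 1, yet the eigenvalues of the standardized cloud remain far from equal.

```python
import numpy as np

# Two strongly correlated made-up variables on very different scales
rng = np.random.default_rng(2)
n = 500
common = rng.normal(size=n)
X = np.column_stack([5.0 * common + rng.normal(size=n),
                     0.1 * common + 0.03 * rng.normal(size=n)])

# Standardize: now the variance along each coordinate axis is exactly 1
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Yet the principal-axis variances (eigenvalues) are very unequal,
# because the cloud is elongated along the diagonal, not along the axes
evals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))
print(Z.std(axis=0, ddof=1))  # both 1
print(evals)                  # one large, one small eigenvalue
```

Equal variances along the coordinate axes do not mean a spherical cloud; the correlations still give it a direction of maximum extent.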


Continued by @ttnphns

When would one prefer to do PCA (or factor analysis or another similar type of analysis) on correlations (i.e. on z-standardized variables) instead of on covariances (i.e. on centered but unstandardized variables)?

  1. When the variables are in different units of measurement. That's clear.
  2. When one wants the analysis to reflect linear associations only. Pearson r is not merely the covariance between uniscaled (variance = 1) variables; it is a measure of the strength of the linear relationship, whereas the usual covariance coefficient is receptive to both linear and monotonic relationships.
  3. When one wants the associations to reflect relative co-deviatedness (from the mean) rather than raw co-deviatedness. Correlation is based on the distributions and their spreads, while covariance is based on the original measurement scale. If I were to factor-analyze patients' psychopathological profiles as assessed by psychiatrists on some clinical questionnaire consisting of Likert-type items, I'd prefer covariances, because the professionals are not expected to distort the rating scale intrapsychically. If, on the other hand, I were to analyze the patients' self-portraits obtained with that same questionnaire, I'd probably choose correlations, because a layman's assessment is expected to be relative to "other people", "the majority", "permissible deviation", or a similar implicit das Man loupe which "shrinks" or "stretches" the rating scale for each person.
ttnphns
  • 1
    Sorry, but this bothers me a lot. To an external individual, standardization is a black-box operation, part of the PCA pre-conditioning of the data (as in ICA). He wants one answer for his (raw) input data, especially if it relates to physical (dimensioned) data for which the PCA output needs to be interpreted physically (i.e., in terms of the unstandardized variables) as well. – Lucozade Jun 27 '13 at 09:29
  • My understanding is that PCA maximizes variance (Jolliffe, p. 2); covariance and correlation (do they have directions??) are not a primary concern or target; they are removed by diagonalization of the correlation/covariance matrix anyway. If you take away the inequality of variances that defines the shape of the cloud, how can one still claim to find its direction(s) of maximum extent? – Lucozade Jun 27 '13 at 09:37
  • "PCA maximizes variance; covariance and correlation... are not a primary concern": PCA maximizes the overall multivariate variance, i.e. variance + covariance. The shape of a data cloud lying in concrete dimensions (the variables) is described ("defined") by the variance-covariance matrix. If the variances along those dimensions are forced to be all equal, the shape changes, but it can still remain ellipsoid and worth PC-analysing. – ttnphns Jun 27 '13 at 14:34
  • @Lucozade: point taken, and I am sorry for rudely not reading your question fully. I am not sure why standardization of variables is a "black-box operation", as it is simply centering and rescaling. Would you be willing to offer an example of how a correlation-based PCA could be difficult to interpret relative to a covariance-based PCA? It seems to me that if you have issues with the standardization, you have the choice not to standardize. The first step when considering which PCA to use is to decide whether or not it is appropriate/required to standardize. – Patrick Jun 27 '13 at 20:35
  • Not sure if you use R, but this is a decent example, I think, of why/when standardizing is necessary: `data(mtcars); head(mtcars); biplot(prcomp(mtcars)); biplot(prcomp(mtcars, scale = TRUE))` (note the relatively large values for "disp" and "hp"). – Patrick Jun 27 '13 at 20:36
  • `mtcars["PC1"] <- prcomp(mtcars[, c(3, 4, 5, 6, 7)], scale = TRUE)$x[, 1]; with(mtcars, plot(mpg ~ PC1))` ... as an example of how correlation-based PCA does make sense and can be useful. – Patrick Jun 27 '13 at 20:56
  • @Patrick: in short, my application is about finding a set of optimum spatial positions of a machine, whose performance is measured in terms of a voltage that each position produces under different observations (parameter variation). The data have a large dynamic range (noise-like), though the values are not considered outliers. If I run corr_PCA (= on standardized voltages), it recommends different positions from those obtained with cov_PCA (= on centered voltages). I checked both solutions; cov_PCA gives better performance than corr_PCA. If I unstandardize the corr_PCA solution, it gives the cov_PCA one. – Lucozade Jun 28 '13 at 10:13
  • Thank you all for your comments and inputs; very inspiring. I have highlighted a third part, to steer discussion back to the key issue (for me): corr_PCA vs. cov_PCA. @Gael: point taken; when I said focus on variance, indeed it does not mean ignoring covariance. – Lucozade Jun 28 '13 at 10:15
  • 1
    Your latest revision appears to be a re-assertion that "covariance based PCA is the only truly correct one". As the entirety of the responses so far are in essence "No; wrong way to think about it; and here's why" it is difficult to know how you expect to steer discussion against such overwhelming disagreement. – Nick Cox Jun 28 '13 at 10:22
  • @ttnphns: thanks for the picture, which is indeed the classic way of looking at the effects of scaling on PCA (and on the slope coefficient in linear regression). But I respectfully disagree with your text: (i) each individual PC direction should have full meaning on its own, not require combining two or more to get a shape. E.g., if you are only interested in the 1 dominant PC, you are looking for a direction, not an ellipse. (ii) if scaling acts in the directions of the coordinate axes, it also affects the PC directions themselves, because the coordinates influence the PC loadings. – Lucozade Jun 28 '13 at 10:24
  • @Nick: no, I am well prepared to be proven wrong. But I am looking for a (to me) convincing argument or demonstration that corr_PCA really does/can give an optimum solution over and above that obtained with cov_PCA -- or indeed vice versa. I know how to convert between the two solutions. – Lucozade Jun 28 '13 at 10:34
  • 4
    @Lucozade: I was confused about your description of your application:- How is PCA recommending anything? How did you measure performance? Similarly for your last comment:- The optimum for what? – Scortchi - Reinstate Monica Jun 28 '13 at 11:56
  • 5
    @Lucozade: Indeed, listen please what Scortchi said, you seem to continue chasing down spooks. PCA is simply a special form of rotating data in space. It always does optimally what it does with the input data. The cov-corr dilemma is a pragmatic one, rooted in data pre-processing and being solved at that level, not at the PCA level. – ttnphns Jun 28 '13 at 12:25
  • 1
    @Lucozade: It would be my (non-expert) opinion, based on your reply to me, that for your specific need you are right to want cov-based PCA. Again, your variables are all homogeneous in terms of data/measurement type (same machine type, and all data in volts). To me your example is clearly a case where cov-PCA is correct, but please note that this is not always so, and I think that is the important point of this whole thread (the choice of cor vs. cov is case-specific and needs to be determined by the person who understands the data & application best). Good luck with your research! – Patrick Jun 28 '13 at 12:55
  • "How is PCA recommending anything?" I was thinking this same thing, my guess is the machines are ordered along PC1, and this is what Lucozade wants to use to "find an optimum spatial positioning for his machines" (paraphrased) – Patrick Jun 28 '13 at 13:15
  • Why is "unstandardization afterward mandatory"? Why would this be true for descriptive/visualization purposes with heterogeneous data types? – Patrick Jun 28 '13 at 19:08