Various sources state that when two or more variables are highly correlated, this causes problems for PCA. However, the literature I've read does not say how high the correlation needs to be in order to be too high.

The suggested solution is to remove variables that are highly correlated. My question is: how much correlation is enough to cause problems in PCA? In other words, when is correlation too high?

EDIT: I am asking in relation to ordination methods / multivariate statistical analysis, where the objective is to find the directions of most change in the data and the relations between variables.

Cordex
  • Perfectly correlated variables do not pose any threat to PCA, they may be a threat to regression models for the induced multicollinearity (as answered by R. Hardy) – utobi Apr 16 '23 at 17:34
  • @utobi, do you indeed mean perfectly? – Richard Hardy Apr 16 '23 at 17:40
  • @RichardHardy Yes, perfectly correlated variables is supposed to be the (long) subject of that sentence. – utobi Apr 16 '23 at 17:43
  • @utobi, hm, OK, but then the weights of the linear combinations for obtaining the PCs from the original variables are not uniquely determined, or are they? And if they are not, I would regard that as "trouble" for the PCA. – Richard Hardy Apr 16 '23 at 17:52
  • @RichardHardy yes but the loadings are not fully uniquely determined anyway since -1*old_loadings are still valid PCA loadings. – utobi Apr 16 '23 at 17:59
  • Perhaps you could include an example quotation of the literature that says highly correlated variables are a problem for PCA. It's hard to understand that statement in a vacuum. – Sycorax Apr 16 '23 at 18:11
  • See https://stats.stackexchange.com/questions/50537, which concerns how to treat groups of highly correlated variables in PCA (of which perfect correlation is a special case). – whuber Apr 16 '23 at 18:13
  • This claim doesn't make a lot of sense to me: PCA is a method specifically for dealing with highly correlated features. – Cliff AB Apr 16 '23 at 18:33
  • Yes, "a problem" is a relative term, and it ultimately depends on what the purpose is and what is acceptable for your process. I see that the thread @whuber refers to explains the problem, and the reflections around it, about as well as it could be explained. Funny that the search on StackExchange didn't show me that highly relevant thread when I searched before posting. – Cordex Apr 18 '23 at 15:46
  • https://stats.stackexchange.com/questions/50537 – Cordex Apr 18 '23 at 15:46
  • Searching on SE sites is an arcane art ;-). Using the "site:" argument in a Google search sometimes works better. – whuber Apr 18 '23 at 15:50
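
The exchange above about perfectly correlated variables and the sign of the loadings can be checked numerically. The following is only a minimal sketch, assuming NumPy and made-up data (`x1`, `x2`, `x3` are hypothetical variables, with `x2` an exact multiple of `x1`). It shows that the covariance matrix becomes rank deficient, that the eigendecomposition still runs, and that sign-flipped loadings satisfy the same eigenvector equation.

```python
import numpy as np

# Hypothetical data: x2 is an exact multiple of x1 (perfect correlation).
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 2.0 * x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# PCA as an eigendecomposition of the sample covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# Perfect correlation makes the covariance matrix singular:
# one eigenvalue is numerically zero and the rank drops to 2.
rank = np.linalg.matrix_rank(cov)

# Sign ambiguity: -eigvecs solves cov @ v = lambda * v just as well.
flipped = -eigvecs
residual = np.abs(cov @ flipped - flipped * eigvals).max()

print("rank of covariance:", rank)
print("smallest eigenvalue:", eigvals[0])
print("max residual for flipped loadings:", residual)
```

In this toy case the nonzero eigenvalues are distinct, so the corresponding directions are determined up to sign; whether that counts as "trouble" is the point debated in the comments.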

1 Answer

PCA by itself is just an algebraic transformation. Highly but imperfectly correlated variables do not cause trouble for it. They might cause trouble in an inferential statistical analysis that follows the PCA, but that is another step of the process.
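
As an illustration of the point that highly but imperfectly correlated variables do not break the transformation itself, here is a small sketch, assuming NumPy and simulated data. The helper `pca_eig` and the variables `x1`, `x2`, `x3` are hypothetical names introduced for this example; `x2` is built to have a correlation with `x1` very close to 1.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigendecomposition of the sample covariance matrix.

    Returns eigenvalues (descending) and the matching loading vectors
    as columns.
    """
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    return eigvals[order], eigvecs[:, order]

# Simulated data: x2 is x1 plus a little noise, x3 is independent.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

corr_12 = np.corrcoef(x1, x2)[0, 1]
eigvals, loadings = pca_eig(X)
scores = (X - X.mean(axis=0)) @ loadings

print("correlation(x1, x2):", corr_12)
print("eigenvalues:", eigvals)
# The scores are uncorrelated and their variances equal the eigenvalues.
print("score covariance:\n", np.cov(scores, rowvar=False))
```

The near-duplicate pair shows up as one large and one very small eigenvalue, but the loadings remain orthonormal and the scores remain uncorrelated. Any difficulty would arise in what is done with the components afterwards.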

Richard Hardy
  • Thanks for the clarification. I forgot to mention that the purpose is statistical analysis. I've added it now. – Cordex Apr 16 '23 at 18:08
  • If PCA is not a "statistical analysis," what is it?? – whuber Apr 16 '23 at 18:14
  • @whuber, it is an algebraic manipulation. I find the first comment of this thread helpful in that regard. A statistical model that is related to PCA is the common factor model. – Richard Hardy Apr 16 '23 at 18:39
  • @Richard Almost everything performed in statistics is an "algebraic manipulation"! Moreover, because PCA is used for statistical analysis and is interpreted statistically, it seems (at best) misleading to claim some distinction based on the form of mathematics used for the analysis. – whuber Apr 16 '23 at 19:12
  • @whuber, while almost everything in statistics is an algebraic manipulation, this does not seem to be a relevant point to make. The relevant point to me is that not every algebraic manipulation constitutes statistical analysis. PCA is neither a stochastic model nor an estimator, a test, a prediction nor a sampling scheme. This is not an exhaustive list, of course, but I do not immediately see where PCA fits in the field of statistics. It does not even have assumptions (I think). – Richard Hardy Apr 16 '23 at 19:27
  • @Richard I agree that referring to "algebraic manipulation" is not relevant! By omitting any reference to interpretation you seem to adopt a narrow conception of what PCA is and, because you have notably omitted exploratory data analysis, which is one of the principal applications of PCA, it suggests you are writing from a narrow conception of what statistical analysis is, too. While not wishing to criticize those perspectives, I would like to suggest they do justice neither to PCA nor to the practice of statistics, both of which benefit from broader, rather than narrower, conceptions. – whuber Apr 16 '23 at 19:32
  • @RichardHardy I think your argument is that PCA is essentially SVD with some minor changes. That is one interpretation and in that case, there isn't even a concept of "noise", no more than for an operator like division or subtraction. However, in most use cases, the procedure is to compute the full PCA but use only the top components, under the assumption that the smaller components are driven by noise. In this use case, dealing with noise is a very definite (and very hard!) part of the procedure and so it seems to fit quite clearly into the category of a statistical method. – Cliff AB Apr 16 '23 at 19:53
  • *very hard part of the procedure to do in a well-justified manner. Very easy to deal with in an ad hoc manner: just drop the smaller components you don't want to deal with :) – Cliff AB Apr 16 '23 at 19:54
  • @whuber, I think I found a way to interpret PCA as statistical analysis that I can live with. I would classify it under descriptive statistics. I will therefore update my answer. Thank you for a thought-provoking discussion! – Richard Hardy Apr 16 '23 at 20:03
  • @CliffAB, you are likely referring to something like principal component regression (PCR) rather than PCA, and that is where the difference between statistical and non-statistical lies, I think. I mentioned the common factor model that also may rely on PCA as the first step as another similar statistical technique. Anyway, thank you for a helpful comment. I will update my answer as I noted in my last comment to whuber. – Richard Hardy Apr 16 '23 at 20:04
  • PCR is one way that this can occur, although I would say it is less common: it's very hard to interpret coefficients in PCR. Often the first few components are examined directly to give insight into common patterns in the data or fed into a clustering algorithm. With clustering, low-rank PCA is considered denoising of the data. – Cliff AB Apr 16 '23 at 20:24
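
The use of low-rank PCA as denoising mentioned in the comments above can be sketched roughly as follows. This assumes NumPy, simulated data with a known rank-`k` signal, and that `k` is known in advance, which is rarely the case in practice; all names (`latent`, `mixing`, `signal`, `noisy`) are hypothetical.

```python
import numpy as np

# Simulated data: a rank-2 signal in 10 dimensions plus isotropic noise.
rng = np.random.default_rng(2)
n, p, k = 300, 10, 2
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, p))
signal = latent @ mixing
noisy = signal + 0.1 * rng.normal(size=(n, p))

# PCA via SVD of the centered data; keep only the top k components.
mean = noisy.mean(axis=0)
U, s, Vt = np.linalg.svd(noisy - mean, full_matrices=False)
denoised = (U[:, :k] * s[:k]) @ Vt[:k] + mean

# Compare mean squared error against the noise-free signal.
err_noisy = np.mean((noisy - signal) ** 2)
err_denoised = np.mean((denoised - signal) ** 2)
print("MSE of noisy data:   ", err_noisy)
print("MSE of rank-k approx:", err_denoised)
```

Choosing how many components to keep is the hard part referred to in the comments; here it is sidestepped by construction.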