In econometric analysis, some models, such as those with interaction terms, can exhibit multicollinearity among the independent variables. In such cases, some researchers suggest a "mean-centering" strategy (subtracting the mean from the variables that appear as main effects and make up the interaction terms). In the literature, this approach is usually applied when the dependent variable is normally distributed (e.g., Paper). Does mean-centering remove multicollinearity if the variables are not normally distributed? Why is a normal distribution required?
5
-
Mean centering does not remove multicollinearity (the covariance is unchanged) – Firebug Jun 16 '23 at 06:58
-
@Firebug It changes ... See the answer by Affine here: https://stats.stackexchange.com/questions/60476/collinearity-diagnostics-problematic-only-when-the-interaction-term-is-included/61022#61022 – Sane Jun 16 '23 at 07:01
-
Footnote #12 from A caution regarding using rules of thumb for variance inflation factors might be helpful to answer your question. – medium-dimensional Jun 16 '23 at 07:03
-
Oh, I had missed the part about the interaction terms. My previous comment was wrong: the covariance between interaction terms and the original variables does change based on centering (this is easy to prove). – Firebug Jun 16 '23 at 08:32
-
Multicollinearity is a property of the design matrix $X.$ It has nothing to do with how its components might have arisen and therefore any distributional assumptions are irrelevant. – whuber Jun 16 '23 at 20:32
-
The question is not whether multicollinearity depends on the underlying distributions of the independent variables. It is whether the mean-centering approach is effective at solving the multicollinearity problem, conditional on the distribution of the independent variables. In other words, the literature has proved that mean-centering reduces the correlation between variables, and could therefore fix the problem of multicollinearity, under the assumption that the variables are normal. My question was whether the normality assumption is necessary in order to reduce the correlation. – Sane Jun 19 '23 at 06:21
-
Any analysis based on correlation is...based on correlation, and therefore depends only on the second moments of the distribution. That's why your inquiries about Normality appear so strange. – whuber Jun 19 '23 at 12:53
-
@whuber As far as I know, the proof that mean-centering reduces correlation between variables is given under the assumption of normality (e.g., see this paper: https://link.springer.com/article/10.3758/s13428-015-0624-x). I haven't come across any proof that makes a different assumption about the underlying distribution. – Sane Jun 20 '23 at 05:29
-
I cannot follow that, because mean centering is part of the very definition of correlation. Moreover, it doesn't necessarily reduce correlation. Possibly some strong distributional assumptions, along with other assumptions (concerning, for instance, how interaction variables might be calculated) could imply that result and perhaps that's what's going on in the article--I didn't take a look at it, because I don't think there's anything to learn here. – whuber Jun 20 '23 at 13:24
1 Answer
6
The correlation with interaction terms does not depend on having a normal distribution.
Here is an example based on sampling from two independent exponential distributions (so with positive means) where the correlation with the interaction term is high, but is largely removed by subtracting the means. It also shows that the correlation between the samples themselves is unaffected by subtracting the means.
set.seed(2023)
X <- rexp(10^6, 1)
Y <- rexp(10^6, 2)
mean(X)
# 0.9984457
mean(Y)
# 0.5001369
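# Raw variables: essentially uncorrelated with each other
cor(X, Y)
# -0.001121004
# Raw variables against the raw product term: strongly correlated
cor(X, X*Y)
# 0.5781138
cor(Y, X*Y)
# 0.5757716
# Centering leaves the correlation between X and Y unchanged
cor(X-mean(X), Y-mean(Y))
# -0.001121004
# Centered variables against the product of centered variables: near zero
cor(X-mean(X), (X-mean(X))*(Y-mean(Y)))
# 0.002229947
cor(Y-mean(Y), (X-mean(X))*(Y-mean(Y)))
# -0.0002722061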
You need to subtract the means before taking the product, as subtracting a constant at the end has no impact on the correlation. Compare these with the earlier results:
cor(X-mean(X), X*Y-mean(X*Y))
# 0.5781138
cor(X, (X-mean(X))*(Y-mean(Y)))
# 0.002229947
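One way to see where these numbers come from is to work out the population covariances. The sketch below assumes only that $X$ and $Y$ are independent with finite second moments; the symbols $\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2$ are notation introduced here for their means and variances, and normality is not used anywhere.

$$
\begin{aligned}
\operatorname{Cov}(X, XY) &= E[X^2 Y] - E[X]\,E[XY] = \mu_Y\left(E[X^2] - \mu_X^2\right) = \mu_Y \sigma_X^2, \\
\operatorname{Cov}\big(X-\mu_X,\ (X-\mu_X)(Y-\mu_Y)\big) &= E\big[(X-\mu_X)^2\big]\,E[Y-\mu_Y] = 0 .
\end{aligned}
$$

For the rates used above, $\mu_X = 1$, $\sigma_X^2 = 1$, $\mu_Y = 1/2$ and $\sigma_Y^2 = 1/4$, so $\operatorname{Var}(XY) = E[X^2]\,E[Y^2] - \mu_X^2 \mu_Y^2 = 2 \cdot \tfrac12 - \tfrac14 = \tfrac34$ and

$$
\operatorname{Cor}(X, XY) = \frac{\mu_Y \sigma_X^2}{\sigma_X \sqrt{\operatorname{Var}(XY)}} = \frac{1/2}{\sqrt{3/4}} = \frac{1}{\sqrt{3}} \approx 0.577 .
$$

The same calculation gives $\operatorname{Cor}(Y, XY) = 1/\sqrt{3}$ as well, in line with the simulated 0.578 and 0.576. The simulation centers with sample means rather than $\mu_X$ and $\mu_Y$, so the small nonzero values after centering are sampling noise.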
Henry
-
Thanks for this. So do you mean that, regardless of the distribution of the variables, the mean-centering strategy can be applied to remove multicollinearity? Please have a look at the answer by David B here: https://stats.stackexchange.com/questions/606803/multicollinearity-and-interaction-effects/606807#606807 . He claims that "A common reason why centering fails to remove multicollinearity is when variables are skewed or have a high kurtosis." – Sane Jun 16 '23 at 10:11
-
@Sane - in my example, the exponential distributions $X$ and $Y$ are right-skewed, as are $XY$ and $(X-\bar X)(Y-\bar Y)$, though the last of these has a slightly strangely shaped distribution. So it may be affected by particular cases. Your link says "mean-centering your variables prior to computing the interaction term will sometimes (but not always) reduce multicollinearity" – Henry Jun 16 '23 at 10:44
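The "sometimes (but not always)" point can be made concrete without assuming independence. As a sketch, write $X_c = X - \mu_X$ and $Y_c = Y - \mu_Y$ (notation introduced here, with $\mu_X$ and $\mu_Y$ the population means) and assume third moments exist. Since $E[X_c] = 0$,

$$
\operatorname{Cov}(X_c,\ X_c Y_c) = E[X_c^2 Y_c] - E[X_c]\,E[X_c Y_c] = E[X_c^2 Y_c] .
$$

This is a third-order central cross-moment. It is zero when $X$ and $Y$ are independent, whatever their marginal distributions, and also when $(X, Y)$ is jointly normal, because odd-order central moments of a multivariate normal vanish. That is presumably why normality appears as a convenient sufficient condition in some proofs, although it is not a necessary one. For dependent, skewed variables the cross-moment can be nonzero, in which case centering need not eliminate the correlation with the product term.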