
The point of regularization methods (for example ridge regression) is to penalize large ordinary least squares estimates. We know that the variance-covariance matrix of the OLS estimates can be decomposed, using the spectral theorem, as \begin{align} \Sigma_{\hat \beta} = \sigma^2 (X^TX)^{-1} = \sigma^2 U\Lambda^{-1} U^T, \end{align} where $\Lambda$ is diagonal with the eigenvalues of $X^TX$. Obviously, if there is a high degree of multicollinearity among the predictors, then $X^TX$ is nearly singular and thus has at least one very small eigenvalue $\lambda_i$; therefore $\lambda_i^{-1}$ is huge, and so is the variance (standard error) of $\hat \beta_i$. That uncertainty in the estimate is expected. However, what is unclear to me is why multicollinearity would imply a large value of $\hat \beta_i$. What is the reason for two highly correlated variables to have coefficient estimates with large absolute values but opposite signs?
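To make the eigenvalue argument concrete, here is a minimal NumPy sketch (the near-collinear design, the noise scale 0.01, and the variable names are illustrative assumptions, not part of the original question): one eigenvalue of $X^TX$ is tiny, so its reciprocal dominates the diagonal of $(X^TX)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2])

# Spectral decomposition of X^T X: one eigenvalue is tiny
lam, U = np.linalg.eigh(X.T @ X)
print("eigenvalues of X^T X:", lam)

# (X^T X)^{-1} = U diag(1/lambda) U^T; the 1/lambda_min term dominates,
# so the (sigma^2-unscaled) variances of the coefficient estimates blow up
cov_unscaled = U @ np.diag(1.0 / lam) @ U.T
print("diag of (X^T X)^{-1}:", np.diag(cov_unscaled))
```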

  • (By "large value of the $\hat\beta_i$" I presume you mean relatively large standard error of estimate.) Have you searched our site for answers, such as https://stats.stackexchange.com/a/104792/919? I'm sure there are many posts here on CV that point out collinearity implies some of the $\beta_i$ cannot even be identified: they could have literally any value. Any reasonable estimate of their standard error would have to be arbitrarily large. – whuber Dec 07 '23 at 20:12
  • @whuber Thanks! I've seen a couple of these questions. However, I deliberately asked about the large value of the estimator itself. I happened to see some examples where, e.g., collinearity between two IVs caused the estimates of their respective slopes to be large in absolute value, but with opposite signs. I simply cannot see why the uncertainty about the estimates would imply that we want to penalize large values of $\beta$. – Adam Bogdański Dec 07 '23 at 20:27
  • I don't understand what you're trying to get at. When an estimate can have an arbitrary value, nothing prevents it from being arbitrarily large. Simple example: estimate the mean weight of adult male homo sapiens in Kg using the model $\text{weight}=\beta_1+\beta_2+\text{error}.$ There's nothing to rule out, say, $\beta_1=10^{300}$ and $\beta_2 = 100-10^{300}$ (approximately). – whuber Dec 07 '23 at 22:04
  • @whuber, not sure, but the question might be: are large estimates the sole and exclusive manifestation of ill-posed/numerically unstable problems? It seems they aren't, since there might be naturally big coefficients, or numerical instability might produce arbitrarily small estimates. If so, why shrink coefficients? – forveg Dec 09 '23 at 14:51
  • My partial take on this is that ridge regression intentionally accepts increased bias in exchange for smaller variance. It's an assumption of the model, not a universal principle, and in some cases it could be undesirable. The smaller variance is achieved via shrinkage, which is not applied uniformly but depends on the eigenvalues, so that, in fact, the greatest relative shrinkage is applied to the eigenvectors corresponding to the smallest eigenvalues, i.e. the directions of smallest variance (see the numeric sketch after these comments). – forveg Dec 09 '23 at 15:13
  • I appreciate the comments. For reference: "When there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance. A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin.", The Elements of Statistical Learning (2nd edition), p. 63. – Adam Bogdański Dec 09 '23 at 17:54
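A minimal numeric sketch of the shrinkage pattern mentioned in the comments above (the eigenvalues and the penalty $k$ are arbitrary illustrative numbers): in the eigenbasis of $X^TX$, each component of the ridge estimate equals the corresponding OLS component times $\lambda_i/(\lambda_i + k)$, so the directions with the smallest eigenvalues are shrunk the most.

```python
import numpy as np

# Illustrative eigenvalues of X^T X and a ridge penalty k
lam = np.array([100.0, 10.0, 0.01])
k = 1.0

# Ridge shrinkage factor applied to each eigen-direction of X^T X:
# the ridge coefficient component is lam/(lam + k) times the OLS component
shrink = lam / (lam + k)
print(shrink)   # ~ [0.990, 0.909, 0.010]: the smallest eigenvalue is shrunk hardest
```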

2 Answers


If I understand your question, you are asking why you get large coefficient values with multicollinearity.

You would expect this to happen when the residuals (the part of the response left over after fitting the common, collinear direction) have a component in the direction orthogonal to the direction of multicollinearity.

Since, by the assumption of multicollinearity, the predictors' component in that orthogonal direction is small, you need large weights to achieve a significant reduction in the error.

Consider a dependent variable $y = x + z$, with independent variables
$$x_1 = x, \qquad x_2 = x + \epsilon z, \qquad \epsilon \ll 1.$$

Then we need $\beta_2 = 1/\epsilon \gg 1$, and so $\beta_1 = 1 - 1/\epsilon$, a large negative coefficient of almost the same magnitude.
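A quick numeric check of this example (a sketch; the sample size and $\epsilon = 0.01$ are arbitrary choices) with ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps = 500, 0.01
x = rng.normal(size=n)
z = rng.normal(size=n)
y = x + z                               # the dependent variable from the example

X = np.column_stack([x, x + eps * z])   # x1 = x, x2 = x + eps*z

# The least-squares fit reproduces the analytic coefficients above
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # approximately [1 - 1/eps, 1/eps] = [-99, 100]
```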

– seanv507

"What is the reason for two highly correlated variables to have coefficient estimates with large absolute values but opposite signs?"

In the illustration below you see a cloud of points simulated from the normal distributions $y_1 \sim N(5, 0.16)$ and $y_2 \sim N(0, 0.16)$. The range of the coordinates of these points is about two; e.g. $y_1$ lies between 4 and 6.

Superposed are two alternative coordinate axes, for values of $x_1$ and $x_2$, whose successive coordinate points are the same distance apart as those on the axes $y_1$ and $y_2$ (the same Euclidean distance when measured in the $y$ coordinate frame).

But because the two axes are correlated, the coordinates of the points have a much larger range: the points range from 0 to 5 in the coordinates $x_1$ and $x_2$.

[figure: correlated axes]

If $x_1$ and $x_2$ are your predictors, then the range of the coordinates is much larger than it would be for predictors that are closer to perpendicular, e.g. $z_1$ and $z_2$ in the image below.

[figure: perpendicular features]
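The same geometric point can be checked numerically. Below is a minimal sketch (the 10-degree angle between the oblique axes and the helper function are my own illustrative choices): the same cloud of points is expressed once in a pair of nearly parallel unit axes and once in an orthogonal pair, and the coordinate ranges are compared.

```python
import numpy as np

rng = np.random.default_rng(2)
pts = np.column_stack([rng.normal(5, 0.4, 500),    # y1 ~ N(5, 0.16)
                       rng.normal(0, 0.4, 500)])   # y2 ~ N(0, 0.16)

def coords(points, b1, b2):
    """Coordinates of the points in the (possibly oblique) basis {b1, b2}."""
    B = np.column_stack([b1, b2])
    return np.linalg.solve(B, points.T).T

# Nearly parallel (correlated) unit axes vs. orthogonal unit axes
theta = np.deg2rad(10)
x_coords = coords(pts, np.array([1.0, 0.0]),
                  np.array([np.cos(theta), np.sin(theta)]))
z_coords = coords(pts, np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(np.ptp(x_coords, axis=0))   # large coordinate ranges in the oblique basis
print(np.ptp(z_coords, axis=0))   # much smaller ranges in the orthogonal basis
```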