Why does ridge estimate become better than OLS by adding a constant to the diagonal?

Question

I understand that the ridge regression estimate is the $\beta$ that minimizes residual sum of square and a penalty on the size of $\beta$

$$\beta_\mathrm{ridge} = (\lambda I_D + X'X)^{-1}X'y = \operatorname{argmin}\big[ \text{RSS} + \lambda \|\beta\|^2_2\big]$$

However, I don't fully understand the significance of the fact that $\beta_\text{ridge}$ differs from $\beta_\text{OLS}$ by only adding a small constant to the diagonal of $X'X$. Indeed,

$$\beta_\text{OLS} = (X'X)^{-1}X'y$$

My book mentions that this makes the estimate more stable numerically -- why?
Is numerical stability related to the shrinkage towards 0 of the ridge estimate, or it's just a coincidence?

Glen_b · Accepted Answer · 2018-02-11T01:22:11.057

93

In an unpenalized regression, you can often get a ridge* in parameter space, where many different values along the ridge all do as well or nearly as well on the least squares criterion.

* (at least, it's a ridge in the likelihood function -- they're actually valleys$ in the RSS criterion, but I'll continue to call it a ridge, as this seems to be conventional -- or even, as Alexis points out in comments, I could call that a thalweg, being the valley's counterpart of a ridge)

In the presence of a ridge in the least squares criterion in parameter space, the penalty you get with ridge regression gets rid of those ridges by pushing the criterion up as the parameters head away from the origin:

enter image description here
[Clearer image]

In the first plot, a large change in parameter values (along the ridge) produces a miniscule change in the RSS criterion. This can cause numerical instability; it's very sensitive to small changes (e.g. a tiny change in a data value, even truncation or rounding error). The parameter estimates are almost perfectly correlated. You may get parameter estimates that are very large in magnitude.

By contrast, by lifting up the thing that ridge regression minimizes (by adding the $L_2$ penalty) when the parameters are far from 0, small changes in conditions (such as a little rounding or truncation error) can't produce gigantic changes in the resulting estimates. The penalty term results in shrinkage toward 0 (resulting in some bias). A small amount of bias can buy a substantial improvement in the variance (by eliminating that ridge).

The uncertainty of the estimates are reduced (the standard errors are inversely related to the second derivative, which is made larger by the penalty).

Correlation in parameter estimates is reduced. You now won't get parameter estimates that are very large in magnitude if the RSS for small parameters would not be much worse.

edited Feb 11 '18 at 01:22

answered Oct 11 '14 at 20:03

Glen_b

282,281

You're right that ridge regression shrinks estimates but don't make them sparse. I thus edited my question to ask whether numerical stability is related to this shrinkage. 1) Could you explain more why adding a constant term to the diagonal = "get rid of those ridges"?

Heisenberg

Oct 11 '14 at 20:08

I've drawn a diagram in parameter space that should help clarify what the impact is. – Glen_b Oct 11 '14 at 22:39

1

Nice plots! Did you use rgl? How did you get the arrows & text in there, did you just add them on top of a png? – gung - Reinstate Monica Oct 11 '14 at 23:09

1

@gung I just put them in later; lazy but fast. Yes, rgl. – Glen_b Oct 11 '14 at 23:12

5

This answer really helps me understand shrinkage and numerical stability. However, I'm still unclear about how "adding a small constant to $X'X$" achieves these two things. – Heisenberg Oct 12 '14 at 16:45

4

Adding a constant to the diagonal* is the same as adding a circular paraboloid centered at $0$ to the RSS (with the result shown above - it "pulls up" away from zero - eliminating the ridge). $\quad\quad$ *(it's not necessarily small, it depends on how you look at it and how much you added) – Glen_b Oct 12 '14 at 17:05

7

Glen_b the antonym of "ridge" in the English language that you are looking for (that path/curve along a valley floor) is thalweg. Which I just learned about two weeks ago and simply adore. It doesn't even sound like an English word! :D – Alexis Oct 12 '14 at 18:01

5

@Alexis That would no doubt be a handy word, so thanks for that. It probably doesn't sound English because it's a German word (indeed the thal is the same 'thal' as in "Neanderthal" = "Neander valley", and weg ='way'). [As it was, I wanted "ridge" not because I couldn't think of what to call it, but because people seem to call it a ridge whether they're looking at likelihood or RSS, and I was explaining my desire to follow the convention, even though it seems odd. Thalweg would be an excellent choice for just the right word, were I not following the odd thalweg of convention.] – Glen_b Oct 12 '14 at 21:28

@Glen_b Thank you for your help so far. I understand why "adding a circular paraboloid centered at 0 to the RSS" (which is the same as adding $\lambda ||\beta||_2^2$ to the objective function) works. I can also derive algebraically why that is equivalent to adding a constant to the diagonal of $X'X$ in the minimizer. However, what I'm hoping for is a direct geometric interpretation of adding a constant to the diagonal of $X'X$. For example, OLS does not work (well) when $X'X$ is not (almost not) full-rank. Could this be related? – Heisenberg Oct 13 '14 at 16:21

5

X becomes close to a matrix not of full rank (and hence X'X becomes nearly singular) exactly when a ridge appears in the likelihood. The ridge is a direct consequence of a nearly linear relationship between columns of $X$, which makes $\beta$s (nearly) linearly dependent. – Glen_b Oct 13 '14 at 20:23

4

Very nice answer. Perhaps one could also add that by introducing a little bias we make the variance of the OLS estimator lower. – JohnK Dec 12 '14 at 00:48

Wow, really nice answer for the term "stable", never thought this way. Thanks!! – Haitao Du Aug 09 '16 at 14:07

@JohnK I added a few words to discuss bias more directly – Glen_b Aug 09 '16 at 19:50

@Glen_b thanks for your explanation! Can I ask you to send me the matlab code that you used for plotting the two surfaces above? Since I'm not sure the meaning of each axis. It will help me to understand why Ridge is sometimes needed instead of Least Squares. – David Feb 13 '19 at 12:46

It was done in R, not matlab. The vertical axis is the sums of squares of residuals, the horizontal axes are the parameter estimates (not counting the constant-term). Each axis is labelled in the diagram. – Glen_b Feb 13 '19 at 21:40

score 40 · Answer 2 · answered Feb 19 '16 at 02:51

+1 on Glen_b's illustration and the stats comments on the Ridge estimator. I would just like to add a purely mathematical (linear algebra) pov on Ridge regression which answers OPs questions 1) and 2).

First note that $X'X$ is a $p \times p$ symmetric positive semidefinite matrix - $n$ times the sample covariance matrix. Hence it has the eigen-decomposition

$$ X'X = V D V', \quad D = \begin{bmatrix} d_1 & & \\ & \ddots & \\ & & d_p \end{bmatrix}, d_i \geq 0 $$

Now since matrix inversion corresponds to inversion of the eigenvalues, the OLS estimator requires $(X'X)^{-1} = V D^{-1} V'$ (note that $V ' = V^{-1}$). Obviously this only works if all eigenvalues are strictly greater than zero, $d_i > 0$. For $p \gg n$ this is impossible; for $n \gg p$ it is in general true - this is were we are usually concerned with multicollinearity.

As statisticians we also want to know how small perturbations in the data $X$ change the estimates. It is clear that a small change in any $d_i$ leads to huge variation in $1 / d_i$ if $d_i$ is very small.

So what Ridge regression does is move all eigenvalues further away from zero as

$$ X'X + \lambda I_p = V D V' + \lambda I_p = V D V' + \lambda V V' = V (D + \lambda I_p) V', $$ which now has eigenvalues $d_i + \lambda \geq \lambda \geq 0$. This is why choosing a positive penalty parameter makes the matrix invertible -- even in the $p \gg n$ case. For Ridge regression a small variation in the data $X$ does not have anymore the extremely unstable effect it has on the matrix inversion.

The numerical stability is related to shrinkage to zero as they both are a consequence of adding a positive constant to the eigenvalues: it makes it more stable because a small perturbation in $X$ does not change the inverse too much; it shrinks it close to $0$ since now the $V^{-1} X'y$ term is multiplied by $1 / (d_i + \lambda)$ which is closer to zero than the OLS solution with inverse eigenvalues $1 / d$.

This answers satisfactorily answers the algebra part of my question! Together with Glen_b answer it makes a full explanation of the issue. — Heisenberg, Feb 29 '16 at 16:17
Could someone please clarify why in the case of p>>n it is impossible that the eigenvalues are more than zero? Or more generally, how do the dimensions of a matrix effect the sign of the eigenvalues? — Sean, May 20 '20 at 21:41

score 20 · Answer 3 · edited Mar 17 '20 at 06:06

@Glen_b's demonstration is wonderful. I would just add that aside from the exact cause of the problem and description about how quadratic penalized regression works, there is the bottom line that penalization has the net effect of shrinking the coefficients other than the intercept towards zero. This provides a direct solution to the problem of overfitting that is inherent in most regression analyses when the sample size is not enormous in relation to the number of parameters to be estimated. Almost any penalization towards zero for non-intercepts is going to improve predictive accuracy over an un-penalized model.

Why does ridge estimate become better than OLS by adding a constant to the diagonal?

3 Answers3

Linked

Related