Questions tagged [ridge-regression]

A regularization method for regression models that shrinks coefficients towards zero.

Ridge regression is a technique that penalizes the size of regression coefficients in order to deal with multicollinear variables or ill-posed statistical problems. It is based on Tikhonov regularization, named after the mathematician Andrey Tikhonov.

Given a set of training data $(x_1,y_1),...,(x_n,y_n)$ where $x_i \in \mathbb{R}^{J}$, the estimation problem is:

$$\min_\beta \sum\limits_{i=1}^{n} (y_i - x_i'\beta)^2 + \lambda \sum\limits_{j=1}^J \beta_j^2$$

for which the solution is given by

$$\widehat{\beta}_{ridge} = (X'X + \lambda I)^{-1}X'y$$

which is similar to the OLS estimator but includes the tuning parameter $\lambda$ and the Tikhonov matrix (here $I$, the identity matrix, although other choices are possible). Note that, unlike in OLS, the matrix $X'X + \lambda I$ is invertible for any $\lambda > 0$, even when there are more parameters in the model than observations, so the estimation problem always has a unique solution.
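
As a quick illustration, here is a minimal sketch of the closed-form estimator in NumPy (the data, dimensions, and the value of $\lambda$ below are made up purely for illustration):

```python
import numpy as np

# Simulated data purely for illustration
rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 1.0  # tuning parameter lambda (arbitrary here)

# Ridge estimator: (X'X + lambda*I)^{-1} X'y.
# Solving the linear system is cheaper and more stable than forming the inverse.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lambda = 0 recovers OLS (when X'X is invertible)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```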

Bayesian derivation

Ridge regression is equivalent to maximum a posteriori (MAP) estimation in Bayesian linear regression with a Normal prior on $\beta$. Define the likelihood:

$$L(X,Y;\beta,\sigma^2) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - \beta^Tx_i)^2}{2\sigma^2}}$$

and a Normal prior on $\beta$ with mean $0$ and covariance matrix $\alpha I_p$:

$$\beta \sim N(0,\alpha I_p)$$

Using Bayes' rule, we obtain the posterior distribution:

$$P(\beta | X,Y) \propto L(X,Y;\beta,\sigma^2)\,\pi(\beta)$$ $$\propto \Big[\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - \beta^Tx_i)^2}{2\sigma^2}}\Big]e^{-\frac12\beta^T(\alpha I_p)^{-1}\beta}$$

Maximizing the posterior is equivalent to minimizing the negative log posterior. Up to additive constants and a positive scale factor,

$$-\log P(\beta | X,Y) \propto \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \beta^Tx_i)^2 + \frac{1}{\alpha}\beta^T\beta \propto \sum_{i=1}^{n}(y_i - \beta^Tx_i)^2 + \frac{\sigma^2}{\alpha}\sum_{j=1}^{p}\beta_j^2$$

Minimizing the last expression over $\beta$ is exactly the ridge problem above, with tuning parameter $\lambda = \frac{\sigma^2}{\alpha}$: a smaller prior variance $\alpha$ corresponds to stronger shrinkage.
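
As a rough numerical check of this correspondence (a sketch with simulated data, where $\sigma^2$ and $\alpha$ are treated as known), minimizing the negative log posterior directly recovers the closed-form ridge solution with $\lambda = \sigma^2/\alpha$:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data; sigma^2 and alpha are treated as known here
rng = np.random.default_rng(1)
n, p = 40, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

sigma2, alpha = 0.25, 2.0   # noise variance and prior variance
lam = sigma2 / alpha        # implied ridge penalty

def neg_log_posterior(beta):
    # Up to additive constants: RSS/(2*sigma^2) + ||beta||^2/(2*alpha)
    return np.sum((y - X @ beta) ** 2) / (2 * sigma2) + beta @ beta / (2 * alpha)

beta_map = minimize(neg_log_posterior, np.zeros(p)).x
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.max(np.abs(beta_map - beta_ridge)))  # small, up to optimizer tolerance
```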

The tuning parameter $\lambda$ determines the degree of shrinkage of the regression coefficients. The idea is to accept some bias in order to reduce the variance (see the bias-variance trade-off). When the predictors are highly multicollinear, trading a small increase in bias for a large reduction in variance can have a substantial effect on mean squared error.

The bias of the ridge regression estimator is $$Bias(\widehat{\beta}) = -\lambda (X'X + \lambda I)^{-1} \beta$$ It is always possible to find $\lambda$ such that the MSE of the ridge regression estimator is smaller than that of the OLS estimator.
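
The following simulation sketch illustrates that claim (the correlated design, true coefficients, and grid of $\lambda$ values are arbitrary choices for the example): it compares the estimation MSE of ridge and OLS over a few values of $\lambda$.

```python
import numpy as np

# Monte Carlo comparison of estimation MSE: ridge vs OLS (lambda = 0)
rng = np.random.default_rng(2)
n, p = 30, 8
Sigma = 0.9 * np.ones((p, p)) + 0.1 * np.eye(p)   # highly correlated predictors
beta_true = rng.normal(size=p)
lambdas = [0.0, 0.1, 1.0, 10.0]                    # lambda = 0 is OLS
reps = 500
mse = {lam: 0.0 for lam in lambdas}

for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta_true + rng.normal(size=n)
    for lam in lambdas:
        b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        mse[lam] += np.sum((b - beta_true) ** 2) / reps

print(mse)  # typically some lambda > 0 has lower estimation MSE than lambda = 0
```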

Note that as $\lambda \rightarrow 0$, $\widehat{\beta}_{ridge} \rightarrow \widehat{\beta}_{OLS}$, and as $\lambda \rightarrow \infty$, $\widehat{\beta}_{ridge} \rightarrow 0$. The choice of $\lambda$ is therefore important. Common methods for this choice include information criteria (AIC or BIC) and (generalized) cross-validation.
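
For example, here is a sketch of selecting $\lambda$ by cross-validation with scikit-learn (the data and the candidate grid are arbitrary; scikit-learn calls the penalty `alpha`):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Simulated data purely for illustration
rng = np.random.default_rng(3)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Candidate penalties on a log-spaced grid
alphas = np.logspace(-3, 3, 50)

# cv=5 performs 5-fold cross-validation; leaving cv unset uses an
# efficient leave-one-out scheme instead.
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print(model.alpha_)  # selected regularization strength
```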

789 questions
17 votes, 1 answer

Lagrangian relaxation in the context of ridge regression

In "The Elements of Statistical Learning" (2nd ed), p63, the authors give the following two formulations of the ridge regression problem: $$ \hat{\beta}^{ridge} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^N(y_i-\beta_0-\sum_{j=1}^p…
NPE
17 votes, 3 answers

Implementing ridge regression: Selecting an intelligent grid for $\lambda$?

I'm implementing Ridge Regression in a Python/C module, and I've come across this "little" problem. The idea is that I want to sample the effective degrees of freedom more or less equally spaced (like the plot on page 65 of the "Elements of…
Néstor
11 votes, 2 answers

How to calculate regularization parameter in ridge regression given degrees of freedom and input matrix?

Let A be an $n \times p$ matrix of independent variables and B be the corresponding $n \times 1$ matrix of dependent values. In ridge regression, we define a parameter $\lambda$ so that: $\beta=(A^\mathrm{T}A+\lambda I)^{-1}A^\mathrm{T}B$. Now…
Amit
10 votes, 2 answers

Understanding ridge regression results

I am new to ridge regression. When I applied linear ridge regression, I got the following results: >myridge = lm.ridge(y ~ ma + sa + lka + cb + ltb , temp, lamda = seq(0,0.1,0.001)) > select(myridge) modified HKB estimator is 0.5010689 modified…
samarasa
4 votes, 2 answers

Iterative method to find Ridge Regression Parameter

I have seen a method whereby instead of trying to estimate the ridge parameter (k) directly from the data (using one of the many many ridge parameter estimators in the literature) you solve for it iteratively. The method is simple enough: You simply…
Baz
3 votes, 1 answer

Details about Ridge regression

I have a question about the mathematical details of Ridge Regression and I have not been able to find a detailed explanation. From what I know, ridge regression uses a penalty term to penalize the parameters of a linear regression model…
Layla
3 votes, 1 answer

How exactly to compute the ridge regression penalty parameter given the constraint?

The accepted answer in this thread does a great job of showing that there is a one-to-one correspondence between $c$ and $\lambda$ in the two formulations of the ridge regression: $$ \underset{\beta}{min}(y-X\beta)^T(y-X\beta) +…
generic_user
3 votes, 1 answer

Interpreting / Understanding VIF in ridge regression

I used ridge regression in order to deal with multicollinearity, but there is something that I do not understand. I used the Stata command: -ridgereg y x1 x2 x3 x4 x5 x6 x7 x8 x9, model(orr|grr1|grr2|grr3) diag lmcol- In addition, in the orr model I…
Ant
2 votes, 0 answers

Does the test error for ridge regression include the regularization term or not?

When you compute the test error for ridge regression, is it typically computed with the regularization term in it?
2 votes, 1 answer

Relations between regularization constant and parameter space in ridge regularization

Can it be shown that $||w^*||_2$, where $w^* = (X'X+\lambda I)^{-1}X'Y$, is inversely affected by the regularization constant $\lambda$, i.e. that it is $O(\frac{1}{\lambda})$?
Maverick Meerkat
2 votes, 1 answer

Why can't Ridge Regression benefit from negative lambda?

In ridge regression, we generally set a positive lambda for regularization to get a smaller residual. Why can't we have a negative lambda in the regularization if we can benefit from it?
Wu You
2 votes, 1 answer

Ridge regression - what is k=0?

I'm getting to know ridge regression and want to check my understanding quickly. I understand that k is the shrinkage parameter. If I'm reading off coefficients where k=0, is that equivalent to an OLS linear regression? Or have I got totally the…
1 vote, 1 answer

How does ridge regression solve the multidimensionality problem if it doesn't assign zero to some coefficients

I want to understand how ridge regression solves the multidimensionality problem (when the number of X variables is higher than the number of observations). It shrinks the coefficients by introducing bias via the lambda term. It is clear…
mgdata
0 votes, 1 answer

Why doesn't $\lambda=1$ in ridge regression?

Take traditional Ridge regression, $$ Y_i = \sum_{j=0}^m \beta_{j} X_{i,j} + \epsilon_i $$ we minimize $$ L_{ridge} = \arg \min_\hat{\beta}(\lambda||\beta||_2^2 + ||\epsilon||^2)$$ where $\lambda$ is the regularization penalty. Suppose instead our…
dashnick
0 votes, 1 answer

Ridge regression derivation from Murphy Machine Learning

Ridge regression, used to prevent overfitting, penalizes the coefficients $w_i$ of linear regression if they are too large. It is the solution to the problem $$\arg\max_\textbf w \sum_{i=1}^N \ln \mathcal N(y_i|w_0+\textbf w^T\textbf x_i,…
user308286