5

Is it possible to penalize coefficients toward a number other than zero in a ridge regression in R?

For example, let's say I have dependent variable Y and independent variables X1, X2, X3, and X4. Because of the multicollinear nature of the ivs, ridge regression is appropriate. But say I'm fairly certain that the coefficient of X1 is near 5, X2 is near 1, X3 is near -1, and X4 is near -5.

Is there a ridge package and method in R where I can implement penalties on the coefficients of the ivs that penalize them toward those numbers instead of 0? I'd love to see an example in R with my example data, if possible. Thank you.

James
  • 51
  • For a solution that also works with logistic regression or other generalized linear models, see here – kjetil b halvorsen Jan 04 '19 at 12:07
  • 4
    That's known as nonzero centered ridge regression and can be achieved by row-augmenting your covariate matrix X with a diagonal matrix with sqrt(lambdas) along the diagonal and adding center * sqrt(lambdas) to your outcome variable, where center is the desired target values towards which your coefficients should be shrunk. See https://www.sciencedirect.com/science/article/abs/pii/S0047259X21001603 and references therein. – Tom Wenseleers Jun 11 '23 at 07:51

3 Answers

11

A simple way to do this is to subtract the "centering value" of the coefficient times its associated variable from the left-hand side. To go with your example,

$Y = \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 + e$

Assume the coefficient values should be centered at (5,1,-1,-5) respectively. Then:

$Y - 5X_1 -X_2 +X_3 +5X_4 = (\beta_1-5)X_1 + (\beta_2-1)X_2 + (\beta_3+1)X_3 + (\beta_4+5)X_4 + e$

and, redefining terms, you have:

$Y^* = \beta_1^*X_1 + \beta_2^*X_2 + \beta_3^*X_3 + \beta_4^*X_4 + e$

A standard ridge regression would shrink the $\beta_i^*$ towards 0, which is equivalent to shrinking the original $\beta_i$ towards the specified centering values. To see this, consider a fully shrunk $\beta_4^* = 0$; then $\beta_4 + 5 = 0$ and therefore $\beta_4 = -5$. Shrinkage accomplished!


A slightly more formal demonstration follows. The required objective function in $\beta$ is $$ (Y-X\beta)^T (Y-X\beta) + \lambda(\beta - \gamma)^T(\beta-\gamma) $$ where $\gamma$ is the vector of centring values. In the first term make the substitution $ Y-X\beta=Y-X\gamma-X\beta+X\gamma=Y^*-X\beta^*$ (the errors for a given fit are of course unchanged by this reëxpression of the model), and in the second $\beta-\gamma =\beta^*$, to obtain a zero-centered objective function in $\beta^*$: $$ (Y^* - X\beta^*)^T (Y^* - X\beta^*) + \lambda{\beta^*}^T\beta^* $$
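A minimal R sketch of this reparameterization (added for concreteness; it assumes the glmnet package and simply fabricates example data along the lines of the question — any other ridge routine would do equally well):

# Fabricate some multicollinear example data (for illustration only)
set.seed(123)
n  <- 100
X1 <- rnorm(n, mean = 5)
X2 <- X1 + rnorm(n)
X3 <- X1 + rnorm(n)
X4 <- X1 + rnorm(n)
X  <- cbind(X1, X2, X3, X4)
y  <- 5*X1 + X2 - X3 - 5*X4 + rnorm(n)

library(glmnet)
gamma  <- c(5, 1, -1, -5)                  # centering values from the question
y_star <- drop(y - X %*% gamma)            # Y* = Y - X gamma
fit    <- cv.glmnet(X, y_star, alpha = 0)  # alpha = 0 gives a standard ridge fit
beta_star <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]  # drop the intercept
beta      <- beta_star + gamma             # coefficients shrunk towards gamma, not 0

Here lambda is chosen by cross-validation; the same shift-and-add-back trick works with any shrinkage routine.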

jbowman
  • 38,614
  • 1
    What if you want to do it with logistic regression? – David Masip Jan 04 '19 at 11:10
  • 1
    But how does this solution allow for a given level of shrinkage towards a given centering value of the ridge regression? I don't see a lambda / regularisation parameter appear in here? – Tom Wenseleers Jun 20 '23 at 14:02
  • 1
    Or you are assuming that one has a regular ridge solver available, where you would specify a given regularisation level, and where you shrink to a given center essentially by using a given offset? I think the OP was also asking for some reproducible R code... – Tom Wenseleers Jun 20 '23 at 14:40
  • 3
    @TomWenseleers - this reformulates the problem into a standard linear regression that can be solved using any ridge regression, Lasso, Elastic net, or other shrinkage routines with the centering being at the desired values. So, the answer to "is it possible" is clearly meant to be "yes" and "is there a ridge package" is "yes, any ridge package will do" ... as I say in the text, "a standard ridge regression would..." – jbowman Jun 20 '23 at 14:46
  • 1
    @jbowman OK thanks I see what you mean... You're just assuming you have an external ridge regression package already available... – Tom Wenseleers Jun 20 '23 at 14:50
  • 1
    @TomWenseleers - Yes... those were early days for me on the site, and now I'd write a more complete answer with R code :) Maybe I'll update later. – jbowman Jun 20 '23 at 14:51
  • 1
    @jbowman No worries, was asking for some clarification after someone commented on my post here https://meta.stackexchange.com/questions/387575/examples-of-chatgpt-being-right-and-wrong-on-stack-exchange/390348#390348 where I tried to compare the quality of ChatGPT4 generated answers with some human answers. And where the question was whether your or the ChatGPT4 answer below was the best one in this case. What do you reckon? How would you rate each yourself? Yours and my ChatGPT4 based answer below? – Tom Wenseleers Jun 20 '23 at 14:55
  • 3
    @TomWenseleers the part of the question that is asking how to write specific R code is nowadays off-topic. A question should have a focus on a statistical problem and coding examples can be part of explanations or the background of a problem but should not be the topic itself. Instructions how to write code can be found in many other places, like here: https://stats.meta.stackexchange.com/questions/793/internet-support-for-statistics-software – Sextus Empiricus Jun 20 '23 at 19:16
  • 2
    I think R code would be otiose for something so simple, especially as @TomWenseleers has already provided an illustration for the example given in the question. I've taken the liberty of adding some linear algebra as a footnote, but feel free of course to remove or edit it. (I know it's rather trivial, but perhaps it wasn't immediately clear to readers new to ridge regression that the two methods - yours & Tom's - give the same result.) – Scortchi - Reinstate Monica Jun 21 '23 at 10:17
  • 1
    @Scortchi-ReinstateMonica - Thanks! ... for a 10-year-old Q&A, this sure is getting a lot of attention, indirectly thanks to ChatGPT. – jbowman Jun 21 '23 at 14:49
5

That's known as nonzero centered ridge regression. It can be achieved by row-augmenting your covariate matrix X with a diagonal matrix with sqrt(lambdas) along the diagonal and appending prior.mean * sqrt(lambdas) to your outcome vector, where prior.mean holds the target values towards which your coefficients should be shrunk and lambdas is the vector of desired L2 ridge penalties on your coefficients; you then perform an ordinary least-squares regression of the (row-augmented) outcome vector on the (row-augmented) matrix X. See https://www.sciencedirect.com/science/article/abs/pii/S0047259X21001603 and references therein. You can also compare with this post for the same recipe for a regular zero-centered ridge regression.

For your example:

First, generate some example data:

set.seed(123)

# Number of observations
n <- 100

# Generate some correlated variables
X1 <- rnorm(n, mean = 5)
X2 <- X1 + rnorm(n, mean = 1)
X3 <- X1 + rnorm(n, mean = -1)
X4 <- X1 + rnorm(n, mean = -5)

# Make a covariate matrix X
X <- cbind(X1, X2, X3, X4)

# Generate a response y
y <- 5*X1 + X2 - X3 - 5*X4 + rnorm(n)

# This is our center (the coefficients we believe are "true",
# i.e. the mean of our implicit Gaussian prior)
prior.mean <- c(5, 1, -1, -5)

Next, we can apply a non-zero centered ridge regression. Note that lambda is a regularization parameter and would typically be chosen via cross-validation:

# Regularization parameter 
# note: you can use a vector of lambda values if you would like
# to penalize different coefficients to different extents
# (e.g. penalty on intercept is usually set to zero)
lambda <- 100

# Number of features
p <- ncol(X)

# Augment X
X_augmented <- rbind(X, diag(sqrt(lambda), p))

# Augment y
y_augmented <- c(y, prior.mean * sqrt(lambda))

# Solve the nonzero centered ridge regression as this least-squares problem
beta <- solve(t(X_augmented) %*% X_augmented, t(X_augmented) %*% y_augmented)

In other words, all we need to do to solve this nonzero centered ridge regression is solve a least-squares problem with some extra pseudo-observations added. We can do this with a regular linear model (lm), so there is no need for a dedicated ridge regression package like ridge or glmnet:

lm(y_augmented ~ X_augmented - 1)  # the -1 drops the intercept, matching the solve() call above

For example, if we set lambda to zero we get the standard (zero-intercept) least-squares result:

         [,1]
X1  5.1977104
X2  0.8903286
X3 -1.0616111
X4 -5.1220296

but with a lambda of 100 we get values shrunken towards the center of your Gaussian prior (in a Bayesian sense, nonzero centered ridge regression is equivalent to the MAP estimate under a nonzero-centered Gaussian prior, while standard ridge regression corresponds to a zero-centered Gaussian prior):

         [,1]
X1  5.0477868
X2  0.9773907
X3 -1.0079250
X4 -5.0728437

Adjusting the response to shrink towards your prior mean comes down to the same thing:

y_star <- y - X %*% prior.mean
X_aug <- rbind(X, diag(sqrt(lambda), ncol(X)))
y_aug_star <- c(y_star, rep(0, ncol(X)))
beta_star <- solve(t(X_aug) %*% X_aug, t(X_aug) %*% y_aug_star)
beta <- beta_star + matrix(prior.mean, nrow = length(prior.mean), ncol = 1)

Some of the maths behind this: the objective function for standard (zero-centered) ridge regression is
$$ \arg\min_\beta \left[ (y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta \right] $$
This corresponds to a zero-centered Gaussian prior for $\beta$.

To consider a non-zero centered ridge regression, we adjust the objective function to center at a given target $\gamma$:
$$ \arg\min_\beta \left[ (y - X\beta)^T(y - X\beta) + \lambda(\beta - \gamma)^T(\beta - \gamma) \right] $$

Now, let's reformulate this as a least-squares problem. If we define $z = \begin{bmatrix} y \\ \sqrt{\lambda}\,\gamma \end{bmatrix}$ and $W = \begin{bmatrix} X \\ \sqrt{\lambda}\,I \end{bmatrix}$, where $I$ is the identity matrix, the problem becomes
$$ \arg\min_\beta \left[ (z - W\beta)^T(z - W\beta) \right] $$

To see this, we can expand out the terms:
$$ \begin{aligned} (z - W\beta)^T(z - W\beta) &= z^Tz - z^TW\beta - \beta^TW^Tz + \beta^TW^TW\beta \\ &= (y - X\beta)^T(y - X\beta) + \lambda\gamma^T\gamma - 2\lambda\gamma^T\beta + \lambda\beta^T\beta \\ &= (y - X\beta)^T(y - X\beta) + \lambda(\beta - \gamma)^T(\beta - \gamma) \end{aligned} $$

Therefore, solving the least-squares problem with $z = [y;\ \sqrt{\lambda}\,\gamma]$ and $W = [X;\ \sqrt{\lambda}\,I]$ corresponds to solving the non-zero centered ridge regression problem. The regularization term now encourages the coefficients to be close to $\gamma$ instead of zero.
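As a quick numerical check of this equivalence (a small sketch reusing X, y, lambda, prior.mean and beta from the code above), we can minimize the non-zero centered objective directly with optim() and compare it with the augmented least-squares solution:

# Direct minimization of (y - Xb)'(y - Xb) + lambda * (b - prior.mean)'(b - prior.mean)
obj <- function(b) {
  r <- y - X %*% b
  sum(r^2) + lambda * sum((b - prior.mean)^2)
}
fit_direct <- optim(rep(0, ncol(X)), obj, method = "BFGS")

# The two columns should agree up to numerical tolerance
cbind(direct = fit_direct$par, augmented = as.vector(beta))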

In a Bayesian context, standard ridge regression minimizes the negative log posterior, which is equivalent to minimizing
$$ (y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta $$
This can be viewed as the sum of the negative log-likelihood, $(y - X\beta)^T(y - X\beta)$, and the negative log-prior, $\lambda\beta^T\beta$, where $\lambda\beta^T\beta$ corresponds (up to additive constants) to the negative log of a Gaussian density with mean zero and variance proportional to $1/\lambda$.

If you move to a non-zero centered ridge regression, you are effectively changing the prior to a Gaussian distribution centered around a non-zero mean. Specifically, for a center $\gamma$, the new objective function becomes
$$ (y - X\beta)^T(y - X\beta) + \lambda(\beta - \gamma)^T(\beta - \gamma) $$
which is equivalent to minimizing the sum of the negative log-likelihood $(y - X\beta)^T(y - X\beta)$ and the negative log-prior $\lambda(\beta - \gamma)^T(\beta - \gamma)$. Here, $\lambda(\beta - \gamma)^T(\beta - \gamma)$ corresponds (up to additive constants) to the negative log of a Gaussian density with mean $\gamma$ and variance proportional to $1/\lambda$.

(Full disclosure: this is a ChatGPT4 based answer, but I verified everything for correctness, i.e. I tested the code & checked the maths)

  • 2
    Please don't generate answers with chatGPT especially when there is already a perfectly sufficient answer. This answer is verbose and it could be made more intuitive by shortening it. I would suggest to fully write out the matrix equation from the example. – Sextus Empiricus Jun 13 '23 at 16:16
  • 4
    None of the other answers gave working R code so I thought this was useful... And none made reference to nonzero centered ridge regression, which is what this technique is known as. – Tom Wenseleers Jun 13 '23 at 16:27
  • 1
    @SextusEmpiricus Also the question asked explicitly about "Ridge Regression in R" so working R code I think is quite essential here. The Bayesian approach suggested above would work, but I believe the OP was interested in a faster ML solution. – Tom Wenseleers Jun 13 '23 at 16:37
  • 1
    @SextusEmpiricus The fact that the OP did not flag any of the other answers as the correct one to me also suggests he wasn't happy with any of the other answers... – Tom Wenseleers Jun 15 '23 at 08:33
  • 1
    "weighted least squares"? - it's just ordinary least squares. – Scortchi - Reinstate Monica Jun 20 '23 at 12:48
  • 1
    Yes that "weighted" did not belong there - edited now... – Tom Wenseleers Jun 20 '23 at 12:56
  • Was the word “weighted” provided by you or chatgpt? – Sycorax Jun 20 '23 at 12:58
  • @Sycorax-OnStrike This is what it says to defend its wording - which I can kind of live with : https://www.dropbox.com/s/1xjbxeyzwjxfdvt/chatgpt%20nonzero%20centered%20ridge%20regression.png?dl=1 – Tom Wenseleers Jun 20 '23 at 13:10
  • 1
    I don't think that ChatGPT's explanation in the screenshot holds water. Weighted least squares has the form $\hat \beta = (X^T W X)^{-1}X^T W y$, which is meaningfully different from the ChatGPT explanation. ChatGPT does use the word "loosely," but I don't see any real connection between the two problems. – Sycorax Jun 20 '23 at 13:29
  • @Sycorax-OnStrike Yes of course what you show here is standard weighted LS regression, which this is not. ChatGPTs explanation is that one could still call problem above a weighted LS regression too, since in that particular problem one is augmenting the covariate matrix X and the outcome variable y with extra observations, and the observation weights of the original observations and the augmented observations could be taken to be 1 and lambda here. But yes, I wouldn't use the term "weighted" there no... – Tom Wenseleers Jun 20 '23 at 13:44
2

Ridge regression sets a normal prior centered at zero for each parameter. To penalize coefficients towards different values, just center the priors around your targets instead of around 0. I think you can accomplish this using the bayesglm function of the arm package with the following parameters (a minimal sketch follows below):

  • prior.mean : set this to vector containing the values you want to use as the centers for your priors (the values you want to penalize "towards")
  • prior.df : set this to Inf to get normal priors. Possibly not necessary. There should be a way to control your prior variance, but it's not clear to me how this would be accomplished from looking at the documentation.

http://cran.r-project.org/web/packages/arm/arm.pdf
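A minimal sketch of what this could look like (my own illustration, assuming the variables Y, X1, ..., X4 from the question exist in the workspace; the prior.scale value below is arbitrary and would normally need tuning, e.g. by cross-validation):

library(arm)

fit <- bayesglm(Y ~ X1 + X2 + X3 + X4,
                family      = gaussian(),
                prior.mean  = c(5, 1, -1, -5),  # centers to shrink towards
                prior.scale = 0.5,              # prior SD; smaller values shrink harder
                prior.df    = Inf)              # Inf gives Gaussian rather than t priors
coef(fit)

Note that with the default (fairly diffuse) prior.scale the penalty is weak, which would explain getting essentially the same coefficients as lm(); making prior.scale smaller strengthens the pull towards prior.mean. The exact correspondence between prior.scale and a ridge-style lambda also depends on arm's internal rescaling of the priors (see the scaled argument in the documentation).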

David Marx
  • 7,127
  • Hmm, I'm sure I'm missing something but I tried this and it gave me the same coefficients as a normal lm() regression – James Nov 19 '13 at 05:42
  • You would also still need to set the SD of the priors, which, to mimic ridge regression (interpreted as the MAP solution of a Bayesian LM with Gaussian priors) with a given lambda, would still be a function of the noise, your data, and the desired level of regularisation lambda. I think the OP was rather interested in an ML solution, in which case my answer is better and more convenient I think (and computationally much faster)... – Tom Wenseleers Jun 15 '23 at 08:43