14

I've read that ridge regression can be achieved by simply adding rows to the original data matrix, where each added row uses 0 for the dependent variable and either the square root of $k$ or zero for the independent variables. One extra row is added for each independent variable.

I was wondering whether it is possible to derive a proof for all cases, including logistic regression and other GLMs.

Snowflake
  • 1,185
  • Nope, I got it from http://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf and it was mentioned briefly on page 335-4 – Snowflake Feb 10 '15 at 11:58
  • 1
    Sorry about deleting the comment on you there. I decided I was mistaken before I saw your reply and deleted it. – Glen_b Feb 10 '15 at 12:11
  • 2
    A slight generalization of this problem is asked and answered at http://stats.stackexchange.com/questions/15991. Because it does not address the logistic regression part of this question, I am not voting to merge the two threads. – whuber Feb 10 '15 at 14:13
  • GLMs are fit using iteratively reweighted least squares, as in http://bwlewis.github.io/GLM/, and so within each iteration one can substitute the regular weighted least squares step with a ridge penalized weighted least squares step to get a ridge penalized GLM. In fact, in combination with adaptive ridge penalties this is used to fit L0 penalized GLMs, as in the l0ara package, see https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0159-z and https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148620 – Tom Wenseleers Aug 29 '19 at 19:26

3 Answers

17

Ridge regression minimizes $\sum_{i=1}^n (y_i-x_i^T\beta)^2+\lambda\sum_{j=1}^p\beta_j^2$.

(Often a constant (intercept) is required but not shrunken. In that case it is still included in $\beta$ and the predictors; if you don't want to shrink it, you simply don't add a corresponding pseudo-observation row, and if you do want to shrink it, you do add one. I'll write it as if the intercept is not counted in the $p$ and not shrunken, since that is the more complicated case; the other case is a trivial change from this.)

We can write the second term as $p$ pseudo-observations if we can write each "y" and each of the corresponding $(p+1)$-vectors "x" such that

$(y_{n+j}-x_{n+j}^T\beta)^2=\lambda\beta_j^2\,,\quad j=1,\ldots,p$

But by inspection, simply let $y_{n+j}=0$, let $x_{n+j,j}=\sqrt{\lambda}$ and let all other $x_{n+j,k}=0$ (including $x_{n+j,0}=0$ typically).

Then

$(y_{n+j}-[x_{n+j,0}\beta_0+x_{n+j,1}\beta_1+x_{n+j,2}\beta_2+...+x_{n+j,p}\beta_p])^2=\lambda\beta_j^2$.

This works for linear regression. It doesn't work for logistic regression, because ordinary logistic regression doesn't minimize a sum of squared residuals.

[Ridge regression isn't the only thing that can be done via such pseudo-observation tricks -- they come up in a number of other contexts]
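
For concreteness, here is a minimal R sketch (simulated data, no intercept, variable names made up for this illustration, not part of the original answer) checking that OLS on the augmented data reproduces the closed-form ridge estimate:

set.seed(42)
n = 50; p = 3; lambda = 2
X = matrix(rnorm(n * p), n, p)
y = drop(X %*% c(1, -1, 0.5)) + rnorm(n)

## closed-form ridge estimate (no intercept, so every coefficient is shrunken)
beta_ridge = drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))

## p pseudo-observations: response 0, sqrt(lambda) in the corresponding column
X_aug = rbind(X, sqrt(lambda) * diag(p))
y_aug = c(y, rep(0, p))
beta_aug = unname(coef(lm(y_aug ~ X_aug - 1)))

all.equal(beta_ridge, beta_aug)   # TRUE: OLS on the augmented data equals ridge

With an unpenalized intercept one would instead put a 0 in the intercept column of each pseudo-observation, as described above.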

Glen_b
  • 282,281
  • Thanks, I was already struggling with rewriting everything from logistic regression, but I simply couldn't implement the phoney data method. And I don't trust my own abilities sufficiently to be able to say that it is impossible. – Snowflake Feb 10 '15 at 12:28
  • At least I don't think it is. I'll take another look at the likelihood function. – Glen_b Feb 10 '15 at 12:31
  • 3
    +1 Additional related regression tricks are introduced in answers at http://stats.stackexchange.com/a/32753 and http://stats.stackexchange.com/a/26187, inter alia. – whuber Feb 10 '15 at 14:15
  • GLMs are fit using iteratively reweighted least squares though, as in http://bwlewis.github.io/GLM/, and so within each iteration one can substitute the regular weighted least squares step with a ridge penalized weighted least squares step to get a ridge penalized GLM. In fact, in combination with adaptive ridge penalties this is used to fit L0 penalized GLMs, as in the l0ara package, see https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0159-z and https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148620 – Tom Wenseleers Aug 29 '19 at 19:25
  • @TomWenseleers thanks, yes, that makes complete sense – Glen_b Aug 30 '19 at 01:05
  • @Glen_b Thanks for this - I've added another answer with some more details... – Tom Wenseleers Aug 30 '19 at 08:06
1

Generalizing this recipe to GLMs is indeed not difficult, as GLMs are usually fit using iteratively reweighted least squares. Hence, within each iteration one can substitute the regular weighted least squares step with a ridge penalized weighted least squares step to get a ridge penalized GLM. In fact, in combination with adaptive ridge penalties this recipe is used to fit L0 penalized GLMs (aka best subset regression, i.e. GLMs where the number of nonzero coefficients is penalized). This has been implemented, for example, in the l0ara package; see this paper and this one for details.
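
To make the recipe concrete, here is a minimal sketch of ridge-penalized IRLS for logistic regression (illustrative only; the function name and defaults are made up for this example and this is not the l0ara implementation). The only change from ordinary IRLS is that the weighted least squares update includes the ridge penalty matrix:

ridge_logistic_irls = function (X, y, lambda, maxit = 100, tol = 1e-10) {
  X1 = cbind(1, X)                           # add intercept column
  pen = diag(c(0, rep(lambda, ncol(X))))     # leave the intercept unpenalized
  beta = rep(0, ncol(X1))
  for (it in seq_len(maxit)) {
    eta = drop(X1 %*% beta)
    mu = pmin(pmax(1/(1 + exp(-eta)), 1e-8), 1 - 1e-8)   # inverse logit, clamped
    w = mu * (1 - mu)                        # IRLS weights
    z = eta + (y - mu)/w                     # working response
    # ridge-penalized weighted least squares step
    beta_new = drop(solve(crossprod(X1, w * X1) + pen, crossprod(X1, w * z)))
    if (max(abs(beta_new - beta)) < tol) { beta = beta_new; break }
    beta = beta_new
  }
  beta
}

With lambda = 0 this reduces to ordinary IRLS and should match glm(y ~ X, family = binomial()) up to convergence tolerance.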

It's also worth noting that the fastest closed-form way of solving a regular ridge regression is using

lmridge_solve = function (X, y, lambda, intercept = TRUE) {
  if (intercept) {
    lambdas = c(0, rep(lambda, ncol(X)))   # do not penalize the intercept
    X = cbind(1, X)                        # add intercept column
  } else { lambdas = rep(lambda, ncol(X)) }
  # solve the ridge normal equations (X'X + diag(lambdas)) beta = X'y
  solve(crossprod(X) + diag(lambdas), crossprod(X, y))[, 1]
}

for the case where n>=p, or using

# dual (kernel) form of the ridge solution: beta = X'(XX' + lambda*I)^(-1) Y
lmridge_solve_largep = function (X, Y, lambda) (t(X) %*% solve(tcrossprod(X) + lambda*diag(nrow(X)), Y))[,1]

when p>n and for a model without intercept.

This is faster than using the row augmentation recipe, i.e. doing

lmridge_rbind = function (X, y, lambda, intercept = TRUE) {
  if (intercept) {
    lambdas = c(0, rep(lambda, ncol(X)))   # do not penalize the intercept
    X = cbind(1, X)
  } else { lambdas = rep(lambda, ncol(X)) }
  # append one pseudo-observation per coefficient: sqrt(lambda) in X, 0 in y
  qr.solve(rbind(X, diag(sqrt(lambdas))), c(y, rep(0, ncol(X))))
}
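
As a quick sanity check (simulated data, not part of the original answer), the three routes above return the same coefficients in the no-intercept, p > n setting:

set.seed(123)
n = 20; p = 50; lambda = 2
X = matrix(rnorm(n * p), n, p)
y = rnorm(n)
b1 = lmridge_solve(X, y, lambda, intercept = FALSE)
b2 = lmridge_solve_largep(X, y, lambda)
b3 = lmridge_rbind(X, y, lambda, intercept = FALSE)
max(abs(b1 - b2))   # primal and dual closed forms agree up to numerical error
max(abs(b1 - b3))   # row augmentation gives the same fit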

If you happen to need nonnegativity constraints on the fitted coefficients, you can do

library(nnls)

nnlmridge_solve = function (X, y, lambda, intercept = TRUE) {
  if (intercept) {
    lambdas = c(0, rep(lambda, ncol(X)))   # do not penalize the intercept
    X = cbind(1, X)
  } else { lambdas = rep(lambda, ncol(X)) }
  # nonnegative least squares applied to the ridge normal equations
  nnls(A = crossprod(X) + diag(lambdas), b = crossprod(X, y))$x
}

which incidentally gives a slightly more accurate result than

nnlmridge_rbind = function (X, y, lambda, intercept = TRUE) {
  if (intercept) {
    lambdas = c(0, rep(lambda, ncol(X)))   # do not penalize the intercept
    X = cbind(1, X)
  } else { lambdas = rep(lambda, ncol(X)) }
  # nonnegative least squares on the row-augmented system
  nnls(A = rbind(X, diag(sqrt(lambdas))), b = c(y, rep(0, ncol(X))))$x
}

(and strictly speaking only the solution nnls(A = crossprod(X) + diag(lambdas), b = crossprod(X, y))$x is then the correct one).

I haven't yet figured out how the nonnegativity constrained case could be further optimized for the p > n case - let me know if anyone happens to know how to do this... [lmridge_nnls_largep = function (X, Y, lambda) t(X) %*% nnls(A=tcrossprod(X)+lambda*diag(nrow(X)), b=Y)$x doesn't work]

0

One could use row augmentation to get something that approximates the ridge penalty.

The GLM regression with response $y_i$ and predictors $X_i$ minimizes the terms

$$w_i\frac{(y_i-g(\beta X_i))^2}{V(g(\beta X_i))}$$

where $g$ is the inverse link function, $V$ a function that describes the variance as a function of the mean, and $w_i$ an additional weight.

For logistic regression we could add rows with $y_i = 0.5$, a very small value for one of the $x_{ij}$ in $X_i$, and zeros for the other entries, so that the inverse link function is approximately linear:

$$g(\beta X_i) = \frac{1}{1+e^{-\beta_j x_{ij}}} \approx \frac{1}{2} + \frac{x_{ij}}{4} \beta_j$$

and with $V(\mu) = \mu(1-\mu)$ for logistic regression, the additional term will be approximately

$$w_i\frac{(y_i-g(\beta X_i))^2}{V(g(\beta X_i))} \approx w_i\frac{(0.5-(0.5+0.25 x_{ij} \beta_{j}))^2}{0.5^2} = 0.25 w_i x_{ij}^2 \beta_j^2$$

The amount of regularisation can then be tuned by changing the weight $w_i$.

Demonstration with some code:

library(glmnet)

### settings
n = 100
lambda = 1
set.seed(1)

### create data
x0 = rep(1,n)
x1 = runif(n,-3,3)
x2 = runif(n,-3,3)
p = 1/(1+exp(-x0-x1-x2))
y = rbinom(n,1,p)

### logistic regression with ridge penalty, using the glmnet function
mod1 = glmnet(cbind(x1,x2), y, family = binomial(link = "logit"), alpha = 0,
              lambda = lambda, thresh = 10^-12, standardize = FALSE)

### augment data for the penalty
x = 0.001
w_penalty = 4*lambda*n/x^2
y  = c(y, 0.5, 0.5)
x0 = c(x0, 0, 0)
x1 = c(x1, x, 0)
x2 = c(x2, 0, x)

### create a weights vector
w = c(rep(1,n), rep(w_penalty, 2))

### logistic regression with ridge penalty, using the augmented data
mod2 = glm(y ~ 0+x0+x1+x2, family = binomial(), weights = w,
           control = glm.control(maxit = 10^5, epsilon = 10^-12))

mod1$a0
mod1$beta
#        s0
# 0.6600794
# 2 x 1 sparse Matrix of class "dgCMatrix"
#           s0
# x1 0.2250353
# x2 0.2392547

mod2$coefficients
#        x0        x1        x2
# 0.6600794 0.2250353 0.2392547

The same approach may be used for other link functions and distribution families whenever the inverse link function has a linear approximation near zero, $g(\beta) \approx a+b\beta$.
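
For instance (a worked case not in the original answer), with a Poisson family and log link we have $g(\eta) = e^{\eta} \approx 1 + \eta$ near zero and $V(\mu) = \mu$, so an augmented row with $y_i = 1$, a single small entry $x_{ij} = x$ and weight $w_i$ contributes approximately

$$w_i\frac{(y_i-g(\beta X_i))^2}{V(g(\beta X_i))} \approx w_i\frac{(1-(1+x\beta_j))^2}{1} = w_i x^2 \beta_j^2,$$

so the strength of the ridge penalty on $\beta_j$ is again controlled through the choice of $w_i$.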