14

I've read that ridge regression can be achieved by simply adding rows to the original data matrix, where each added row uses 0 for the dependent variable and either the square root of $k$ or zero for the independent variables. One extra row is added for each independent variable.

I was wondering whether it is possible to derive a proof for all cases, including logistic regression and other GLMs.

Snowflake
  • 1,185
  • Nope, I got it from http://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf and it was mentioned briefly on page 335-4 – Snowflake Feb 10 '15 at 11:58
  • 1
    Sorry about deleting the comment on you there. I decided I was mistaken before I saw your reply and deleted it. – Glen_b Feb 10 '15 at 12:11
  • 2
    A slight generalization of this problem is asked and answered at http://stats.stackexchange.com/questions/15991. Because it does not address the logistic regression part of this question, I am not voting to merge the two threads. – whuber Feb 10 '15 at 14:13
  • GLMs are fit using iteratively reweighted least squares, as in http://bwlewis.github.io/GLM/, and so within each iteration one can substitute the regular weighted least squares step with a ridge penalized weighted least squares step to get a ridge penalized GLM. In fact, in combination with adaptive ridge penalties this is used to fit L0 penalized GLMs, as in the l0ara package, see https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0159-z and https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148620 – Tom Wenseleers Aug 29 '19 at 19:26

3 Answers

17

Ridge regression minimizes $\sum_{i=1}^n (y_i-x_i^T\beta)^2+\lambda\sum_{j=1}^p\beta_j^2$.

(Often a constant (intercept) is required but not shrunken. In that case it is still included in $\beta$ and the predictors; if you don't want to shrink it, you simply don't add a corresponding pseudo-observation row, and if you do want to shrink it, you do add one. I'll write it as if the intercept is not counted in the $p$ and not shrunken, since that is the more complicated case; the other case is a trivial change from this.)

We can write the second term as $p$ pseudo-observations if we can write each "y" and each of the corresponding $(p+1)$-vectors "x" such that

$(y_{n+j}-x_{n+j}^T\beta)^2=\lambda\beta_j^2\,,\quad j=1,\ldots,p$

But by inspection, simply let $y_{n+j}=0$, let $x_{n+j,j}=\sqrt{\lambda}$ and let all other $x_{n+j,k}=0$ (including $x_{n+j,0}=0$ typically).

Then

$(y_{n+j}-[x_{n+j,0}\beta_0+x_{n+j,1}\beta_1+x_{n+j,2}\beta_2+...+x_{n+j,p}\beta_p])^2=\lambda\beta_j^2$.

This works for linear regression. It doesn't work for logistic regression, because ordinary logistic regression doesn't minimize a sum of squared residuals.

[Ridge regression isn't the only thing that can be done via such pseudo-observation tricks -- they come up in a number of other contexts]
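
For concreteness, here is a minimal R sketch (simulated data, no intercept, variable names made up for this illustration, not part of the original answer) checking that OLS on the augmented data reproduces the closed-form ridge estimate:

set.seed(42)
n = 50; p = 3; lambda = 2
X = matrix(rnorm(n * p), n, p)
y = drop(X %*% c(1, -1, 0.5)) + rnorm(n)

## closed-form ridge estimate (no intercept, so every coefficient is shrunken)
beta_ridge = drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))

## p pseudo-observations: response 0, sqrt(lambda) in the corresponding column
X_aug = rbind(X, sqrt(lambda) * diag(p))
y_aug = c(y, rep(0, p))
beta_aug = unname(coef(lm(y_aug ~ X_aug - 1)))

all.equal(beta_ridge, beta_aug)   # TRUE: OLS on the augmented data equals ridge

With an unpenalized intercept one would instead put a 0 in the intercept column of each pseudo-observation, as described above.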

Glen_b
  • 282,281
  • Thanks, I was already struggling with rewriting everything from logistic regression, but I simply couldn't implement the phoney data method. And I don't trust my own abilities sufficiently to be able to say that it is impossible. – Snowflake Feb 10 '15 at 12:28
  • At least I don't think it is. I'll take another look at the likelihood function. – Glen_b Feb 10 '15 at 12:31
  • 3
    +1 Additional related regression tricks are introduced in answers at http://stats.stackexchange.com/a/32753 and http://stats.stackexchange.com/a/26187, inter alia. – whuber Feb 10 '15 at 14:15
  • GLMs are fit using iteratively reweighted least squares though, as in http://bwlewis.github.io/GLM/, and so within each iteration one can substitute the regular weighted least squares step with a ridge penalized weighted least squares step to get a ridge penalized GLM. In fact, in combination with adaptive ridge penalties this is used to fit L0 penalized GLMs, as in the l0ara package, see https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0159-z and https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148620 – Tom Wenseleers Aug 29 '19 at 19:25
  • @TomWenseleers thanks, yes, that makes complete sense – Glen_b Aug 30 '19 at 01:05
  • @Glen_b Thanks for this - I've added another answer with some more details... – Tom Wenseleers Aug 30 '19 at 08:06
1

Generalizing this recipe to GLMs is indeed not difficult, as GLMs are usually fit using iteratively reweighted least squares. Hence, within each iteration one can substitute the regular weighted least squares step with a ridge penalized weighted least squares step to get a ridge penalized GLM. In fact, in combination with adaptive ridge penalties this recipe is used to fit L0 penalized GLMs (aka best subset regression, i.e. GLMs where the number of nonzero coefficients is penalized). This has been implemented, for example, in the l0ara package; see this paper and this one for details.
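
To make the recipe concrete, here is a minimal sketch of ridge-penalized IRLS for logistic regression (illustrative only; the function name and defaults are made up for this example and this is not the l0ara implementation). The only change from ordinary IRLS is that the weighted least squares update includes the ridge penalty matrix:

ridge_logistic_irls = function (X, y, lambda, maxit = 100, tol = 1e-10) {
  X1 = cbind(1, X)                           # add intercept column
  pen = diag(c(0, rep(lambda, ncol(X))))     # leave the intercept unpenalized
  beta = rep(0, ncol(X1))
  for (it in seq_len(maxit)) {
    eta = drop(X1 %*% beta)
    mu = pmin(pmax(1/(1 + exp(-eta)), 1e-8), 1 - 1e-8)   # inverse logit, clamped
    w = mu * (1 - mu)                        # IRLS weights
    z = eta + (y - mu)/w                     # working response
    # ridge-penalized weighted least squares step
    beta_new = drop(solve(crossprod(X1, w * X1) + pen, crossprod(X1, w * z)))
    if (max(abs(beta_new - beta)) < tol) { beta = beta_new; break }
    beta = beta_new
  }
  beta
}

With lambda = 0 this reduces to ordinary IRLS and should match glm(y ~ X, family = binomial()) up to convergence tolerance.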

It's also worth noting that the fastest closed-form way of solving a regular ridge regression is using

lmridge_solve = function (X, y, lambda, intercept = TRUE) {
  if (intercept) {
    lambdas = c(0, rep(lambda, ncol(X)))   # do not penalize the intercept
    X = cbind(1, X)                        # add intercept column
  } else { lambdas = rep(lambda, ncol(X)) }
  # solve the ridge normal equations (X'X + diag(lambdas)) beta = X'y
  solve(crossprod(X) + diag(lambdas), crossprod(X, y))[, 1]
}

for the case where n>=p, or using

# dual (kernel) form of the ridge solution: beta = X'(XX' + lambda*I)^(-1) Y
lmridge_solve_largep = function (X, Y, lambda) (t(X) %*% solve(tcrossprod(X) + lambda*diag(nrow(X)), Y))[,1]

when p>n and for a model without intercept.

This is faster than using the row augmentation recipe, i.e. doing

lmridge_rbind = function (X, y, lambda, intercept = TRUE) {
  if (intercept) {
    lambdas = c(0, rep(lambda, ncol(X)))   # do not penalize the intercept
    X = cbind(1, X)
  } else { lambdas = rep(lambda, ncol(X)) }
  # append one pseudo-observation per coefficient: sqrt(lambda) in X, 0 in y
  qr.solve(rbind(X, diag(sqrt(lambdas))), c(y, rep(0, ncol(X))))
}
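
As a quick sanity check (simulated data, not part of the original answer), the three routes above return the same coefficients in the no-intercept, p > n setting:

set.seed(123)
n = 20; p = 50; lambda = 2
X = matrix(rnorm(n * p), n, p)
y = rnorm(n)
b1 = lmridge_solve(X, y, lambda, intercept = FALSE)
b2 = lmridge_solve_largep(X, y, lambda)
b3 = lmridge_rbind(X, y, lambda, intercept = FALSE)
max(abs(b1 - b2))   # primal and dual closed forms agree up to numerical error
max(abs(b1 - b3))   # row augmentation gives the same fit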

If you happen to need nonnegativity constraints on the fitted coefficients, you can do

library(nnls)

nnlmridge_solve = function (X, y, lambda, intercept = TRUE) {
  if (intercept) {
    lambdas = c(0, rep(lambda, ncol(X)))   # do not penalize the intercept
    X = cbind(1, X)
  } else { lambdas = rep(lambda, ncol(X)) }
  # nonnegative least squares applied to the ridge normal equations
  nnls(A = crossprod(X) + diag(lambdas), b = crossprod(X, y))$x
}

which incidentally gives a slightly more accurate result than

nnlmridge_rbind = function (X, y, lambda, intercept = TRUE) {
  if (intercept) {
    lambdas = c(0, rep(lambda, ncol(X)))   # do not penalize the intercept
    X = cbind(1, X)
  } else { lambdas = rep(lambda, ncol(X)) }
  # nonnegative least squares on the row-augmented system
  nnls(A = rbind(X, diag(sqrt(lambdas))), b = c(y, rep(0, ncol(X))))$x
}

(and strictly speaking only the solution nnls(A = crossprod(X) + diag(lambdas), b = crossprod(X, y))$x is then the correct one).

I haven't yet figured out how the nonnegativity constrained case could be further optimized for the p > n case - let me know if anyone happens to know how to do this... [lmridge_nnls_largep = function (X, Y, lambda) t(X) %*% nnls(A=tcrossprod(X)+lambda*diag(nrow(X)), b=Y)$x doesn't work]

0

One could use row augmentation to get something that approximates the ridge penalty.

The GLM regression with response $y_i$ and predictors $X_i$ minimizes the terms

$$w_i\frac{(y_i-g(\beta X_i))^2}{V(g(\beta X_i))}$$

where $g$ is the inverse link function, $V$ a function that describes the variance as a function of the mean, and $w_i$ an additional weight.

For logistic regression we could add rows with $y_i = 0.5$, a very small value for one of the $x_{ij}$ in $X_i$, and zeros for the other entries, so that the inverse link function is approximately linear:

$$g(\beta X_i) = \frac{1}{1+e^{-\beta_j x_{ij}}} \approx \frac{1}{2} + \frac{x_{ij}}{4} \beta_j$$

and with $V(\mu) = \mu(1-\mu)$ for logistic regression, the additional term will be approximately

$$w_i\frac{(y_i-g(\beta X_i))^2}{V(g(\beta X_i))} \approx w_i\frac{(0.5-(0.5+0.25 x_{ij} \beta_{j}))^2}{0.5^2} = 0.25 w_i x_{ij}^2 \beta_j^2$$

The amount of regularisation can then be tuned by changing the weight $w_i$.

Demonstration with some code:

library(glmnet)

### settings
n = 100
lambda = 1
set.seed(1)

### create data
x0 = rep(1,n)
x1 = runif(n,-3,3)
x2 = runif(n,-3,3)
p = 1/(1+exp(-x0-x1-x2))
y = rbinom(n,1,p)

### logistic regression with ridge penalty, using the glmnet function
mod1 = glmnet(cbind(x1,x2), y, family = binomial(link = "logit"), alpha = 0,
              lambda = lambda, thresh = 10^-12, standardize = FALSE)

### augment data for the penalty
x = 0.001
w_penalty = 4*lambda*n/x^2
y  = c(y, 0.5, 0.5)
x0 = c(x0, 0, 0)
x1 = c(x1, x, 0)
x2 = c(x2, 0, x)

### create a weights vector
w = c(rep(1,n), rep(w_penalty, 2))

### logistic regression with ridge penalty, using the augmented data
mod2 = glm(y ~ 0+x0+x1+x2, family = binomial(), weights = w,
           control = glm.control(maxit = 10^5, epsilon = 10^-12))

mod1$a0
mod1$beta
#        s0
# 0.6600794
# 2 x 1 sparse Matrix of class "dgCMatrix"
#           s0
# x1 0.2250353
# x2 0.2392547

mod2$coefficients
#        x0        x1        x2
# 0.6600794 0.2250353 0.2392547

The same approach may be used for other link functions and distribution families whenever the inverse link function has a linear approximation near zero, $g(\beta) \approx a+b\beta$.
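
For instance (a worked case not in the original answer), with a Poisson family and log link we have $g(\eta) = e^{\eta} \approx 1 + \eta$ near zero and $V(\mu) = \mu$, so an augmented row with $y_i = 1$, a single small entry $x_{ij} = x$ and weight $w_i$ contributes approximately

$$w_i\frac{(y_i-g(\beta X_i))^2}{V(g(\beta X_i))} \approx w_i\frac{(1-(1+x\beta_j))^2}{1} = w_i x^2 \beta_j^2,$$

so the strength of the ridge penalty on $\beta_j$ is again controlled through the choice of $w_i$.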