
I'm working through the math of ridge regression and am having trouble replicating some R functions. My question is NOT about R per se; I only use it here for a reproducible example, since what I'm really trying to learn is how to build an implementation. For now, I'm assuming the R implementations are correct and that mine is not.

Let $\bf{X}$ be an $n \times p$ model matrix, $\bf{y}$ an $n \times 1$ vector of outcomes, $\lambda$ a scalar, and $\bf{I}$ the $p \times p$ identity matrix. The algebraic representation of the estimator is written

$$ \bf{\hat{\beta}} = (\bf{X}'\bf{X} + \lambda\bf{I})^{-1}\bf{X}'\bf{y} $$

I use the term "algebraic" representation to mean a way to write the estimator; a computational implementation might instead use a decomposition (e.g., QR/Cholesky) or augment the rows of $\bf{X}$ and $\bf{y}$. With that said, for the purposes of my question, my code follows the algebraic representation for transparency.
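(As an aside, here is a minimal sketch of the augmented-data representation mentioned above, written against the X, y, and best_lambda objects created in the code further down; the omitted row for the intercept column reflects my assumption that the intercept is not penalized.)

### Augmented-data view of ridge (sketch): append sqrt(lambda) * I to X (no row for the intercept column) and zeros to y, then solve ordinary least squares
p     <- ncol(X)
X_aug <- rbind(X, sqrt(best_lambda) * diag(p)[-1, ])
y_aug <- c(y, rep(0, p - 1))
solve(crossprod(X_aug)) %*% crossprod(X_aug, y_aug)  # equals the formula above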

This link provides an example for learning ridge regression with the glmnet package, and my code below is based on the objects created in that example: https://github.com/Statology/R-Guides/blob/main/ridge_regression.R

### Add intercept
X <- cbind(1, x)
### Ridge: zero in the first diagonal entry so the intercept is not penalized; 4 penalized predictors
theDiagonal <- diag(c(0, rep(best_lambda, 4)))
solve(crossprod(X) + theDiagonal) %*% crossprod(X, y)
### Do as an OLS (sanity check against lm)
lm(y ~ 0 + X)
solve(crossprod(X)) %*% crossprod(X, y)

The results of my home baked ridge estimation do not match those coming from the glmnet functions, but my home baked OLS does match internal OLS functions.

My question partly duplicates the question below, but the answer there is not entirely transparent to me.

https://stackoverflow.com/questions/18163589/customized-ridge-regression-function

I've made a few assumptions here. First, no penalty is added to the intercept, and no rescaling of the predictors is used; if rescaling is needed, I'm unclear on why. Nonetheless, I have also tried scaling the predictors to be marginally $x_i \sim N(0,1)$, and the results do not match even when doing so (roughly as in the sketch below).
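(For concreteness, this is roughly what that attempt looks like: a sketch of my own standardize-then-back-transform approach, not what glmnet does internally.)

### Fit the ridge formula on centered, unit-variance columns, then map back
x_c <- scale(x)                      # centered, scaled predictors
y_c <- y - mean(y)                   # centered response
b_s <- solve(crossprod(x_c) + best_lambda * diag(ncol(x_c))) %*% crossprod(x_c, y_c)
b   <- as.vector(b_s) / attr(x_c, "scaled:scale")     # back to the original units
b0  <- mean(y) - sum(attr(x_c, "scaled:center") * b)  # recover the intercept
c("(Intercept)" = b0, setNames(b, colnames(x)))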

Clearly I have a misunderstanding and would appreciate any guidance to help me in my learning.

EDIT: Adding in code and results from glmnet per suggestion

library(glmnet)

#define predictor and response variables
y <- mtcars$hp
x <- data.matrix(mtcars[, c('mpg', 'wt', 'drat', 'qsec')])

#fit ridge regression model
model <- glmnet(x, y, alpha = 0)
summary(model)

#perform k-fold cross-validation to find optimal lambda value
cv_model <- cv.glmnet(x, y, alpha = 0)
plot(cv_model)

#find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda

#produce Ridge trace plot
plot(model, xvar = "lambda")

#find coefficients of best model
best_model <- glmnet(x, y, alpha = 0, lambda = best_lambda)
coef(best_model)

5 x 1 sparse Matrix of class "dgCMatrix"
                    s0
(Intercept) 479.381278
mpg          -3.261227
wt           21.440464
drat          1.169642
qsec        -19.069106

My code yields the following:

> solve(crossprod(X) + theDiagonal) %*% crossprod(X,y)
            [,1]
     523.7308078
mpg   -4.7963814
wt    12.8262329
drat   0.9825495
qsec -18.2354511
  • Why is theDiagonal defined that way? First of all, why the "0" and the "4" in the definition? Secondly, are you sure it is an order-$p$ matrix instead of a vector? – Zhanxiong Nov 14 '22 at 16:04
  • Hi @Zhanxiong. My code should have made theDiagonal a matrix and not a vector. That was an error. Code in the question is edited now. – dhc Nov 14 '22 at 16:15
  • It would be nice if you could post a true reproducible example indicating the discrepancy between the results you are targeting and your actual results. This could be a misunderstanding about what the glmnet() code does and not an error in your representation of RR. – Noah Nov 14 '22 at 16:22
  • If $p = 4$, then shouldn't theDiagonal be defined as theDiagonal <- diag(rep(best_lambda, 4))? $X$, of course, needs to be centered first. – Zhanxiong Nov 14 '22 at 16:26
  • @Zhanxiong that's part of my question. In my assumptions above I state that I did not add the penalty to the intercept, and I could be wrong. If we do as you suggest, it would add the value of best_lambda to all diagonal elements, including the intercept. – dhc Nov 14 '22 at 16:27
  • What is the question? Can you bring it out more clearly in this relatively long story? – Sextus Empiricus Nov 14 '22 at 16:29
  • @Noah, code updated per suggestion – dhc Nov 14 '22 at 16:32
  • @SextusEmpiricus the question is why my home-baked implementation of RR does not match glmnet's implementation. My assumption is that the code above should be comparable--an assumption I am not certain of. – dhc Nov 14 '22 at 16:38
  • @dhc Thank you for that. Your code and question are clear to me but unfortunately I do not know the answer to this question. It's possible glmnet() scales its lambda in a particular way. For example, if I divide best_lambda by 2.103735, I get approximately the same coefficients when using this value in the manual calculation. Not sure why. – Noah Nov 14 '22 at 16:56
  • I know the answer to these issues. First: glmnet() standardizes the variables by default, so you want standardize = FALSE. Second: https://stats.stackexchange.com/questions/126109/coefficient-value-from-glmnet i.e. the betas are further scaled by the size of the dataset; see https://stats.stackexchange.com/questions/155362 for further discussion too. In short, welcome to the glmnet hole, where all roads lead to reading FORTRAN code. Your code and understanding are correct so (+1). (A sketch combining these suggestions follows the comment list below.) – usεr11852 Nov 14 '22 at 17:01
  • (And do not forget we are using coordinate descent here, right? It's not the world's most precise solver, so some minor movements away from the optimum are also in store.) – usεr11852 Nov 14 '22 at 17:08
  • (And do not forget that cv.glmnet needs some seed fixing first, otherwise we get a different best_lambda given how small our dataset is.) – usεr11852 Nov 14 '22 at 17:19
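Putting the comment suggestions together, a rough sketch of the comparison might look like the following. The standardize = FALSE flag and the thresh argument are real glmnet options; the factor-of-$n$ rescaling of lambda is an assumption based on glmnet dividing the squared-error loss by $n$, and the linked threads suggest additional internal scaling may still leave a gap.

### Sketch: turn off glmnet's standardization and compare against a lambda rescaled by n
n <- nrow(x)
fit_raw <- glmnet(x, y, alpha = 0, lambda = best_lambda, standardize = FALSE)
coef(fit_raw)

### textbook formula with lambda scaled by n (assumption), intercept unpenalized
theDiagonal_n <- diag(c(0, rep(n * best_lambda, ncol(x))))
solve(crossprod(X) + theDiagonal_n) %*% crossprod(X, y)

### coordinate descent stops at a tolerance; a smaller thresh tightens the fit
fit_tight <- glmnet(x, y, alpha = 0, lambda = best_lambda, standardize = FALSE,
                    thresh = 1e-14)
coef(fit_tight)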

0 Answers