I'm working through the math of ridge regression and am having trouble replicating some R functions. My question is NOT about R per se; I will just use it here for a reproducible example, but really I am trying to learn how to build an implementation. For now I'm assuming the R implementations are correct and that mine is not.
Let $\bf{X}$ be an $n \times p$ model matrix, $\bf{y}$ an $n \times 1$ vector of outcomes, $\lambda$ a scalar penalty, and $\bf{I}$ the $p \times p$ identity matrix. The algebraic representation of the estimator is written
$$ \bf{\hat{\beta}} = (\bf{X}'\bf{X} + \lambda\bf{I})^{-1}\bf{X}'\bf{y} $$
I use the term "algebraic" representation to mean a way to write the estimator; the computational implementation might instead use a decomposition (e.g., QR/Cholesky) or augment the rows of $\bf{X}$ and $\bf{y}$. With that said, for the purposes of my question, my code will mirror the algebraic representation for transparency.
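As an aside, the row-augmentation version mentioned above can be sketched directly. This is a minimal illustration (the function name `ridge_augmented` is mine, not from any package); the intercept is left unpenalized by giving it a zero augmentation row:

```r
# Ridge via row augmentation: append sqrt(lambda) * I rows to X and
# zeros to y, so that OLS on the augmented data equals the algebraic
# ridge estimator. A zero row for the intercept leaves it unpenalized.
ridge_augmented <- function(X, y, lambda, penalize = rep(TRUE, ncol(X))) {
  p <- ncol(X)
  X_aug <- rbind(X, sqrt(lambda) * diag(ifelse(penalize, 1, 0)))
  y_aug <- c(y, rep(0, p))
  solve(crossprod(X_aug)) %*% crossprod(X_aug, y_aug)
}
```

With the objects from the example below, `ridge_augmented(cbind(1, x), y, best_lambda, penalize = c(FALSE, rep(TRUE, 4)))` reproduces the closed-form `solve(crossprod(X) + theDiagonal) %*% crossprod(X, y)`.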
This link provides an example used for learning ridge regression with the glmnet package; my code below is based on the objects created in that example: https://github.com/Statology/R-Guides/blob/main/ridge_regression.R
### Add intercept
X <- cbind(1, x)
### Penalty matrix: no penalty on the intercept, best_lambda on the 4 predictors
theDiagonal <- diag(c(0, rep(best_lambda, 4)))
solve(crossprod(X) + theDiagonal) %*% crossprod(X, y)
### Do as an OLS
lm(y~0+X)
solve(crossprod(X)) %*% crossprod(X,y)
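As a sanity check, the two OLS fits above can be compared directly. A self-contained version using the same mtcars variables that appear later in the post; `all.equal` should confirm they agree up to floating-point tolerance:

```r
# Sanity check on mtcars (same variables as the glmnet example):
# the closed-form OLS solution should match lm()'s coefficients
# up to floating-point tolerance.
y <- mtcars$hp
X <- cbind(1, data.matrix(mtcars[, c("mpg", "wt", "drat", "qsec")]))
beta_manual <- solve(crossprod(X)) %*% crossprod(X, y)
all.equal(unname(coef(lm(y ~ 0 + X))), c(beta_manual))  # TRUE
```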
The results of my home-baked ridge estimation do not match those coming from the glmnet functions, but my home-baked OLS does match R's built-in OLS.
My question does seem to partly duplicate this one, but the answer there is not entirely transparent to me: https://stackoverflow.com/questions/18163589/customized-ridge-regression-function
I've made a few assumptions here. First, no penalty is added to the intercept, and no rescaling of the predictors is used. If rescaling is needed, I'm unclear on why. Nonetheless, I have also tried scaling the predictors to be marginally $x_i \sim N(0,1)$, and I still do not match when doing it that way.
Clearly I have a misunderstanding and would appreciate any guidance to help me in my learning.
EDIT: Adding code and results from glmnet, per the suggestion in the comments.
library(glmnet)
#define predictor and response variables
y <- mtcars$hp
x <- data.matrix(mtcars[, c('mpg', 'wt', 'drat', 'qsec')])
#fit ridge regression model
model <- glmnet(x, y, alpha = 0)
summary(model)
#perform k-fold cross-validation to find optimal lambda value
cv_model <- cv.glmnet(x, y, alpha = 0)
plot(cv_model)
#find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda
#produce Ridge trace plot
plot(model, xvar = "lambda")
#find coefficients of best model
best_model <- glmnet(x, y, alpha = 0, lambda = best_lambda)
coef(best_model)
5 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 479.381278
mpg -3.261227
wt 21.440464
drat 1.169642
qsec -19.069106
My code yields
> solve(crossprod(X) + theDiagonal) %*% crossprod(X,y)
[,1]
523.7308078
mpg -4.7963814
wt 12.8262329
drat 0.9825495
qsec -18.2354511
Comments:

- Why is `theDiagonal` defined that way? First of all, why "0" and "4" in the definition? Secondly, are you sure it is an order $p$ matrix instead of a vector? – Zhanxiong Nov 14 '22 at 16:04
- […] `glmnet()` code does and not an error in your representation of RR. – Noah Nov 14 '22 at 16:22
- Shouldn't `theDiagonal` be defined as `theDiagonal <- diag(rep(best_lambda, 4))`? $X$, of course, needs to be centered first. – Zhanxiong Nov 14 '22 at 16:26
- `glmnet()` scales its lambda in a particular way. For example, if I divide `best_lambda` by 2.103735, I get approximately the same coefficients when using this value in the manual calculation. Not sure why. – Noah Nov 14 '22 at 16:56
- […] standardizes the variables by default so yeah... `standardize = FALSE`. Second: https://stats.stackexchange.com/questions/126109/coefficient-value-from-glmnet i.e. the betas are further scaled by the size of the dataset. See here: https://stats.stackexchange.com/questions/155362 for further discussion too. In short, welcome to the `glmnet` hole, where all roads lead to reading FORTRAN code. Your code and understanding are correct so (+1). – usεr11852 Nov 14 '22 at 17:01
- […] `cv.glmnet` needs some seed fixing first, otherwise we get a different `best_lambda` given how small our dataset is.) – usεr11852 Nov 14 '22 at 17:19
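Following up on the comments, here is a base-R sketch of one possible reconciliation. The population-SD standardization mirrors what glmnet reportedly does internally, and the `n / sd(y)` rescaling of lambda is an assumption drawn from the linked threads, so only an approximate match should be expected:

```r
# Base-R attempt to approximate glmnet's ridge coefficients.
# Assumption (from the comment thread): glmnet standardizes predictors
# with population (1/n) SDs, and its lambda corresponds roughly to
# lambda * n / sd(y) in the textbook closed form.
y <- mtcars$hp
x <- data.matrix(mtcars[, c("mpg", "wt", "drat", "qsec")])
n <- nrow(x)

sd_pop <- function(v) sqrt(mean((v - mean(v))^2))  # 1/n standard deviation
x_sds  <- apply(x, 2, sd_pop)
x_std  <- scale(x, center = TRUE, scale = x_sds)

lambda  <- 1                       # a lambda on glmnet's scale
lam_eff <- lambda * n / sd_pop(y)  # assumed rescaling; approximate only

# Closed-form ridge on the standardized, centered data ...
beta_std <- solve(crossprod(x_std) + lam_eff * diag(ncol(x))) %*%
  crossprod(x_std, y - mean(y))

# ... then back-transform to the original predictor scale.
beta <- beta_std / x_sds
intercept <- mean(y) - sum(beta * colMeans(x))
```

Comparing `c(intercept, beta)` against `coef(glmnet(x, y, alpha = 0, lambda = lambda))` should then come close but not match exactly; per the comments, the precise internals live in glmnet's FORTRAN code.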