I originally asked this question on Stack Overflow, but at this point I'm not sure that's where it belongs, since it relates to the standardization process used by glmnet and the lasso.
I am running glmnet and trying to see the difference when using standardize = FALSE with pre-standardized variables.
library(glmnet)
data(QuickStartExample)
# note: in newer glmnet releases you may need to extract the objects first:
# x <- QuickStartExample$x; y <- QuickStartExample$y

### Standardized Way
x_standardized <- scale(x, center = TRUE, scale = TRUE)
y_standardized <- scale(y, center = TRUE, scale = TRUE)
cv_standardized <- cv.glmnet(x_standardized, y_standardized,
                             intercept = FALSE,
                             standardize = FALSE,
                             standardize.response = FALSE)

# back-transform the slopes to the original scale and compute the intercept by hand
destandardized_coef <- coef(cv_standardized)[-1] * sd(y) / apply(x, 2, sd)
destandardized_coef
mean(y) - sum(destandardized_coef * colMeans(x))
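As a sanity check on the back-transformation (a minimal sketch only, reusing the objects above): the back-transformed coefficients should reproduce the standardized fit's predictions once those are mapped back onto the original y scale, because scale() and sd() both use the n - 1 denominator here.

# Sanity check: fitted values from the back-transformed coefficients should
# match the standardized fit's predictions mapped back to the y scale
intercept_hat <- mean(y) - sum(destandardized_coef * colMeans(x))
fit_original  <- intercept_hat + x %*% destandardized_coef
fit_mapped    <- predict(cv_standardized, newx = x_standardized) * sd(y) + mean(y)
max(abs(fit_original - fit_mapped))  # should be essentially zero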
### Let glmnet Standardize
cv_normal <- cv.glmnet(x, y)
as.numeric(coef(cv_normal, s = cv_normal$lambda.min))
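As an aside (my own note; I believe this is the documented default): coef() on a cv.glmnet object defaults to s = "lambda.1se", so the destandardized coefficients above come from lambda.1se while this extraction uses lambda.min. Pulling both at the same s keeps the comparison like for like.

# coef.cv.glmnet defaults to s = "lambda.1se"; request the same s from both
# fits when comparing them
as.numeric(coef(cv_standardized, s = "lambda.min"))  # still on the standardized scale
as.numeric(coef(cv_normal, s = "lambda.min"))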
In the first approach I standardize the data myself and back-transform the coefficients. I would expect the two approaches to give the same results, but for some reason I am getting slightly different coefficients.
My question is: how can I extract the same results from both approaches, and why are the coefficients currently different?
EDIT: Here is my updated code, based on @user2974951's response; glmnet does appear to standardize differently than R's scale function. A quick check of the two SD conventions is below, followed by the full updated code. I want to note that I fit the initial model without an intercept, since I later calculate it by hand. Ultimately this shouldn't matter: because I provide a standardized x and y, the intercept should be zero.
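A small sketch of the two conventions (nothing glmnet-specific; z is just a random vector for illustration):

# R's sd() divides by n - 1, while glmnet's internal standardization
# reportedly divides by n (per the comments below)
z <- rnorm(10)
sd(z)                                       # n - 1 denominator
sqrt(sum((z - mean(z)) ^ 2) / length(z))    # n denominator
sd(z) * sqrt((length(z) - 1) / length(z))   # same value via a correction factor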
set.seed(123)
library(glmnet)
library(tidyverse)
data(QuickStartExample)

# standardize the data by hand, dividing by n rather than n - 1
n <- length(y)
y_mean <- mean(y)
y_centered <- y - y_mean
y_sd <- sqrt(sum((y - y_mean) ^ 2) / n)
ys <- y_centered / y_sd
X_centered <- apply(x, 2, function(x) x - mean(x))
Xs <- apply(X_centered, 2, function(x) x / sqrt(sum(x ^ 2) / n))

set.seed(123)
cv_standardized <- cv.glmnet(Xs, ys,
                             intercept = FALSE,
                             standardize = FALSE,
                             standardize.response = FALSE)

# back-transform; the ratio sd(y) / sd(x_j) is the same whether the SDs use
# n or n - 1, so sd() and apply(x, 2, sd) are fine here
destandardized_coef <- coef(cv_standardized)[-1] * sd(y) / apply(x, 2, sd)
personal_standardize <- c(mean(y) - sum(destandardized_coef * colMeans(x)),
                          destandardized_coef)

set.seed(123)
cv_normal <- cv.glmnet(x, y, intercept = TRUE)
glmnet_standardize <- coef(cv_normal, s = cv_normal$lambda.min)
Results:
cbind(personal_standardize, glmnet_standardize)
21 x 2 sparse Matrix of class "dgCMatrix"
personal_standardize glmnet_standardize
(Intercept) 0.15473991 0.145832036
V1 1.29441444 1.340981414
V2 . .
V3 0.62890756 0.708347139
V4 . .
V5 -0.77785401 -0.848087765
V6 0.48387954 0.554823781
V7 . 0.038519738
V8 0.28419264 0.347221319
V9 . .
V10 . 0.010034050
V11 0.09130386 0.186365264
V12 . .
V13 . .
V14 -1.03241106 -1.086006902
V15 . -0.004444318
V16 . .
V17 . .
V18 . .
V19 . .
V20 -0.96190123 -1.069942845
Thanks in advance.
Comments:

1) scale uses n-1 when computing the SD, glmnet does not; 2) your first model does not contain an intercept, while the second does. – user2974951 Apr 16 '21 at 08:07

[…] intercept = FALSE. I've edited the results. I see that the standardisation methods are different but am still struggling to get the same results with the correct glmnet standardisation of dividing by n rather than n-1. – AW27 Apr 16 '21 at 08:14

[…] cv.glmnet. Try setting the seed to the same number just before each invocation of that function, and make sure that the cases are in the exact same order both times. Then post your results again. If that fixes your problem, please write it up as an answer to your question, as a service to others facing similar issues. – EdM Apr 16 '21 at 08:45

set.seed doesn't really alleviate the issue. – AW27 Apr 16 '21 at 08:53
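Following up on EdM's comment, here is a minimal sketch of fixing the folds explicitly (my own attempt, reusing Xs and ys from the edited code above; foldid is an existing cv.glmnet argument). This removes the resampling randomness, although each call still builds its own lambda sequence.

# Fix the CV folds so both calls use identical resampling
set.seed(123)
foldid <- sample(rep(1:10, length.out = nrow(x)))  # 10 folds is cv.glmnet's default
cv_a <- cv.glmnet(Xs, ys, foldid = foldid, intercept = FALSE,
                  standardize = FALSE, standardize.response = FALSE)
cv_b <- cv.glmnet(x, y, foldid = foldid)

# back-transform cv_a at the same rule (lambda.min) before comparing
b_std <- coef(cv_a, s = "lambda.min")[-1]
b_raw <- b_std * sd(y) / apply(x, 2, sd)
cbind(c(mean(y) - sum(b_raw * colMeans(x)), b_raw),
      as.numeric(coef(cv_b, s = "lambda.min")))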