I am running glmnet for the first time and I am getting some weird results.
My dataset has n = 139 and p = 70 (the predictors are correlated).
I am trying to estimate the effect of each variable, for both inference and prediction. I am running:
> cvfit <- cv.glmnet(X, Y, family = "gaussian", alpha = 0.5, intercept = TRUE, standardize = TRUE, nlambda = 100, type.measure = "mse")
> coef(cvfit, s = "lambda.min")
Of the 70 estimates, two caught my attention:
 4    0.5731999
14    5.419356829
What bugs me is the fact that:
> cor(X[,4],Y)
[1,] 0.674714
> cor(X[,14],Y)
[1,] -0.01742419
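For context, here is a toy simulation (entirely made-up data, not my dataset) that reproduces the kind of pattern I mean: a predictor with near-zero marginal correlation with y can still get a large coefficient in a joint fit when it has a strongly correlated companion:

## Toy simulation (made-up numbers, not my data): x2 is nearly
## uncorrelated with y marginally, yet has a large partial effect.
set.seed(1)
n  <- 139
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)     # x2 strongly correlated with x1
y  <- x1 - x2 + rnorm(n, sd = 0.1)
cor(x2, y)                        # approximately 0
coef(lm(y ~ x1 + x2))             # x2's coefficient is near -1

So I get that this can happen in principle with correlated predictors; what I don't get is the standardization behavior below.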
In addition, if I standardize X myself (using scale(X)) and run it again:
> cvfit <- cv.glmnet(scale(X), Y, family = "gaussian", alpha = 0.5, intercept = TRUE, standardize = FALSE, nlambda = 100, type.measure = "mse")
> coef(cvfit, s = "lambda.min")
I now get that variable 4 has the largest effect and variable 14's coefficient is about 5 times smaller. I couldn't find a good description of the normalization process in glmnet. Any clue as to why this is happening? (I don't think it's a bug; I just would like to understand why, and which one is right.)
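To compare the two runs on a common scale, I also tried the sketch below. As far as I understand, glmnet reports coefficients on the original x scale even with standardize = TRUE, and sd()/scale() use the 1/(n-1) standard deviation, which may not match glmnet's internal scaling exactly, so this is only approximate:

## Rough attempt to put the standardize = TRUE coefficients on the
## same scale as the scale(X) fit: multiply each original-scale
## coefficient by its column's SD. Note sd() uses 1/(n-1), which may
## differ slightly from glmnet's internal scaling.
sds    <- apply(X, 2, sd)
b_orig <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]  # drop the intercept
b_std  <- b_orig * sds   # roughly comparable to the scale(X) fit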
PS: I ran this many times, so I know it is not an effect of the random fold assignment during cross-validation; a sketch of how I fixed the folds is below.
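This is how I fixed the folds (the seed and the 10-fold count are arbitrary choices on my part):

## Fixing the CV folds so repeated runs are deterministic
## (seed and 10 folds are arbitrary choices):
set.seed(42)
foldid <- sample(rep(1:10, length.out = nrow(X)))
cvfit  <- cv.glmnet(X, Y, family = "gaussian", alpha = 0.5,
                    standardize = TRUE, foldid = foldid)
coef(cvfit, s = "lambda.min")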
…and when I run it again with standardize = F, variable 4 is again the highest one, and variable 14, although non-zero, is about 5 times smaller than variable 4's. I couldn't find a good description in the manual of why their standardization is different from standardizing to mean 0 and variance 1. Thanks. – Marc Dec 02 '14 at 00:45
But I still could not find a recommendation on whether standardize = T is better than x = scale(X) with standardize = F. – Marc Dec 02 '14 at 01:07