
I am running glmnet for the first time and I am getting some weird results.

My dataset has n = 139; p = 70 (correlated variables)

I am trying to estimate the effect of each variable, for both inference and prediction. I am running:

> cvfit = cv.glmnet(X, Y, family = "gaussian", alpha = 0.5, intercept = TRUE, standardize = TRUE, nlambda = 100, type.measure = "mse")

> coef(cvfit, s = "lambda.min")

Of all 70 estimates, two caught my attention:

4           0.5731999

14          5.419356829

What bugs me is the fact that:

> cor(X[,4],Y)

[1,] 0.674714

> cor(X[,14],Y)

[1,] -0.01742419

In addition, if I standardize X myself (using scale(X)) and run it again:

> cvfit = cv.glmnet(scale(X), Y, family = "gaussian", alpha = 0.5, intercept = TRUE, standardize = FALSE, nlambda = 100, type.measure = "mse")

> coef(cvfit, s = "lambda.min")

I now get that variable "4" has the largest effect and variable "14" is about 5 times smaller. I couldn't find a good description of the normalization process in glmnet. Any clue as to why this is happening? (I don't think it's a bug; I would just like to understand why, and which approach is right.)

PS: I ran this many times, so I know it is not an effect of the sampling during the cross-validation.
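For intuition on why reported coefficient sizes change with predictor scaling, here is a minimal base-R sketch using lm rather than glmnet (so no penalty is involved, and the data are made up): a slope fitted on a standardized predictor equals the raw-scale slope multiplied by sd(x).

```r
set.seed(42)
x <- rnorm(100, sd = 5)        # predictor with a large spread
y <- 0.3 * x + rnorm(100)

b_raw <- coef(lm(y ~ x))[2]           # slope in x's natural units
b_std <- coef(lm(y ~ scale(x)))[2]    # slope when x is standardized

# the two slopes differ exactly by the factor sd(x)
all.equal(unname(b_std), unname(b_raw) * sd(x))   # TRUE
```

So a variable with a small standard deviation can have a small standardized coefficient but a large raw-scale one, and vice versa; the ranking of coefficient magnitudes is not scale-invariant.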

Marc
  • It's not quite clear whether this is a question about software or a question about Statistics. – Steve S Dec 01 '14 at 07:32
  • 4
    It is possible to construct examples where $cor(y,x_1)=0$, $cor(y,x_2)=0$, $cor(y,x_3)>0$ and yet $y=a x_1+b x_2$ thus when you regress $y$ on $x_1$, $x_2$, $x_3$, you will have significant coefficients for $x_1$ and $x_2$ only. That is, bivariate correlations are not informative enough when considering a multivariate regression. In turn, this means your case might be just what it is and not a result of an error. – Richard Hardy Dec 01 '14 at 19:35
  • Thank you @RichardHardy and @SteveS. This question does in fact involve elastic net/statistics. However, I perhaps failed to mention the following: when I run the code with standardize = F, variable "14" is set to 0 and variable "4" is the highest one. When I normalize myself:

    X <- scale(RAWdataCovariates)

    and run it again with standardize = F, variable "4" is again the highest one, and variable "14", although non-zero, is about 5 times smaller than variable "4". I couldn't find a good description in the manual of why their standardization differs from standardizing to mean 0 and variance 1. Thanks.

    – Marc Dec 02 '14 at 00:45
  • Thanks @SteveS. After I edited my question, the issue became more of a standardization issue. I later found this post: http://stats.stackexchange.com/questions/33674/why-do-lars-and-glmnet-give-different-solutions-for-the-lasso-problem

    But still could not find a recommendation whether standardize = T is better than x = scale(X) and standardize = F.

    – Marc Dec 02 '14 at 01:07
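Richard Hardy's point above — that a near-zero bivariate correlation with y says little about a variable's multivariate coefficient — is easy to reproduce in base R. This is a hypothetical construction, not the question's data:

```r
set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200)   # x2 is correlated with x1
y  <- x1 - x2           # y equals minus the noise in x2, so it is independent of x1

cor(y, x1)              # near zero: bivariately, y appears to ignore x1
coef(lm(y ~ x1 + x2))   # yet the multivariate fit recovers +1 and -1 exactly
```

So variable "14" having cor(X[,14], Y) ≈ 0 is entirely compatible with it carrying a large coefficient once the other, correlated predictors are in the model.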

4 Answers

12

I tracked down the standardization process of glmnet and documented it on the Thinklab platform. This includes a comparison of the different ways to use standardization with glmnet.

Long story short: if you let glmnet standardize the coefficients (by relying on the default standardize = TRUE), glmnet performs the standardization behind the scenes and reports everything, including the plots, "de-standardized", i.e. in the coefficients' natural metrics.

  • It really bothers me that the plots are unstandardized, it makes them very difficult to interpret. – Matthew Drury Jul 16 '16 at 06:00
  • Hi @AntoineLizée, thank you for your documentation on Thinklab. I have a question: I rescaled the variables myself and set standardize = FALSE. Does that also produce coefficients returned on the original scale? I am confused about whether we need to further standardize those coefficients separately, or if there is a reason for using the unstandardized coefficients? – Michelle Sep 06 '17 at 02:30
  • @Antoine Lizée Could you please explain why you didn't set standardize=FALSE in the R report when you input the standardized data? If you set standardize=FALSE, then the coefficients will not be the same. – Blain Waan Jan 22 '20 at 22:57
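One detail worth knowing when comparing standardize = TRUE against pre-scaling with scale() — this is the crux of the LARS/glmnet thread linked in the question's comments, and is an assumption about glmnet's internals rather than something stated in this answer: glmnet standardizes internally with the population standard deviation (denominator n), whereas scale() uses the sample standard deviation (denominator n − 1). A base-R sketch of the difference:

```r
set.seed(1)
x <- rnorm(50)

sd_sample <- sd(x)                        # denominator n - 1, what scale() uses
sd_pop    <- sqrt(mean((x - mean(x))^2))  # denominator n, what glmnet uses internally

# the two differ by the factor sqrt((n - 1) / n)
all.equal(sd_sample * sqrt(49 / 50), sd_pop)   # TRUE
```

With n = 139 as in the question this factor is tiny, so it explains small numeric discrepancies but not a change in which coefficient dominates.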
6

The package returns transformed coefficients. Line 1074 of the Fortran file glmnet5.f90 contains the back-transformation for the Gaussian family, shown below:

ca(l,k)=ys*ca(l,k)/xs(ia(l))

I believe this transformation may inflate the coefficients of variables with a small standard deviation. If sd(X) differs between the training and testing datasets, I think this may cause a larger MSE.
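The transformation in that Fortran line (coefficient times sd of y, divided by sd of x) can be sketched in base R. This is a toy reconstruction using lm rather than glmnet, so no penalty is involved: fit on standardized x and y, then undo the standardization.

```r
set.seed(7)
x <- rnorm(100, sd = 3)
y <- 2 * x + rnorm(100)

xs <- sqrt(mean((x - mean(x))^2))   # sd of x (denominator n, like the Fortran xs)
ys <- sqrt(mean((y - mean(y))^2))   # sd of y (like the Fortran ys)

# slope fitted on the standardized variables
b_std <- coef(lm(I((y - mean(y)) / ys) ~ I((x - mean(x)) / xs)))[2]

# undo the standardization, as in ca = ys * ca / xs
b_back <- ys * b_std / xs

all.equal(unname(b_back), unname(coef(lm(y ~ x))[2]))   # TRUE
```

The division by xs is where a predictor with a small standard deviation gets its coefficient inflated on the original scale.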

ZFY
1

From the documentation here: https://web.stanford.edu/~hastie/Papers/Glmnet_Vignette.pdf (top of page 8)

It seems that y is also standardized internally. You could rerun your example with a standardized y and see whether the results match.

0

standardize is a flag that tells glmnet whether X should be standardized prior to model fitting. But the result is always returned on the original scale.

So if you want glmnet to standardize your variables for you, you should set standardize = FALSE.

Keyu Nie
  • This answer is rather brief by our standards, do you think you could expand on it somewhat? – Silverfish Apr 15 '16 at 20:52
  • Unless I'm mis-reading the documentation, your last sentence is incorrect. If standardize = F, glmnet doesn't standardize the x; it assumes that it was done prior. – meh Jul 12 '16 at 12:00
  • Well, guess we both should check the documentation again... I was quoting the original documentation actually. – Keyu Nie Aug 30 '16 at 03:44
  • The second sentence in this response is wrong; the documentation states:

    • standardize is a logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE.

    – David Oct 03 '22 at 19:48