I have a data set with 16 multi-level categorical predictors and one response variable. In order to fit a LASSO to it with glmnet, I transformed the categorical variables into dummy variables as specified in this post. What I didn't find an answer to is why the intercept argument was set to FALSE. I know that Group LASSO was formulated to handle categorical data, but the omitted intercept in the linked post has bugged me nonetheless.
My second question is: can Group LASSO handle a mixed data set with categorical and continuous variables?
2 Answers
If you have one-hot encoded them, then for each categorical variable the dummy columns sum to a column of ones, i.e. they are a linear combination that equals the intercept, making the intercept redundant. For example, using one response variable y and a categorical variable called cat:
library(glmnet)

cat = rep(LETTERS[1:3], each = 2)           # 3-level categorical, 2 observations per level
y = rnorm(6, rep(c(10, 20, 30), each = 2))  # response with group means 10, 20, 30
onehot = model.matrix(~ 0 + cat)            # full one-hot encoding, no level dropped
Intercept = rep(1, length(cat))             # explicit intercept column of ones
If we include the intercept, the design matrix looks like this, and one of the dummy variables will be driven to zero because it is not needed:
cbind(Intercept,onehot)
Intercept catA catB catC
1 1 1 0 0
2 1 1 0 0
3 1 0 1 0
4 1 0 1 0
5 1 0 0 1
6 1 0 0 1
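The redundancy is easy to verify directly (a quick check I've added, not part of the original answer): the dummy columns sum to the intercept column, so appending the intercept does not increase the rank of the design matrix.

all(rowSums(onehot) == Intercept)  # TRUE: catA + catB + catC = Intercept
qr(onehot)$rank                    # 3
qr(cbind(Intercept, onehot))$rank  # still 3, not 4: one column is redundant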
We can try it, and you can see that catB is driven to zero:
glmnet(x=onehot,intercept=TRUE,y=y,lambda=seq(0.1,0.9,by=0.1))$beta
3 x 9 sparse Matrix of class "dgCMatrix"
s0 s1 s2 s3 s4 s5 s6
catA -8.640918 -8.782487 -8.923908 -9.065330 -9.206751 -9.348172 -9.489594
catB . . . . . . .
catC 8.638616 8.779963 8.921384 9.062806 9.204227 9.345648 9.487070
s7 s8
catA -9.631015 -9.772437
catB . .
catC 9.628491 9.769912
Hence you set intercept = FALSE to exclude it:
glmnet(x=onehot,intercept=FALSE,y=y,lambda=seq(0.1,0.9,by=0.1))$beta
s0 s1 s2 s3 s4 s5 s6
catA 8.960641 9.102062 9.243484 9.384905 9.526326 9.667748 9.809169
catB 18.874222 19.015644 19.157065 19.298486 19.439908 19.581329 19.722750
catC 28.785694 28.927116 29.068537 29.209958 29.351380 29.492801 29.634223
s7 s8
catA 9.95059 10.09201
catB 19.86417 20.00559
catC 29.77564 29.91707
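As a sanity check (my addition, not from the original answer): with the intercept excluded and no penalty at all, each dummy coefficient is just the mean of y within its level, so the penalized estimates above shrink toward roughly 10, 20 and 30 as lambda decreases.

lm(y ~ 0 + onehot)  # unpenalized fit: coefficients are the per-level means of y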
The above is a simplified example, but in general this applies to most linear regression methods.
This is essentially answered here: Dropping one of the columns when using one-hot encoding. The summary is: the usual way to treat categorical variables in linear regression is to leave out one of the levels. That is not appropriate when using regularization, because it would treat the levels differently. But when all the levels are used in the dummy coding, the intercept is unnecessary (it is the sum of all the level dummies).
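To make the asymmetry concrete, here is a minimal sketch (my addition), reusing the cat variable from the first answer. With reference coding the baseline level is absorbed into the unpenalized intercept while the remaining levels are penalized; with full dummy coding every level gets its own, symmetrically penalized coefficient.

X_ref = model.matrix(~ cat)       # (Intercept), catB, catC: catA is the unpenalized baseline
X_full = model.matrix(~ 0 + cat)  # catA, catB, catC: all levels penalized alike
colnames(X_ref)
colnames(X_full)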
…standardize here is set to TRUE)? and 2/ how do we interpret them when we set standardize to FALSE (because all the one-hot encoded variables here are on the same scale and range)? and 3/ which of the two aforementioned approaches do you advise me to follow when dealing with one-hot encoded categorical-only data, or continuous-only data on the same scale: do we standardize them or not, given that the two approaches give different results in glmnet? Thanks in advance and I really apologize for the annoyance. – Goldman Clarck May 07 '20 at 22:32