I have a data set with 16 multi-level categorical predictors and one response variable. In order to fit a LASSO to it with glmnet, I transformed the categorical variables into dummy variables as specified in this post. What I didn't find an answer to is why the intercept argument was set to FALSE. I know that Group LASSO was formulated to handle categorical data, but the omitted intercept in the linked post has bugged me nonetheless.
My second question is: can Group LASSO handle a mixed data set with categorical and continuous variables?
2 Answers
If you have one-hot encoded them, then for each categorical variable the dummy columns sum to a column of ones, i.e. they are a linear combination that equals the intercept, making the intercept redundant. For example, using one response variable y and a categorical variable called cat:
library(glmnet)

cat = rep(LETTERS[1:3], each = 2)           # 3-level categorical, 2 observations per level
y = rnorm(6, rep(c(10, 20, 30), each = 2))  # response with group means 10, 20, 30
onehot = model.matrix(~ 0 + cat)            # full one-hot encoding, no level dropped
Intercept = rep(1, length(cat))             # explicit intercept column of ones
If we include the intercept, the design matrix looks like this, and one of the dummy variables will be driven to zero because it is not needed:
cbind(Intercept,onehot)
Intercept catA catB catC
1 1 1 0 0
2 1 1 0 0
3 1 0 1 0
4 1 0 1 0
5 1 0 0 1
6 1 0 0 1
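The redundancy is easy to verify directly (a quick check I've added, not part of the original answer): the dummy columns sum to the intercept column, so appending the intercept does not increase the rank of the design matrix.

all(rowSums(onehot) == Intercept)  # TRUE: catA + catB + catC = Intercept
qr(onehot)$rank                    # 3
qr(cbind(Intercept, onehot))$rank  # still 3, not 4: one column is redundant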
We can try it, and you can see that catB is driven to zero:
glmnet(x=onehot,intercept=TRUE,y=y,lambda=seq(0.1,0.9,by=0.1))$beta
3 x 9 sparse Matrix of class "dgCMatrix"
s0 s1 s2 s3 s4 s5 s6
catA -8.640918 -8.782487 -8.923908 -9.065330 -9.206751 -9.348172 -9.489594
catB . . . . . . .
catC 8.638616 8.779963 8.921384 9.062806 9.204227 9.345648 9.487070
s7 s8
catA -9.631015 -9.772437
catB . .
catC 9.628491 9.769912
Hence you set intercept = FALSE to exclude it:
glmnet(x=onehot,intercept=FALSE,y=y,lambda=seq(0.1,0.9,by=0.1))$beta
s0 s1 s2 s3 s4 s5 s6
catA 8.960641 9.102062 9.243484 9.384905 9.526326 9.667748 9.809169
catB 18.874222 19.015644 19.157065 19.298486 19.439908 19.581329 19.722750
catC 28.785694 28.927116 29.068537 29.209958 29.351380 29.492801 29.634223
s7 s8
catA 9.95059 10.09201
catB 19.86417 20.00559
catC 29.77564 29.91707
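As a sanity check (my addition, not from the original answer): with the intercept excluded and no penalty at all, each dummy coefficient is just the mean of y within its level, so the penalized estimates above shrink toward roughly 10, 20 and 30 as lambda decreases.

lm(y ~ 0 + onehot)  # unpenalized fit: coefficients are the per-level means of y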
The above is a simplified example, but in general this applies to most linear regression methods.
This is essentially answered here: Dropping one of the columns when using one-hot encoding. The summary is: the usual way to treat categorical variables in linear regression is to leave out one of the levels. That is not appropriate when using regularization, because it would treat the levels differently. But when all the levels are used in the dummy coding, the intercept is unnecessary (it is the sum of all the level dummies).
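To make the asymmetry concrete, here is a minimal sketch (my addition), reusing the cat variable from the first answer. With reference coding the baseline level is absorbed into the unpenalized intercept while the remaining levels are penalized; with full dummy coding every level gets its own, symmetrically penalized coefficient.

X_ref = model.matrix(~ cat)       # (Intercept), catB, catC: catA is the unpenalized baseline
X_full = model.matrix(~ 0 + cat)  # catA, catB, catC: all levels penalized alike
colnames(X_ref)
colnames(X_full)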
…standardize here is set to TRUE)? and 2/ how do we interpret them when we set standardize to FALSE (because all the one-hot encoded variables here are on the same scale and range)? and 3/ which of the two aforementioned approaches do you advise me to follow when dealing with one-hot encoded categorical-only data, or continuous-only data on the same scale: do we standardize them or not, given that the two approaches give different results in glmnet? Thanks in advance and I really apologize for the annoyance. – Goldman Clarck May 07 '20 at 22:32