
In the context of feature selection, it is common to recode categorical variables with more than two categories into dummies. Selection methods such as the elastic net or lasso regression select the best predictors, and it is possible that only some of the dummies belonging to a categorical variable are selected. I am wondering whether this procedure can cause problems. I found some comments on Quora and a tutorial stating that the procedure should be used carefully but that there are no general problems. However, I was not able to find any detailed literature or well-founded guidelines to follow.

Question: Can any problems arise if not all dummies of a categorical variable are selected for a model?

For example, I could imagine that the automatic selection depends on the ordering of the categories and hence on the resulting reference category. Say a variable has categories A, B, and C. Dummy coding into dummyB and dummyC (A as reference) would probably lead to a different variable selection than dummy coding into dummyA and dummyB (C as reference).
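To make this concrete, here is a small sketch using pandas and scikit-learn (the effect sizes, the penalty `alpha=0.1`, and the `selected` helper are illustrative assumptions, not from the question). The true effects of B and C are made identical, so which dummies survive the lasso depends on which category serves as the reference:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
cat = rng.choice(list("ABC"), size=n)
# assumed true effects: A = 0, B = 1, C = 1 (B and C act identically)
y = pd.Series(cat).map({"A": 0.0, "B": 1.0, "C": 1.0}).to_numpy()
y += rng.normal(scale=0.5, size=n)

def selected(reference):
    # drop the reference category and keep the two remaining dummies
    dummies = pd.get_dummies(cat).drop(columns=reference).astype(float)
    fit = Lasso(alpha=0.1).fit(dummies, y)
    return {c for c, coef in zip(dummies.columns, fit.coef_) if coef != 0.0}

print(selected("A"))  # A as reference: both remaining dummies tend to be selected
print(selected("C"))  # C as reference: the B-vs-C contrast is ~0, so dummyB may be dropped
```

With A as reference, both dummies carry a nonzero contrast and tend to be kept; with C as reference, the B dummy estimates a near-zero contrast and tends to be dropped, so the reported "selected variables" differ even though the data are identical.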

Any advice or literature is highly appreciated!

UPDATE:

Based on Ben's comment, I found some literature comparing the lasso and the group lasso, which addresses my question:

http://pages.stat.wisc.edu/~myuan/papers/glasso.final.pdf

http://people.ee.duke.edu/~lcarin/lukas-sara-peter.pdf

However, this literature raised two further questions:

1) It seems like the plain lasso is still used regularly, while the group lasso appears far less often in the current literature. Is there a specific reason for that?

2) When I have categorical variables with many categories, isn't it a problem if the whole categorical variable is selected? Or, in other words, is it sometimes advantageous to use the lasso instead of the group lasso?
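For intuition on the difference, here is a minimal numpy sketch of a group lasso fitted by proximal gradient descent (the data, the penalty strength, and the `group_lasso` helper are all illustrative assumptions). The block soft-thresholding step zeroes or keeps a categorical variable's dummies together, whereas the plain lasso thresholds each dummy individually:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
# two 3-level categorical features, one-hot encoded:
# group 1 (columns 0-2) affects y, group 2 (columns 3-5) is pure noise
X = np.zeros((n, 6))
X[np.arange(n), rng.integers(3, size=n)] = 1.0
X[np.arange(n), 3 + rng.integers(3, size=n)] = 1.0
y = X[:, :3] @ np.array([0.0, 1.0, 2.0]) + rng.normal(scale=0.5, size=n)
y -= y.mean()  # center y so no intercept term is needed
groups = [np.arange(0, 3), np.arange(3, 6)]

def group_lasso(X, y, lam, n_iter=3000):
    """Proximal gradient for (1/2n)||y - Xb||^2 + lam * sum_g ||b_g||_2."""
    n = len(y)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y) / n)
        for g in groups:
            norm = np.linalg.norm(z[g])
            # block soft-thresholding: a group is zeroed or shrunk as a whole
            z[g] = 0.0 if norm <= step * lam else z[g] * (1 - step * lam / norm)
        b = z
    return b

b = group_lasso(X, y, lam=0.2)
print(np.round(b, 2))  # the noise group (last three coefficients) is dropped as a block
```

The irrelevant categorical variable is removed entirely, which is the "measurement sparsity" point from the comments: either you need to measure the feature (all its levels) or you don't. The flip side, relevant to question 2, is that a variable with many levels is also kept as a whole, even if only one or two of its contrasts matter.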

  • this is the original motivation for the group lasso: http://pages.stat.wisc.edu/~myuan/papers/glasso.final.pdf – user795305 Apr 11 '17 at 17:59
  • Thank you, I just read the paper and it addresses my question very well! However, I am still wondering about two things: 1) It seems like the plain lasso is still used regularly, while the group lasso appears less often in the current literature. Is there a specific reason for that? 2) When I have categorical variables with many categories, isn't it a problem if the whole categorical variable is selected? Or, in other words, is it sometimes advantageous to use the lasso instead of the group lasso? – Joachim Schork Apr 12 '17 at 07:11
  • Hey, no problem. 1) I'm not sure it's true that the group lasso isn't used much. All the same, maybe one reason it isn't used as often as it should be is that it is sometimes hard to specify the groups perfectly. Group specification is easy when you're grouping the levels of a categorical feature together, but it can get difficult in other situations. – user795305 Apr 15 '17 at 18:15
  • These lasso-type methods promote sparsity. One type of sparsity we might want is a kind of "measurement sparsity", so that we don't have to measure too many features. In that setting, the group lasso is more natural than the plain lasso when applied to a categorical feature. It varies, though, depending on what you're after. – user795305 Apr 15 '17 at 18:15