
When we encode categorical variables in a (generalized) linear model, the standard approach is to drop one category so that its effect is subsumed by the intercept (reference-level or dummy coding).

If we include such a variable in a neural network (along with other features), should we avoid dropping a category, so that every category can enter the nonlinear activation functions and the interactions with other features? Or can we get away with encoding all but one category and treating the remaining one as the reference, as we do in a (generalized) linear model?

I have in mind a feed-forward neural network, perhaps even just the input features followed by one hidden layer and an output, but I am happy to see a discussion of more sophisticated architectures such as convolutional, recurrent, LSTM, etc.
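To make the two options concrete, here is a minimal sketch (assuming scikit-learn; the data, category names, and network size are placeholders, not part of any real analysis) that feeds the same one-hidden-layer network a full one-hot design and a drop-first (reference) design; only the input width differs.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
color = rng.choice(["red", "green", "blue"], size=200).reshape(-1, 1)  # categorical feature
x_num = rng.normal(size=(200, 1))                                      # one continuous feature
y = rng.normal(size=200)                                               # placeholder response

# Full one-hot coding: K indicator columns, one per category.
full = OneHotEncoder().fit_transform(color).toarray()

# Reference (dummy) coding: K-1 columns, the dropped category is absorbed by the bias.
dummy = OneHotEncoder(drop="first").fit_transform(color).toarray()

# Same one-hidden-layer network in both cases; only the input width changes.
for name, X_cat in [("full one-hot", full), ("drop-first", dummy)]:
    X = np.hstack([X_cat, x_num])
    net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    net.fit(X, y)
    print(f"{name}: {X.shape[1]} inputs, train R^2 = {net.score(X, y):.3f}")
```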

Dave
  • If you use L2 regularisation (a.k.a. weight decay), dropping a category is unnecessary, and it is recommended to keep the extra category so that each category is shrunk toward zero equally. – seanv507 Oct 30 '22 at 08:29
  • Another option with neural networks is to skip (one-hot) encoding categorical variables altogether and pass them through an embedding layer instead. – dipetkov Oct 30 '22 at 09:53
  • I almost always use embeddings, as suggested by @dipetkov – Michael M Oct 30 '22 at 10:43
  • My answer at https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding/329281#329281 might be relevant – kjetil b halvorsen Oct 31 '22 at 12:17
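Following up on the embedding suggestion in the comments above, here is a minimal sketch (assuming PyTorch; the layer sizes and toy data are placeholders) of passing integer-coded categories through an `nn.Embedding` layer instead of one-hot encoding them:

```python
import torch
import torch.nn as nn

n_categories, emb_dim, n_numeric = 3, 2, 1

class EmbeddingMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(n_categories, emb_dim)   # one learned vector per category
        self.hidden = nn.Linear(emb_dim + n_numeric, 8)  # embedding concatenated with numeric inputs
        self.out = nn.Linear(8, 1)

    def forward(self, cat_idx, x_num):
        # cat_idx: (batch,) integer category codes; x_num: (batch, n_numeric) floats
        h = torch.cat([self.emb(cat_idx), x_num], dim=1)
        return self.out(torch.relu(self.hidden(h)))

model = EmbeddingMLP()
cat_idx = torch.randint(0, n_categories, (5,))  # toy batch of category codes
x_num = torch.randn(5, n_numeric)
print(model(cat_idx, x_num).shape)              # torch.Size([5, 1])
```

Note that with an embedding there is no dropped reference level: every category gets its own learned vector, and regularisation (e.g. weight decay, as mentioned in the first comment) controls their scale.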

0 Answers