I am looking at some examples on Kaggle and I'm not sure what the correct approach is. If I split the data into training and validation sets and encode the categorical data using only the training part, some unique values are sometimes left out of the encoding, and I'm not sure whether that is correct.

parse5214

1 Answer

Yes, encode the data before the split. The point of the split is to obtain two i.i.d. samples from the same data-generating process. Encoding only changes how the data is represented, so doing it first ensures that every category value gets a code and that both samples use the same mapping.
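As a minimal sketch of encoding before splitting, assuming pandas and one-hot encoding via `pd.get_dummies` (the toy DataFrame, the `color` and `target` column names, and the positional split are all made up for illustration; a real split would normally be random):

```python
import pandas as pd

# Hypothetical toy data: "green" is a rare category that appears only once.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "target": [1, 0, 1, 0, 0, 1],
})

# Encode the full data set first, so every category gets its own column.
encoded = pd.get_dummies(df, columns=["color"])

# Then split. A positional split keeps this sketch deterministic.
train = encoded.iloc[:3]
valid = encoded.iloc[3:]

# Both parts share the same columns, including the rare category.
same_columns = list(train.columns) == list(valid.columns)

# For contrast: encoding only the training rows never sees "green".
train_only = pd.get_dummies(df.iloc[:3], columns=["color"])
missing = sorted(set(encoded.columns) - set(train_only.columns))

print(same_columns)
print(missing)
```

Here `missing` lists the columns that would be absent if the encoding were built from the training rows alone, which is the situation described in the question.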

Adam