Should I encode the categorical data before making a training validation split?

Question

I am looking at some examples on Kaggle and I'm not sure what the correct approach is. If I split the data into training and validation sets and fit the encoding only on the training set, some unique values are sometimes left behind, and I'm not sure whether that is correct.

score 0 · Accepted Answer · answered Apr 16 '22 at 06:10

Yes, encode the data before the split. The point of the split is to approximate two i.i.d. samples from the data-generating process. Encoding simply represents the same data in a different form, so doing it first means both parts share the same set of categories.
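As a rough illustration of the point above, here is a minimal sketch in plain Python. The toy `colors` data, the `one_hot` helper, and the 75/25 split ratio are all made up for the example and are not taken from any particular Kaggle notebook. It builds the category list from the full dataset, encodes, and only then splits, so a rare value cannot be left out of the encoding.

```python
import random

# Hypothetical toy data: one categorical feature with a rare value ("green").
colors = ["red", "blue", "red", "blue", "red", "green", "blue", "red"]


def one_hot(values, categories):
    """Map each value to a 0/1 vector with one slot per known category."""
    return [[1 if v == c else 0 for c in categories] for v in values]


# Encode BEFORE the split: the category list comes from the full dataset.
categories = sorted(set(colors))
encoded = one_hot(colors, categories)

# Shuffle the row indices, then split 75% / 25% into training and validation.
rng = random.Random(0)  # fixed seed so the split is reproducible
indices = list(range(len(colors)))
rng.shuffle(indices)
cut = int(0.75 * len(indices))
train_rows = [encoded[i] for i in indices[:cut]]
valid_rows = [encoded[i] for i in indices[cut:]]

# Both parts have the same columns, whichever part "green" lands in.
print(categories)
print(len(train_rows[0]), len(valid_rows[0]))
```

In practice you would probably reach for something like `pandas.get_dummies` or scikit-learn's `OneHotEncoder` rather than a hand-written helper, but the order of operations would be the same.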

Adam
