Should I apply pd.get_dummies() to both the train and test data? Would that not result in data leakage?

Stephen Rauch

1 Answer

If you call pandas.get_dummies on the train and test data separately, you will likely run into problems: the test dataset may contain values that are not in the training dataset (or lack some that are), so the two encoded frames end up with different columns. It is therefore better to use something like sklearn.preprocessing.OneHotEncoder, which saves its state: fit it on the training dataset, then encode the test dataset based on the values seen during training. Fitting on the training data only also avoids data leakage, because nothing about the test set influences the encoding.

Oxbowerce