Suppose I have a categorical dataset and I'm doing data pre-processing. What is the correct order for applying these 3 techniques?

  • Train Test split
  • SMOTEN to oversample the minority class
  • Categorical encoding of variables (a mix of one hot, label encoding and target encoding)

Why do you want to oversample the minority class? See https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he – J-J-J Mar 07 '23 at 11:57

1 Answer

While one-hot and label encoding can be applied to the whole dataframe before splitting (e.g. using pandas routines), it's better to split first and build a proper pipeline, so that new data can be fed in without extra manual steps.

Target encoding should always be done after splitting; otherwise the encoded values are computed using test-set targets, which creates a huge target leakage.
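To make the leakage point concrete, here is a minimal sketch of a mean target encoding learned from the training rows only. The toy dataframe and the column names `city` and `y` are made up for illustration, and the plain per-category mean is just the simplest variant (no smoothing), not necessarily what a given library does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: one categorical feature, binary target.
df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "c", "c", "a", "b"],
    "y":    [1, 0, 1, 1, 0, 0, 1, 0],
})

# Split first, so the test targets never influence the encoding.
train, test = train_test_split(df, test_size=0.25, random_state=0)

# Learn the per-category target mean from the training rows only.
global_mean = train["y"].mean()
city_means = train.groupby("city")["y"].mean()

# Apply the learned mapping to both sets.
# Categories unseen in training fall back to the global training mean.
train_enc = train["city"].map(city_means)
test_enc = test["city"].map(city_means).fillna(global_mean)
```

If the means were computed on the full dataframe instead, each test row's own target would contribute to its encoded feature, which is exactly the leakage described above.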

Oversampling should likewise be done after splitting: you want your test set to represent the production data, not the altered data. (Worth noting that resampling is often suboptimal; consider the sources mentioned in the metathread linked in the comments.)

The imblearn SMOTEN implementation expects categorical features only, iirc, so it should come before encoding.

So, the ideal order would be:

  1. Split
  2. SMOTEN
  3. Encoding

Assuming you use sklearn-compatible routines, you should fit the transformers on the train set only and use them to transform both the train and test sets.

dx2-66
  • 361