Order of pre-processing the dataset

Question

Suppose I have a categorical dataset and I'm doing data pre-processing. What is the correct order for applying these 3 techniques?

Train Test split
SMOTEN to oversample the minority class
Categorical encoding of variables (a mix of one hot, label encoding and target encoding)

Why do you want to oversample the minority class? See https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he — J-J-J, Mar 07 '23 at 11:57

score 0 · Accepted Answer · answered Mar 07 '23 at 11:50

While one-hot and label encoding can be applied to the whole dataframe before splitting (e.g. with pandas routines), it's better to split first and build a proper pipeline, which lets you feed in new data without extra manual steps.

Target encoding should always be done after splitting; otherwise the encoded values are computed from test-set targets, which creates a huge target leakage.
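To make the leakage point concrete, here is a minimal sketch of mean target encoding fitted on the train split only. The helper names (`fit_target_encoding`, `apply_target_encoding`) and the toy data are made up for illustration, and the fallback to the training global mean for unseen categories is one common choice, not the only one.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Learn per-category target means from training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    global_mean = sum(targets) / len(targets)
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return means, global_mean


def apply_target_encoding(categories, means, global_mean):
    """Encode categories; unseen ones fall back to the training global mean."""
    return [means.get(cat, global_mean) for cat in categories]


# Hypothetical toy data that has already been split.
train_cats, train_y = ["a", "a", "b", "b", "b"], [1, 0, 1, 1, 0]
test_cats = ["a", "b", "c"]  # "c" never appears in the train set

# The mapping is learned from train targets only; test targets are never read.
means, global_mean = fit_target_encoding(train_cats, train_y)
train_encoded = apply_target_encoding(train_cats, means, global_mean)
test_encoded = apply_target_encoding(test_cats, means, global_mean)
```

If the mapping were fitted before the split, the test rows' targets would contribute to `means`, and the test score would look better than it should.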

The same applies to augmentation: you want your test set to represent the production data, not the altered data. (Worth noting that resampling of this kind is often suboptimal; consider the sources mentioned in the metathread.)

The imblearn SMOTEN implementation expects categorical features only, iirc, so it should come before encoding.

So, the ideal order would be:

Split
SMOTEN
Encoding

Assuming you use sklearn-compatible routines, you should fit the transformers on the train set only and use them to transform both the train and test sets.
