
I am teaching myself machine learning right now, and I am confused about what I should do first.

  1. Should I impute missing values before encoding the categorical variables?
  2. Also, I am learning from Kaggle, and the notebooks there always split the data into train and test sets before doing any feature engineering. What is the reason behind this? Can I do the feature engineering on the entire dataset instead?
  3. When should I perform cross-validation? Before splitting the data?

I also hope to understand the reasoning behind all of these decisions, because I don't want to just memorize them. It is difficult to learn such a complex topic on my own.

  • Similar Qs with As: https://stats.stackexchange.com/questions/499228/what-is-the-correct-order-in-a-machine-learning-model-pipeline, https://stats.stackexchange.com/questions/95083/imputation-before-or-after-splitting-into-train-and-test, https://stats.stackexchange.com/questions/440372/feature-selection-before-or-after-encoding, – kjetil b halvorsen Jul 20 '21 at 01:41
  • Note that data splitting is typically a bad idea unless n > 20,000. – Frank Harrell Jul 31 '22 at 12:13
  • @FrankHarrell Do you mean that one should not split the dataset into train and test set before doing any feature engineering, unless n > 20000? If so, why? – Ganesh Tata Jul 31 '22 at 18:34
  • I meant that data splitting is an enormously wasteful statistical procedure, giving unstable results unless the true signal:noise ratio is very high (outcomes are easy to predict) or n > 20,000. Details here. What is your sample size and distribution of Y? Most often resampling (100 repeats of 10-fold CV or 400 bootstrap reps) is more efficient than data splitting and also exposes the silliness of feature selection. – Frank Harrell Jul 31 '22 at 20:17
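As a concrete illustration of the resampling idea in the last comment, here is a minimal sketch using scikit-learn's `RepeatedStratifiedKFold`; the dataset, model, and exact settings are my own assumptions, not anything from the thread:

```python
# Minimal sketch of "100 repeats of 10-fold CV" instead of a single
# train/test split. The dataset and model are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 10 folds repeated 100 times = 1000 model fits; lower n_repeats for a quick run.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)

# The spread across repeats shows how unstable any single split would have been.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```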

1 Answer


Most of the time, imputing missing values applies to numeric features and has nothing to do with encoding, which is for categorical data. So dealing with missing values before encoding seems like a good choice.
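For example, one common way to enforce this order, and to address the splitting question above at the same time, is to put imputation and encoding inside a scikit-learn pipeline that is fit on the training split only, so no statistics leak from the test set. This is a minimal sketch with made-up columns, not code from the question:

```python
# Minimal sketch of the impute-then-encode order inside a pipeline,
# fit on the training split only to avoid leakage. The column names
# and toy data are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan, 50],          # numeric, with gaps
    "city": ["NY", "LA", np.nan, "NY", "LA", "SF"],   # categorical, with gaps
    "target": [0, 1, 0, 1, 0, 1],
})

X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "city"]], df["target"], test_size=0.33, random_state=0
)

preprocess = ColumnTransformer([
    # Numeric column: fill gaps with the median.
    ("num", SimpleImputer(strategy="median"), ["age"]),
    # Categorical column: impute the most frequent value first, then
    # one-hot encode, so the encoder never sees missing values.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])

# fit_transform learns imputation/encoding statistics from the train set only;
# transform reuses those statistics on the test set.
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)
print(X_train_prep.shape, X_test_prep.shape)
```

Because `fit_transform` is called only on `X_train`, the imputation medians, most-frequent categories, and one-hot vocabulary all come from the training data alone, which is the usual reason Kaggle notebooks split before doing any feature engineering.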