Let's say I have a dataset that hasn't been split into train/test yet.
Upon loading it, I discover columns with nulls that need to be filled in, some quadratic relationships that mean squaring the independent variables for the model, binary strings that need to be one-hot encoded, and a column that needs to be StandardScaler()'ed.
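For concreteness, here is a toy frame with all four issues (the column names and values are invented, not my real data):

```python
import numpy as np
import pandas as pd

# Toy version of the dataset I'm describing:
# "age" has nulls to fill, "speed" has a quadratic relationship to the
# target, "smoker" is a binary string to one-hot encode, and "income"
# is the column that needs StandardScaler().
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan, 52.0],
    "speed": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "income": [30_000.0, 52_000.0, 41_000.0, 99_000.0, 63_000.0, 75_000.0],
})
df["target"] = 3.0 * df["speed"] ** 2 + 5.0  # quadratic in "speed"

print(df["age"].isna().sum())  # 2 missing values to impute
```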
Are you supposed to do all these preprocessing tasks before or after the train/test split?
If you answered before (aka "preprocessing applies to the training set only"):
How can you apply these to the train set only without the model erroring when you apply it to the test set? (A mismatched number of columns after one-hot encoding is a glaring one that comes to mind.)
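Here is a minimal sketch of the column-mismatch problem I mean, one-hot encoding each split separately with `pd.get_dummies` (toy data, invented column names):

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["red", "red"]})  # never contains blue/green

# Encoding each split independently yields different column sets.
X_train = pd.get_dummies(train, columns=["color"])
X_test = pd.get_dummies(test, columns=["color"])

print(list(X_train.columns))  # ['color_blue', 'color_green', 'color_red']
print(list(X_test.columns))   # ['color_red'] (fewer columns than the model saw)
```

A model fit on the three train columns would then error when handed the one-column test matrix.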
In addition, when you deploy the model to production, I'm assuming the live data will have all the same issues (nulls, no squared variable, raw binary strings instead of one-hot columns, unscaled values) that you dealt with in the train dataset, so wouldn't the model error if you never preprocess anything but the training set?
If you answered after (aka "preprocessing applies to the training and test set"):
Isn't there a data leakage issue if you preprocess the test data (e.g. filling in NAs with a mean calculated over a dataset that includes the test set)?
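A minimal sketch of the leakage I'm worried about, with made-up numbers:

```python
import numpy as np

train_vals = np.array([1.0, 2.0, np.nan, 4.0])
test_vals = np.array([100.0, np.nan])  # test distribution happens to differ

# Leaky: fill value computed over train AND test together.
leaky_mean = np.nanmean(np.concatenate([train_vals, test_vals]))

# Train-only: fill value computed on the training split alone.
train_mean = np.nanmean(train_vals)

print(leaky_mean)  # 26.75 (the test values pulled the fill value way up)
print(train_mean)  # 2.333...
```

The imputed test rows would then carry information about the test set itself, which is the leakage I mean.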
How would you preprocess the data live?
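For reference, here is what I imagine the "fit on train only, transform everywhere" pattern looks like with scikit-learn (toy data, invented column names). Is this the idea?

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "smoker": ["yes", "no", "no", "yes"],
    "income": [30_000.0, 52_000.0, 41_000.0, 99_000.0],
})

pre = ColumnTransformer([
    ("age", SimpleImputer(strategy="mean"), ["age"]),
    # handle_unknown="ignore" keeps unseen categories from erroring later
    ("smoker", OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
    ("income", StandardScaler(), ["income"]),
])

# Fit the statistics (means, category lists, scales) on the TRAIN split only...
pre.fit(train)

# ...then apply the same fitted transformer to test or live rows.
# In production you'd persist `pre` (e.g. with joblib) and load it in the
# serving process so live data gets identical preprocessing.
live_row = pd.DataFrame({"age": [np.nan], "smoker": ["no"], "income": [60_000.0]})
print(pre.transform(live_row))  # null filled with the TRAIN mean, same column count
```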