Let's say I have a dataset that hasn't been split into train/test yet.
Upon loading it, I discover columns with nulls that need to be filled in, some quadratic relationships that mean squaring the independent variables for the model, binary strings that need to be one-hot encoded, and a column that needs to be StandardScaler()'ed.
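For concreteness, here is a toy frame with all four issues (the column names and values are invented, not my real data):

```python
import numpy as np
import pandas as pd

# Toy version of the dataset I'm describing:
# "age" has nulls to fill, "speed" has a quadratic relationship to the
# target, "smoker" is a binary string to one-hot encode, and "income"
# is the column that needs StandardScaler().
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan, 52.0],
    "speed": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "income": [30_000.0, 52_000.0, 41_000.0, 99_000.0, 63_000.0, 75_000.0],
})
df["target"] = 3.0 * df["speed"] ** 2 + 5.0  # quadratic in "speed"

print(df["age"].isna().sum())  # 2 missing values to impute
```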
Are you supposed to do all these preprocessing tasks before or after the train/test split?
If you answered before (aka "preprocessing applies to the training set only"):
How can you apply these to the train set only without the model erroring when you apply it to the test set? (A mismatched number of columns after one-hot encoding is a glaring one that comes to mind.)
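Here is a minimal sketch of the column-mismatch problem I mean, one-hot encoding each split separately with `pd.get_dummies` (toy data, invented column names):

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["red", "red"]})  # never contains blue/green

# Encoding each split independently yields different column sets.
X_train = pd.get_dummies(train, columns=["color"])
X_test = pd.get_dummies(test, columns=["color"])

print(list(X_train.columns))  # ['color_blue', 'color_green', 'color_red']
print(list(X_test.columns))   # ['color_red'] (fewer columns than the model saw)
```

A model fit on the three train columns would then error when handed the one-column test matrix.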
In addition, when you deploy the model to production, I'm assuming the live data will have all the same issues (nulls, no squared variable, raw binary strings instead of one-hot columns, unscaled values) that you dealt with in the train dataset, so wouldn't the model error if you never preprocess anything but the training set?
If you answered after (aka "preprocessing applies to the training and test set"):
Isn't there a data leakage issue if you preprocess the test data (e.g. filling in NAs with a mean calculated over a dataset that includes the test set)?
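A minimal sketch of the leakage I'm worried about, with made-up numbers:

```python
import numpy as np

train_vals = np.array([1.0, 2.0, np.nan, 4.0])
test_vals = np.array([100.0, np.nan])  # test distribution happens to differ

# Leaky: fill value computed over train AND test together.
leaky_mean = np.nanmean(np.concatenate([train_vals, test_vals]))

# Train-only: fill value computed on the training split alone.
train_mean = np.nanmean(train_vals)

print(leaky_mean)  # 26.75 (the test values pulled the fill value way up)
print(train_mean)  # 2.333...
```

The imputed test rows would then carry information about the test set itself, which is the leakage I mean.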
How would you preprocess the data live?
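For reference, here is what I imagine the "fit on train only, transform everywhere" pattern looks like with scikit-learn (toy data, invented column names). Is this the idea?

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "smoker": ["yes", "no", "no", "yes"],
    "income": [30_000.0, 52_000.0, 41_000.0, 99_000.0],
})

pre = ColumnTransformer([
    ("age", SimpleImputer(strategy="mean"), ["age"]),
    # handle_unknown="ignore" keeps unseen categories from erroring later
    ("smoker", OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
    ("income", StandardScaler(), ["income"]),
])

# Fit the statistics (means, category lists, scales) on the TRAIN split only...
pre.fit(train)

# ...then apply the same fitted transformer to test or live rows.
# In production you'd persist `pre` (e.g. with joblib) and load it in the
# serving process so live data gets identical preprocessing.
live_row = pd.DataFrame({"age": [np.nan], "smoker": ["no"], "income": [60_000.0]})
print(pre.transform(live_row))  # null filled with the TRAIN mean, same column count
```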