I was wondering: in the feature scaling part of data preprocessing, why is the data in the testing set standardized using the fit values (the mean and standard deviation) from the training set? Why aren't those values recalculated separately on the testing set and then used to transform it? Here is the code I am using as reference:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])  # learn the mean/std of the training columns and scale them
x_test[:, 3:] = sc.transform(x_test[:, 3:])        # reuse the training mean/std to scale the same test columns

In this example, x_train is the training set and x_test is the testing set.

Thanks!

dkapur17
  • Regarding x_test[:, 3:] = sc.transform(x_test[:, 3:]): in this line of your code, the testing data is NOT being fit using the training data; sc is the StandardScaler being applied to the test data. Are you referencing some other code? – BlackCurrant May 12 '20 at 22:55
  • @BlackCurrant, aren't the scaling parameters calculated in line 3 and then just used to transform the testing set as well? If not, what is the transformation of the testing set based on? – dkapur17 May 12 '20 at 23:00
  • I think the answer is here: https://stackoverflow.com/questions/48692500/fit-transform-on-training-data-and-transform-on-test-data – BlackCurrant May 12 '20 at 23:53

1 Answer

You could do that. Assuming your training and test data come from the same distribution, which is what you assume anyway, they should have a similar mean and standard deviation, so in theory the result would be similar. By the same token, though, why would you refit the scaling on the test data?

In practice, the mean/variance estimates from the (presumably smaller) test set are less accurate. You would be transforming the test data differently from the training data, and the model may not perform as well on it.
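
As a quick sketch of this point (synthetic data and arbitrary split sizes, purely for illustration): even when both splits come from the same distribution, the statistics a StandardScaler learns from a small test split drift noticeably from those learned on the larger training split.

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(1000, 2))  # one common underlying distribution
x_train, x_test = X[:900], X[900:]                # large training split, small test split

sc_train = StandardScaler().fit(x_train)
sc_test = StandardScaler().fit(x_test)            # the "refit on the test set" alternative

print(sc_train.mean_, sc_train.scale_)            # close to (50, 10) for each column
print(sc_test.mean_, sc_test.scale_)              # noticeably different estimates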

Taking this to its logical conclusion, how would you then predict for a single future instance? You can't scale one instance based on statistics from that one instance; it has no variance.
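
For instance (a toy training set and one made-up future row, just to illustrate): the scaler fitted on the training data can transform a single new row without any trouble, whereas fitting a scaler on that row alone produces a zero-variance, all-zero result.

import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.array([[40.0, 1.0], [50.0, 2.0], [60.0, 3.0]])  # tiny toy training set
sc = StandardScaler().fit(x_train)

new_row = np.array([[52.0, 2.5]])                 # one hypothetical future instance
print(sc.transform(new_row))                      # fine: reuses the training mean_/scale_
print(StandardScaler().fit_transform(new_row))    # [[0. 0.]] -- a single row has no variance to scale by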

The principle here is to fit a model on training data and then apply that fixed model, of which scaling may be a part, to future data. So there is no particular upside to refitting the scaling on the test data, and some potential downside.
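
One common way to make that explicit in scikit-learn is to treat the scaler as part of the model by putting both into a Pipeline, so the scaling statistics can only ever come from the data the pipeline is fit on. A rough sketch (synthetic data; LogisticRegression is just an arbitrary choice of estimator here):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 3))
x_test = rng.normal(size=(20, 3))
y_train = (x_train[:, 0] > 0).astype(int)         # toy binary target

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(x_train, y_train)     # the scaler's mean/std are learned from x_train only
preds = model.predict(x_test)   # x_test (or a single future row) is only transformed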

Sean Owen