
What is the best practice for scaling test data that falls outside the scaling range of the training data, particularly when using min-max scaling?

For example, I min-max scale the following to [0,1]:

train_data = [2,4,7,0,12,4,5]
train_data_scaled = [0.16666667, 0.33333333, 0.58333333, 0.        , 1.        , 0.33333333, 0.41666667]

test_data = [2,11,0,14]
test_data_scaled = [0.16666667, 0.91666667, 0.        , 1.16666667]

Note that values exceeding the training maximum map to scaled values > 1. Should you clip such values before (or after) scaling?

Note that I am working specifically with neural networks.
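The setup in the question can be sketched in a few lines of numpy (variable names are illustrative): fit the min/max on the training data only, apply the same transform to the test data, and optionally clip the out-of-range results afterwards.

```python
import numpy as np

train = np.array([2, 4, 7, 0, 12, 4, 5], dtype=float)
test = np.array([2, 11, 0, 14], dtype=float)

# Fit the scaling parameters on the training data only
train_min, train_max = train.min(), train.max()  # 0 and 12

train_scaled = (train - train_min) / (train_max - train_min)
test_scaled = (test - train_min) / (train_max - train_min)
# test_scaled: [0.1667, 0.9167, 0.0, 1.1667] -- the last value exceeds 1

# One option: clip after scaling so the network never sees values outside [0, 1]
test_clipped = np.clip(test_scaled, 0.0, 1.0)
# test_clipped: [0.1667, 0.9167, 0.0, 1.0]
```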

AnarKi
  • Scaling is part of building your model, i.e. when minmax scaling, you minmax scale your test and train data separately. This should then give you values in both data sets that are within the [0,1] interval and are continuous and prevent overfitting. – atirvine88 Apr 16 '20 at 12:20
  • @atirvine88 Thanks for this. I am actually concerned with the "squeezing" of values if I rescale separately, i.e. you are now mapping values closer together that should be further apart. Also, I am not sure this would be better than clipping and then rescaling; I am interested to see results on this. Do you have any idea about that? – AnarKi Apr 16 '20 at 12:27
  • @atirvine88 for e.g. imagine you get some outliers in the test data – AnarKi Apr 16 '20 at 12:28
  • @atirvine88 see https://datascience.stackexchange.com/questions/27615/should-we-apply-normalization-to-test-data-as-well – AnarKi Apr 16 '20 at 12:40
  • Gotcha. You are right: min-max scaling will reduce the variance of a feature and may mask outliers within your data. I'd evaluate both methods (removing outliers/not removing outliers + min-max scaling) using cross-validation on your train data. – atirvine88 Apr 16 '20 at 15:15
  • Agh, let me clarify something: you are right, you may have values outside the zero-to-one interval when scaling your test data. This is fine, although if your training data is representative as a whole, you wouldn't necessarily expect values outside the range in your test data. Still, you do use the same parameters to scale, i.e. the range, that you used on the training data ON the test data. If you decided to remove outliers, you have to make sure that you apply the same process from training to the test data. – atirvine88 Apr 16 '20 at 17:55
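The "same parameters on the test data" advice in the last comment is exactly what scikit-learn's `MinMaxScaler` does when you fit on the training set and only transform the test set. A minimal sketch (clipping is done here with `np.clip` to keep the example version-agnostic; recent scikit-learn versions also accept `clip=True` in the constructor):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Column vectors, as scikit-learn expects 2-D feature arrays
train = np.array([[2.], [4.], [7.], [0.], [12.], [4.], [5.]])
test = np.array([[2.], [11.], [0.], [14.]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)  # learn min/max from the training data only

test_scaled = scaler.transform(test)  # reuses the training min/max; may exceed 1
test_clipped = np.clip(test_scaled, 0.0, 1.0)  # optional: clip out-of-range values
```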

0 Answers