
Bottom line up front: is there any reason not to center and scale continuous variables prior to model fitting for the sake of model comparison?

I'm conducting a model comparison on a large data set (80,000 instances × 300 attributes), and I'm looking at predicting 250 different response values. If I compare 5 inducers (say, cubist, boosted trees, random forest, MARS, and kNN), I'm already looking at 1,250 model fits without doing any parameter tuning (and counting the ensemble methods as a single fit each). Although this is just the exploratory phase, I know that some models are sensitive to centering and scaling (like kNN) and others aren't. Am I conducting my due diligence in comparing these models on an even playing field if I center and scale all of my numeric features, so that I can use one predictor matrix for all 250 response vectors rather than mixing and matching? Or can some algorithms actually suffer from variables being transformed?

Note that I am not at all worried about interpretability of the resulting model.
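For concreteness, here is a minimal sketch of the "one preprocessing, many models" setup in Python/scikit-learn; the two regressors below are stand-ins for the five inducers above, and the data are synthetic:

```python
# Hypothetical sketch: put the scaler inside a Pipeline so every model
# gets the same preprocessing, refit safely inside each CV fold.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)) * rng.uniform(1, 100, size=20)  # wildly mixed scales
y = X[:, 0] / 100 + rng.normal(size=1000)

for model in (KNeighborsRegressor(),
              RandomForestRegressor(n_estimators=200, random_state=0)):
    pipe = make_pipeline(StandardScaler(), model)
    score = cross_val_score(pipe, X, y, cv=5).mean()  # default scoring: R^2
    print(type(model).__name__, round(score, 3))
```

Putting the scaler inside the pipeline means it is refit on each training fold (no leakage into held-out data), and it costs essentially nothing for the models that are scale-invariant anyway.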

1 Answer


Yes, there are reasons. Since I have seen many questions about scaling, I ended up writing a small article about it here.

In short:

  1. Scaling may be a pure waste of time (especially true for decision-tree-based methods; see the sketch after this list).

  2. Scaling may harm your performance (think of image classification, where pixel values already share one meaningful scale, which per-feature standardization would destroy).

  3. When scaling, you have to pay extra attention to constant columns and missing values (though this can be handled by writing more tests).
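A small illustration of points 1 and 3, sketched in Python/scikit-learn with made-up data (none of this comes from the original answer):

```python
# Hypothetical demo: trees ignore scaling; constant columns break it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)

# Point 1: a tree splits on thresholds, so a monotone rescaling of the
# features leaves the learned partition, and thus the predictions, unchanged.
tree = DecisionTreeRegressor(random_state=0)
pred_raw = tree.fit(X, y).predict(X)
X_rescaled = X * 1000.0 + 5.0
pred_rescaled = tree.fit(X_rescaled, y).predict(X_rescaled)
print(np.allclose(pred_raw, pred_rescaled))  # True: scaling was wasted effort

# Point 3: a constant column has zero standard deviation, so naive
# standardization divides by zero and fills the column with NaN.
# (scikit-learn's StandardScaler guards against zero variance;
# a hand-rolled (x - mean) / std does not.)
Xc = X.copy()
Xc[:, 2] = 7.0
z = (Xc[:, 2] - Xc[:, 2].mean()) / Xc[:, 2].std()  # 0/0 -> NaN + RuntimeWarning
print(np.isnan(z).all())  # True
```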

RUser4512