1

I've read that some models, such as decision trees, don't require scaling to work effectively.

However, the author of the linked article states there's no downside to scaling data for a decision tree either.

In general, is there ever a downside to scaling data?

Edit:

For example, using scikit-learn's StandardScaler on everything?

Connor
  • 625

3 Answers

2

Scaling does not affect most of the statistics of the dataset. It is a recommended - sometimes even required - pre-processing step for some machine learning algorithms.

However, note that it is a transformation that takes into account all observations in the dataset, affecting the value of each observation relative to that particular dataset. It is not a transformation that can be applied to a single isolated observation on its own. Thus, you may struggle if you build a classification model from scaled data and then want to classify a new, isolated observation that has not been scaled with the parameters derived from the original dataset.
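As a minimal sketch of that point (assuming scikit-learn, with hypothetical numbers): the scaler has to be fitted on the original data, and only its `transform` reused on the new observation.

```python
# Minimal sketch: a new observation must be scaled with the *training* scaler,
# not scaled on its own (assumes scikit-learn; numbers are hypothetical)
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[25.0], [38.0], [47.0], [61.0]])  # e.g. ages used to fit the model
scaler = StandardScaler().fit(X_train)                # mean/std come from the training set

x_new = np.array([[45.0]])                            # a single new, unscaled observation
x_new_scaled = scaler.transform(x_new)                # reuse the training mean/std
print(x_new_scaled)                                   # only now comparable to the scaled training data
```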

Take a decision tree, for instance. This is a simple model, used mainly for its explainability and interpretability rather than its predictive power; a decision tree can be easily used by anyone without any specific training. The algorithm would work fine if you scaled the data prior to fitting the model, but if you did so, you could no longer easily apply the model to new data for classification, nor read its rules. Example: where a decision tree would have a decision node on "age < 45 years old" (which is understandable), with scaled data you might end up with that node reading "age < 0.021" (which is meaningless).
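A minimal sketch of that effect, assuming scikit-learn and made-up ages (the exact thresholds printed will vary with the data):

```python
# Minimal sketch: scaling changes a decision tree's split thresholds
# from readable units (years) to standardized values (hypothetical data)
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=200).reshape(-1, 1).astype(float)  # hypothetical ages
y = (age.ravel() >= 45).astype(int)                                # label driven by a 45-year cutoff

# Unscaled: the root split is expressed in years and easy to read
tree_raw = DecisionTreeClassifier(max_depth=1).fit(age, y)
print(tree_raw.tree_.threshold[0])       # roughly 44.5 -- readable as an age in years

# Scaled: same split quality, but the threshold is in standard deviations
age_scaled = StandardScaler().fit_transform(age)
tree_scaled = DecisionTreeClassifier(max_depth=1).fit(age_scaled, y)
print(tree_scaled.tree_.threshold[0])    # a small standardized value, meaningless as an age
```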

  • So there's no statistical issue, just an issue with interpretability? When you say "most of the statistics", what do you mean? – Connor Mar 31 '23 at 08:27
  • 1
I mean the statistics of the dataset as a whole: it will not change covariance or correlation between variables; it will not affect a principal components analysis; you can still perform linear regression, though the intercept will be affected after scaling. For statistics within each variable - e.g. mean, median, variance of each feature - those are directly affected by scaling. – jma.alves Mar 31 '23 at 08:44
  • 1
    If you rescale the data the covariance will change, as the units of $Cov(X, Y)$ are the units of $X$ multiplied by units of $Y$ – jcken Mar 31 '23 at 10:46
  • @jcken is the covariance scaling an issue? Or do they hold their original relative patterns? – Connor Mar 31 '23 at 12:40
  • 1
    @Connor the dependence structure will not change, (so e.g. correlation will remain the same). Covariance scaling is probably not an issue, unless you want to interpret the fitted parameters of your model, then you should factor in any scaling you have done – jcken Mar 31 '23 at 13:52
  • @jcken to clarify, do you mean the hyperparameters of my model? Or do you mean the weights and biases (or other internal parameters). – Connor Mar 31 '23 at 14:24
2

If you browse the questions under the relevant tag, you'll learn about many benefits of scaling, though not all algorithms need it. To answer whether there are any downsides, consider what scaling is: both standardization and normalization amount to subtracting something and then dividing by something. Let's discuss those two operations.
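For concreteness, the usual formulas are (standardization on the left, min-max normalization on the right), for a feature $x$ with mean $\mu$, standard deviation $\sigma$, minimum $x_{\min}$ and maximum $x_{\max}$:

$$z = \frac{x - \mu}{\sigma}, \qquad x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}.$$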

To calculate a feature such as "number of years since 1995" you would subtract 1995 from the current year. You could alternatively create a different feature, "number of years since 1997", and the two would differ only in what you subtracted. If your algorithm broke depending on whether your baseline was 1995 or 1997, something would be very wrong with it.

The same applies to division. If your algorithm behaved differently depending on whether your variables were in meters vs kilometers, or minutes vs seconds, it wouldn't be something you could use to solve generic problems.
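As a rough illustration of both points, here is a sketch (assuming scikit-learn, with made-up data) showing that a decision tree's predictions do not change when a feature is shifted to a different baseline or expressed in a different unit:

```python
# Minimal sketch: shifting or rescaling a feature does not change a tree's predictions
# (hypothetical data; assumes scikit-learn)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
year = rng.integers(1960, 2020, size=300).reshape(-1, 1)   # hypothetical years
y = (year.ravel() > 1990).astype(int)

since_1995 = year - 1995        # "years since 1995"
since_1997 = year - 1997        # same information, different baseline
in_months = (year - 1995) * 12  # same information, different unit

preds = [
    DecisionTreeClassifier(random_state=0).fit(x, y).predict(x)
    for x in (since_1995, since_1997, in_months)
]
# The fitted thresholds differ, but the predictions are identical
print(all(np.array_equal(preds[0], p) for p in preds[1:]))  # True
```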

The downside of scaling could be worse interpretability (though in some cases it can be the other way around), but in general we don't want the algorithms to be sensitive to something like the scale of the features.

Finally, keep in mind that there are models that accept only certain kinds of features (e.g. only binary features in vanilla LCA), where obviously you couldn't use scaled features.

That said, you should not mindlessly apply any feature transformation "as a default", even if it is harmless. If you did, sooner or later you would regret it, because it would add unnecessary complication to the code, accidentally introduce bugs, slow the code down, or lead to other unanticipated problems. Every such default behavior in a piece of software has a history of GitHub issues or e-mails from angry users for whom, in their specific case, it led to something they didn't want or expect.

Tim
  • 138,066
  • Thank you! So the default position is to not scale if it's not needed because it avoids pipeline complexity? – Connor Mar 31 '23 at 09:47
2

Your updated question references applying StandardScaler to everything. From the documentation, by default this centers each feature (by subtracting its mean) and then rescales it to unit variance.

Clearly, scaling features independently like this could change your analysis (compare a PCA on scaled vs. unscaled data, for example), which might be unintended. And the centering might cause some modelling steps not to work correctly (e.g. it will introduce negative values, which will break a subsequent log or square-root transformation).
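A small sketch of that second point (assuming scikit-learn and NumPy, with made-up positive values):

```python
# Minimal sketch: centering introduces negative values, which breaks a log transform
# (hypothetical positive-valued data; assumes scikit-learn and NumPy)
import numpy as np
from sklearn.preprocessing import StandardScaler

income = np.array([[20_000.0], [35_000.0], [50_000.0], [120_000.0]])  # all positive
print(np.log(income))          # fine: every value is positive

scaled = StandardScaler().fit_transform(income)   # centered around 0
with np.errstate(invalid="ignore"):
    print(np.log(scaled))      # NaN wherever centering produced a negative value
```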

George Savva
  • 2,290
  • Thank you! I note that you said the Standard Scaler suffers from these issues. Is there another scaler that doesn't suffer from these issues/ any issues at all? Or are there always some downsides to consider? – Connor Mar 31 '23 at 10:43
  • 1
    I can't think of any transformation that is guaranteed to be free of unintended consequences. As @Tim says you need to think about things like transformations on a case-by-case basis. – George Savva Mar 31 '23 at 10:46