
When working with data whose features are measured in different units or on different scales, you do not want one feature to dominate.

This means feature scaling! A very intuitive way is min-max scaling, which rescales everything to lie between 0 and 1.

What I do not understand, and what is not at all intuitive to me, is using the z-score for feature scaling.

Why is the z-score used? What is the motivation to use it rather than min-max? Why is it a good idea to express your data in standard deviations from the mean? Why isn't min-max used all the time, and what problem does the z-score solve that min-max does not?
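
To be concrete, the two transforms I am comparing are usually written as

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad \text{and} \qquad z = \frac{x - \bar{x}}{s},$$

where $\bar{x}$ is the sample mean and $s$ the sample standard deviation.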

I hope someone can help me make this clear.

JaySmi
  • One of many possible explanations is that the z-score is the Mahalanobis distance (in one dimension). See https://stats.stackexchange.com/questions/62092 for some explanations of what that means. There are all kinds of reasons not to scale by the range, not least that with potentially unbounded data the range is one of the least stable statistics one can imagine (see the simulation sketch after this comment thread). Related topics are correlation, (univariate) regression, and the 68-95-99.7 rule. – whuber Oct 07 '21 at 21:17
  • Thanks for the response, whuber!

    Mahalanobis distance:

    The Mahalanobis distance makes sense to me for detecting outliers, but I do not understand how you can use it to motivate feature scaling with the z-score.

    Range stability:

    You said that the range is one of the least stable statistics. What do you mean by that? What does it mean for a statistic to be stable, and in what sense is the range not stable? Why are standard deviation units more stable?

    – JaySmi Oct 08 '21 at 15:22
  • 1
    Concepts of robust statistics will explain all this. – whuber Oct 08 '21 at 16:23
  • Oh, you mean robust as in "robust to outliers"? If that is the case, I still do not see why one would pick the z-score to scale data. The z-score uses the mean, not the median, and we can show that the mean is not robust to outliers. I still do not see the motivation to use the z-score rather than always picking min-max. If your data has outliers, we can track them down and remove them. And if you have features that are not all normal, isn't min-max better? Maybe you have a simple example showing why it makes sense to use the z-score over min-max. – JaySmi Oct 08 '21 at 22:19
  • 1
    If you don't "track them down and remove them," then the min or max or both will be outliers and using them for your normalization screws up all the data. That's the basic problem. – whuber Oct 09 '21 at 13:14
  • I agree that tracking down and removing outliers is needed, but that is not my problem. It is still min-max vs. z-score for feature scaling. To me min-max is the most natural and instinctive way to scale data, yet somebody came up with the idea of using the z-score instead. I have gotten a lot of answers without the WHY, just phrases like "z-score handles outliers better" or "z-score is good for unbounded data". I would like to understand WHY these statements are right, what motivated people to use the z-score for scaling, why it works so well, and why you should use it. – JaySmi Oct 09 '21 at 18:20
  • Thank you for your pertinent question, which I also ask myself. Despite much research, I still cannot find the answer. Have you been able to find a logical explanation that justifies the choice of min-max or z-score? – Fariss Aouda May 21 '23 at 08:52
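
To make the stability point from the comments concrete, here is a minimal simulation sketch, assuming a heavy-tailed lognormal population (the distribution, sample size, and resample count are arbitrary illustrative choices): across repeated samples, the range varies far more relative to its mean than the standard deviation does.

```python
import numpy as np

# Hypothetical illustration of the range-instability point: the lognormal
# population, sample size, and number of resamples are arbitrary choices.
rng = np.random.default_rng(0)

ranges, sds = [], []
for _ in range(1000):
    x = rng.lognormal(mean=0.0, sigma=1.0, size=200)
    ranges.append(x.max() - x.min())
    sds.append(x.std(ddof=1))

ranges, sds = np.array(ranges), np.array(sds)

# Compare relative variability (coefficient of variation) across resamples:
# the range swings far more from sample to sample than the SD does.
print(f"range: CV = {ranges.std() / ranges.mean():.2f}")
print(f"sd:    CV = {sds.std() / sds.mean():.2f}")
```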

1 Answer


The answer to your specific question about why z-score normalisation handles outliers better largely comes down to how the standard deviation is calculated in the first place. If there are outliers, the effect that their deviations from the mean have on the final statistic (i.e., the standard deviation, the same value used to normalise the feature) is diluted by the rest of the deviations within that feature. In short, the standard deviation is an aggregated statistic, so individual values carry less weight the more observations there are.

Min-max scaling is the opposite: the values used to normalise the data are literally the outliers themselves (assuming there are outliers, of course). No aggregating, no averaging; you take the minimum value, take the maximum value, and normalise every observation in the feature relative to those two values. If the minimum and maximum happen to be outliers, you can see how they will dominate the resulting normalisation. The sketch below makes this concrete.

How important this difference is will probably depend on the model the data is being preprocessed for, and the question of why those outliers were kept in the data in the first place is also valid, but maybe that is another discussion entirely. Anyway, hope this helps.
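
As a minimal sketch of this contrast (the toy numbers below are purely illustrative), consider five ordinary values and one outlier, scaled both ways:

```python
import numpy as np

# Toy data (hypothetical numbers): five ordinary values plus one outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

minmax = (x - x.min()) / (x.max() - x.min())
zscore = (x - x.mean()) / x.std(ddof=1)

# Min-max: the outlier sets the scale, squashing the ordinary values
# into roughly [0, 0.04].
print(minmax)
# Z-score: the ordinary values keep a usable spread, and the outlier
# lands only about two standard deviations from the mean.
print(zscore)
```

Scaling relative to an aggregate (mean and standard deviation) dilutes the outlier's influence; scaling relative to the two extremes hands it full control of the result.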