I am trying to collect the main reasons why we need to scale the independent variables in an ML model. I have three reasons so far. Please let me know if I am missing any.

  1. It corrects for variables with large nominal values having a bigger impact on a classifier. E.g., unscaled, a salary difference of $1K will have a higher impact than an age difference of 50 years.

  2. All X's end up on one universal scale, rather than each on a different scale (e.g., age, minutes, dollars).

  3. This leads to better outlier detection: the same threshold can be used for all variables to establish what constitutes an outlier.
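To make reason 1 concrete, here is a minimal sketch, assuming a distance-based classifier (e.g., k-nearest neighbours) that compares observations by Euclidean distance. The toy data in `X` and the helper `euclidean` are hypothetical and only for illustration; standardization here means subtracting the column mean and dividing by the column standard deviation.

```python
import numpy as np

# Hypothetical toy data: columns are [age in years, salary in dollars].
X = np.array([
    [25.0, 50_000.0],
    [75.0, 50_000.0],   # 50 years older than row 0, same salary
    [25.0, 51_000.0],   # same age as row 0, $1K higher salary
    [40.0, 90_000.0],
    [60.0, 30_000.0],
])


def euclidean(a, b):
    """Euclidean distance between two observations."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# On the raw data, the $1K salary gap looks 20x larger than the 50-year age gap.
d_age_raw = euclidean(X[0], X[1])
d_salary_raw = euclidean(X[0], X[2])

# Standardize each column to mean 0 and (population) standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# After scaling, each gap is measured in standard deviations of its own column,
# so the 50-year age gap is now the larger of the two.
d_age_std = euclidean(Z[0], Z[1])
d_salary_std = euclidean(Z[0], Z[2])

print(f"raw:    age gap = {d_age_raw:.1f}, salary gap = {d_salary_raw:.1f}")
print(f"scaled: age gap = {d_age_std:.3f}, salary gap = {d_salary_std:.3f}")
```

This only applies to models that are sensitive to feature magnitude; whether a given model needs scaling at all is a separate question.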

Katsu
  • Do we need to scale the features? – Dave Jan 10 '23 at 22:06
  • 3 is a non-reason. Scaling the features won't let you detect outliers better. – Tim Jan 10 '23 at 23:29
  • @Tim Yes, I meant that you would have the same standard across all features for defining what constitutes an outlier, vs. different standards for different variables. – Katsu Jan 11 '23 at 00:23
  • @Katsu I contest that claim. What if one of the features is skewed and another is uniform? – Dave Jan 11 '23 at 00:29
  • @Dave Isn't this just scaling? We haven't transformed anything. So the uniform feature would remain uniform except that it now has a mean of 0, and the skewed feature would remain skewed except that it now has a mean of 0. You may log-transform the skewed feature to make it closer to normal, but that is outside the scope of this question, I believe. – Katsu Jan 11 '23 at 18:24
  • The threshold for considering an observation an outlier is likely to differ between a skewed and a uniform distribution, even if they have the same mean and variance. – Dave Jan 11 '23 at 18:26
  • After you transform the skewed variable to a normal distribution, wouldn't you have the same threshold? – Katsu Jan 11 '23 at 18:35
  • I find it hard to consider any point to be an outlier when the distribution is uniform. For a skewed distribution, however, there could be a bunching of data down low and then some stragglers way up high that might be considered outliers. – Dave Jan 11 '23 at 18:41
  • Really? You would still include those points at the extremes (e.g., below the 1st and above the 99th percentile) of a normal distribution? – Katsu Jan 11 '23 at 19:23
  • What distribution is normal? Do you mean standardized to have a mean of zero and variance of one? – Dave Jan 11 '23 at 19:26
  • Yes, I'm imagining points beyond ±4 SD as outliers: https://blogs.sas.com/content/iml/files/2019/07/rule6895.png – Katsu Jan 11 '23 at 19:29
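The disagreement in the comments above can be checked with a small simulation. This is a rough sketch, assuming a uniform feature and an exponential feature as stand-ins for "uniform" and "skewed"; the sample size, seed, and the helper `standardize` are arbitrary choices for illustration. Both features are standardized to mean 0 and variance 1, then the same ±4 SD rule is applied to each.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 100_000


def standardize(x):
    """Shift to mean 0 and rescale to standard deviation 1."""
    return (x - x.mean()) / x.std()


# Standardizing changes location and scale, not the shape of the distribution.
z_uniform = standardize(rng.uniform(0.0, 1.0, size=n))
z_skewed = standardize(rng.exponential(1.0, size=n))

# A standardized uniform variable is bounded by about sqrt(3) ~ 1.73 in
# absolute value, so a +/-4 SD rule can never flag any of its points.
max_abs_uniform = float(np.abs(z_uniform).max())
frac_beyond_4_uniform = float(np.mean(np.abs(z_uniform) > 4))

# A standardized exponential variable has a long right tail, so the same
# rule flags a noticeable share of perfectly ordinary draws.
frac_beyond_4_skewed = float(np.mean(np.abs(z_skewed) > 4))

print(f"uniform: max |z| = {max_abs_uniform:.2f}, "
      f"share beyond 4 SD = {frac_beyond_4_uniform:.4%}")
print(f"skewed:  share beyond 4 SD = {frac_beyond_4_skewed:.4%}")
```

Under these assumptions, both features share the same mean and variance after scaling, yet one fixed threshold behaves very differently on each, which is the point raised about skewed versus uniform features.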

0 Answers