
I am interested in a discussion of encoding and scaling categorical features, notably imbalanced categorical features. The context is neural networks (GBDTs should handle this easily). It has been known since at least LeCun et al. 1998 (Efficient BackProp) that numerical inputs should be rescaled to have a mean near zero and roughly unit variance.

However, for a rare event this means rescaling will produce unusually large values.

Typically, with a binary feature that is 1 about once in 1,000 samples:

import numpy as np

x = np.random.random(1_000_000) > 0.999  # boolean feature, True ~1 in 1,000
np.unique((x - x.mean()) / x.std())

this yields two values after rescaling:

array([-3.01134784e-02,  3.32077214e+01])

That is roughly -0.03 and 33. Is this the correct way to go? Are there any strong sources on this?
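
As a sanity check on those numbers, the two values can be derived in closed form; a minimal sketch, assuming the feature is Bernoulli(p):

import numpy as np

# Closed form: a Bernoulli(p) feature has mean p and std sqrt(p * (1 - p)),
# so standardizing maps 0 to -sqrt(p / (1 - p)) and 1 to sqrt((1 - p) / p).
p = 0.001
print(-np.sqrt(p / (1 - p)))  # -0.0316..., the value the zeros get
print(np.sqrt((1 - p) / p))   # 31.6..., the value the ones get

The empirical 33.2 above is the same formula evaluated at the realized positive rate of the sample, which happened to fall slightly below 1/1000.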

Lucas Morin
  • Scaling for categorical variables is already... fishy; in the case of asymmetric variables you might want to use robust scaling. – user2974951 Feb 02 '24 at 11:14
  • What do you mean by fishy? Robust scaling? (The robust scaling I know, from sklearn, would give NaNs.) – Lucas Morin Feb 02 '24 at 11:15
  • They don't really provide a robust answer on what should be done and why; the only practical answer assumes balanced data. As for robust scaling (which applies to continuous features): it would not work here, as the 1st and 3rd quartiles are equal (see the sketch after these comments). – Lucas Morin Feb 02 '24 at 13:12
  • That's the point: normalization makes sense only for continuous variables, since the coding of binary variables is arbitrary. Attempts at normalizing binary variables will cause headaches, as you have figured out. – user2974951 Feb 02 '24 at 13:18
  • Your LeCun link doesn't work (it tries to point to your PC). – Ben Reiniger Feb 02 '24 at 15:13
  • See also https://datascience.stackexchange.com/q/31652/55122, https://datascience.stackexchange.com/q/80234/55122, https://datascience.stackexchange.com/q/56444/55122. The first one comes closest to an answer here: Neil Slater reports slight improvements in neural networks after scaling dummy variables (which will be imbalanced when there are several levels in the original categorical). With neural networks, it won't make a difference in the actual functional space unless you apply regularization, but it may well make a difference numerically while fitting. – Ben Reiniger Feb 02 '24 at 15:19
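
To make the failure mode discussed in the comments concrete, here is a minimal sketch of robust scaling on such a feature, written in plain NumPy rather than sklearn's RobustScaler so the division is explicit:

import numpy as np

# The 25th and 75th percentiles of a 1-in-1000 binary feature are both 0,
# so the interquartile range that robust scaling divides by is 0.
x = (np.random.random(1_000_000) > 0.999).astype(float)
q1, q3 = np.percentile(x, [25, 75])
print(q1, q3)  # 0.0 0.0

# Naive robust scaling, (x - median) / IQR, therefore divides by zero:
with np.errstate(divide="ignore", invalid="ignore"):
    scaled = (x - np.median(x)) / (q3 - q1)
print(np.unique(scaled))  # [inf nan]: 1/0 for the ones, 0/0 for the zeros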

0 Answers