
My data set has a Gender column, so I have to apply One-Hot Encoding to perform KMeans clustering.

Q1. Should I worry about the dummy variable trap here? Should I remove one of the dummy columns?

Q2. Should I use Label Encoding or One-Hot Encoding for clustering algorithms?

  • It is improper to do k-means clustering with categorical data. Well, it is sometimes used with naturally binary data, such as p attributes that can each be present (1) or absent (0). In that case one could argue that the 0-1 scale is a binned version of an "original" unobserved or underlying continuous scale, and on the strength of this idea people indulge in k-means. But categorical data such as "city" cannot be thought of as a simply binned continuous variable or a set of those, irrespective of how you choose to encode it. (As a predictor in a regression - you can. But not in a cluster analysis.) – ttnphns Jul 22 '23 at 11:33

1 Answer

  • The dummy variable trap is a problem only in classical regression models. It is not a problem for most machine learning models, including things like regularized regression.
  • Label encoding is a poor way of encoding categorical data. If you encode {red, green, blue} as {1, 2, 3}, you implicitly assume that the categories carry numerical meaning, so that, for example, red + 2 = blue, which doesn't make sense. It is rarely a valid choice.
  • You should use dummy encoding or one-hot encoding. For the difference between them, check the One-hot vs dummy encoding in Scikit-learn thread. TL;DR: dummy encoding is one-hot encoding with one of the categories dropped; in some cases you use one, in others the other.
  • If your “gender” feature has only two categories, you can use dummy coding, i.e. a single column of zeros and ones. One-hot encoding would introduce a redundant column, so it is rarely used for two categories. On the other hand, if your “gender” feature has more categories, one-hot encoding may make sense, as sketched in the code below.
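
A minimal sketch of what the bullets above describe, assuming pandas and scikit-learn; the toy data, column names, and number of clusters are invented purely for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data: a binary "gender" column, a multi-level "city" column, a numeric "age" column.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "city":   ["NY", "LA", "NY", "SF", "LA", "SF"],
    "age":    [23, 31, 45, 27, 38, 52],
})

# Binary feature: dummy coding (drop one level) gives a single 0/1 column
# instead of two redundant indicator columns.
gender_dummy = pd.get_dummies(df["gender"], prefix="gender", drop_first=True)

# Multi-level feature: one-hot encoding keeps one indicator per category.
city_onehot = pd.get_dummies(df["city"], prefix="city")

# Scale the numeric column so it doesn't dominate the 0/1 indicators.
age_scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[["age"]]), columns=["age_scaled"]
)

X = pd.concat([age_scaled, gender_dummy, city_onehot], axis=1).astype(float)

# KMeans on the encoded matrix (keeping in mind ttnphns's caveat above about
# whether k-means is appropriate for categorical data at all).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```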
Tim
  • So I should not be afraid of the dummy variable trap in ML models other than regression models, right? – mainak mukherjee Jul 22 '23 at 10:38
  • 1
    I prefer to think that dummy coding and one-hot coding are synonyms. More formal name for it is "indicator type of contrast (en)coding". Dummy is an older argo from statistics while one-hot is a younger one from machine learning. Both this and that can be with or without dropping of the "redundant" category. – ttnphns Jul 22 '23 at 11:08