1

I'm stuck on a dataset that contains some categorical features with high cardinality, like 'item_description'. I read about a trick called hashing, but its main idea is still blurry to me. I also read about a library called 'Feature engine', but I didn't find anything there that solves my issue. Any suggestions, please?

yatu
MBA

3 Answers

3

Options:

i) Use Target encoding.

ii) Use entity embeddings: in short, this technique represents each category by a vector that is learned during training, so that the vector captures the characteristics of the category.

iii) Use CatBoost, which handles categorical features natively.

Extra: There is a hashing trick technique which might also be helpful: https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087?gi=3045c6e13ee5
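Since the question says the idea behind hashing is still blurry, here is a minimal sketch of it using only the standard library. Each category string is passed through a hash function and mapped to one of a fixed number of columns, so the output width stays the same no matter how many distinct categories exist. The names `hash_bucket` and `hash_encode`, the choice of MD5, and the bucket count of 8 are assumptions for illustration, not taken from the linked article or from any library.

```python
import hashlib


def hash_bucket(value: str, n_buckets: int) -> int:
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    # hashlib is used instead of the built-in hash() so that the result
    # is the same across Python processes.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_encode(values, n_buckets=8):
    """Encode each value as a fixed-width row with a single 1 in its bucket."""
    rows = []
    for value in values:
        row = [0] * n_buckets
        row[hash_bucket(value, n_buckets)] = 1
        rows.append(row)
    return rows


# Hypothetical example values for an 'item_description' column.
items = ["red wool scarf", "blue cotton shirt", "red wool scarf", "never seen before"]
encoded = hash_encode(items, n_buckets=8)
```

Identical strings always land in the same column, and unseen categories need no refitting. The trade-off is that two different categories can collide in one bucket, and collisions become more likely as the number of buckets shrinks.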

avvinci
1

You could look into the category_encoders package. It provides many different encoders that can encode a high-cardinality column into a single column. Among them are the so-called Bayesian encoders, which use information from the target variable to transform a given feature. For instance, TargetEncoder uses Bayesian principles to replace a categorical feature with the expected value of the target given the value the category takes, which is very similar to LeaveOneOut. You may also check the CatBoost-based CatBoostEncoder, which is a common choice for feature encoding.
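To make the idea of replacing a category with the expected value of the target more concrete, here is a rough pure-Python sketch. It is not the category_encoders implementation: it uses a simple additive smoothing towards the global mean, and the function names, the `smoothing` parameter, and the toy data are all assumptions chosen for illustration.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=1.0):
    """Learn a smoothed mean of the target for every category."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    # Rare categories are pulled towards the global mean; frequent ones
    # stay close to their own observed mean.
    mapping = {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }
    return mapping, global_mean


def transform(categories, mapping, global_mean):
    """Replace each category by its learned value; unseen ones get the global mean."""
    return [mapping.get(category, global_mean) for category in categories]


# Hypothetical toy data: one categorical column and a binary target.
train_categories = ["a", "a", "b", "b", "b", "c"]
train_targets = [1, 0, 1, 1, 1, 0]

mapping, global_mean = fit_target_encoding(train_categories, train_targets)
encoded = transform(["a", "b", "c", "unseen"], mapping, global_mean)
```

Because the encoding is computed from the target, the mapping should be fitted on the training data only and then applied to validation or test data. Otherwise the target leaks into the feature.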

yatu
0

For variables like "item_description", which are in essence text variables, check this paper and corresponding Python package.

Or simply search online for "dirty categorical variables". If in doubt, the article and package are by Gaël Varoquaux, one of the main developers of scikit-learn.

Sole Galli