How to encode a categorical feature with high cardinality?

Question

I'm stuck on a dataset that contains some categorical features with high cardinality, like 'item_description' ... I read about a trick called hashing, but its main idea is still blurry to me. I also read about a library called 'Feature engine', but I didn't find anything there that solves my issue. Any suggestions, please?

is item description a long string? meaningful english string? — Zabir Al Nazi, May 04 '20 at 05:11

avvinci · Answer 1 · 2020-05-14T15:51:36.213

Options:

i) Use Target encoding.

More on target encoding : https://maxhalford.github.io/blog/target-encoding/
Good tutorial on categorical variables here: https://www.coursera.org/learn/competitive-data-science#syllabus [Section: Feature Preprocessing and Generation with Respect to Models , 3rd Video]
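
To make the idea concrete, here is a minimal, dependency-free sketch of target encoding with additive smoothing. The function names, the toy data and the smoothing weight `m` are my own assumptions for illustration, not the API of any library. Each category is replaced by a blend of its own target mean and the global target mean, so rare categories are pulled towards the global mean.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, m=10.0):
    """Learn a smoothed target mean per category (m is a hypothetical smoothing weight)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    # Blend the category mean with the global mean; m acts like m pseudo-observations.
    mapping = {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}
    return mapping, global_mean

def transform_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen categories fall back to the global mean."""
    return [mapping.get(c, global_mean) for c in categories]

# Toy data (made up): item names and a binary target.
items = ["mug", "mug", "mug", "lamp", "lamp", "sofa"]
sold = [1, 1, 0, 0, 0, 1]

mapping, prior = fit_target_encoding(items, sold, m=2.0)
encoded = transform_target_encoding(items + ["desk"], mapping, prior)
```

In practice the mapping should be fitted on training folds only, otherwise the target leaks into the feature.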

ii) Use entity embeddings: in short, this technique represents each category by a vector, which is then trained so that it captures the characteristics of the category.

Tutorial : https://towardsdatascience.com/deep-learning-structured-data-8d6a278f3088
Notebook implementations:
1. https://www.kaggle.com/aquatic/entity-embedding-neural-net
2. https://www.kaggle.com/abhishek/same-old-entity-embeddings
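
As a rough illustration of what "a vector per category, then training" means, here is a tiny pure-Python sketch. It is not how the notebooks above do it (they use a deep learning framework with an embedding layer); the model (a linear output on top of the embedding), the toy data and all names are assumptions made for the example.

```python
import random

def train_entity_embeddings(categories, targets, dim=2, epochs=200, lr=0.05, seed=0):
    """Learn one dim-sized vector per category by SGD on squared error."""
    rng = random.Random(seed)
    vocab = {c: i for i, c in enumerate(sorted(set(categories)))}
    # One small random vector per category, plus a shared linear output layer.
    emb = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]
    w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    bias = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for c, y in zip(categories, targets):
            e = emb[vocab[c]]  # embedding lookup
            err = sum(wk * ek for wk, ek in zip(w, e)) + bias - y
            total += err * err
            for k in range(dim):
                # Gradients of 0.5 * err**2 with respect to w[k] and e[k].
                grad_w, grad_e = err * e[k], err * w[k]
                w[k] -= lr * grad_w
                e[k] -= lr * grad_e
            bias -= lr * err
        losses.append(total / len(targets))
    return vocab, emb, losses

# Toy data (made up): item names and a numeric target.
items = ["mug", "mug", "mug", "lamp", "lamp", "sofa"]
price = [1.0, 1.0, 1.0, 0.0, 0.0, 2.0]

vocab, emb, losses = train_entity_embeddings(items, price)
item_vectors = {c: emb[i] for c, i in vocab.items()}
```

After training, `item_vectors` can be fed to any downstream model in place of the raw category.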

iii) Use CatBoost, which handles categorical features natively:

Tutorial : https://www.kaggle.com/mitribunskiy/tutorial-catboost-overview/notebook

Extra: the hashing trick might also be helpful: https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087?gi=3045c6e13ee5
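
Since the question says the idea behind hashing is blurry, here is a minimal sketch of it using only the standard library. The function names and the number of buckets are assumptions for illustration. Each category string is hashed to one of a fixed number of buckets, so the encoded width stays constant no matter how many distinct categories appear; the price is that different categories can collide in the same bucket.

```python
import hashlib

def hash_bucket(value, n_buckets=8):
    """Map a string to a stable bucket index in [0, n_buckets)."""
    # md5 is used instead of the built-in hash(), which is salted per process for strings.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(value, n_buckets=8):
    """One-hot style vector of fixed length n_buckets for a single category."""
    vec = [0] * n_buckets
    vec[hash_bucket(value, n_buckets)] = 1
    return vec

# Toy categories (made up); unseen values need no refitting.
descriptions = ["red ceramic mug", "desk lamp", "three-seat sofa"]
encoded = [hash_encode(d) for d in descriptions]
```

More buckets mean fewer collisions but a wider feature matrix, so `n_buckets` is a trade-off to tune.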

yatu · Answer 2 · 2020-05-04T08:19:19.280

You could look into the category_encoders library. It offers many different encoders that can encode a high-cardinality column into a single column. Among them are the so-called Bayesian encoders, which use information from the target variable to transform a given feature. For instance, TargetEncoder uses Bayesian principles to replace a categorical feature with the expected value of the target given the value the category takes, which is very similar to LeaveOneOut. You may also check the CatBoost-based CatBoostEncoder, which is a common choice for feature encoding.
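
To show how leave-one-out encoding differs from plain target encoding, here is a small hand-rolled sketch. It is not the category_encoders implementation; the function name, the toy data and the fallback to the global mean for categories that occur only once are assumptions. Each row gets the mean target of the other rows in the same category, so a row's own target never leaks into its own feature.

```python
from collections import defaultdict

def leave_one_out_encode(categories, targets):
    """Encode each training row with the target mean of the other rows in its category."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    encoded = []
    for c, y in zip(categories, targets):
        if counts[c] > 1:
            # Remove the current row's target before averaging.
            encoded.append((sums[c] - y) / (counts[c] - 1))
        else:
            # Assumption: a category seen once falls back to the global mean.
            encoded.append(global_mean)
    return encoded

# Toy data (made up): item names and a binary target.
items = ["mug", "mug", "mug", "lamp", "lamp", "sofa"]
sold = [1, 1, 0, 0, 0, 1]

loo = leave_one_out_encode(items, sold)
```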

score 0 · Answer 3 · answered Sep 21 '21 at 07:23

For variables like "item_description" which are in essence text variables, check this paper and corresponding Python package.

Or simply search online for "dirty categorical variables". If in doubt about the source: the article and package are from Gaël Varoquaux, one of the main developers of scikit-learn.
