
I am working on an attrition dataset that has a large number of categorical features, each with high cardinality, so one-hot encoding them is out of the question. I was looking for models that can handle high-cardinality categorical data and came across CatBoost and LightGBM. CatBoost is working as expected. However, with LightGBM I am unable to use my categorical features. The following lines are taken from the official LightGBM documentation, and I am struggling to understand them.

LightGBM can use categorical features as input directly. It doesn’t need to convert to one-hot encoding, and is much faster than one-hot encoding (about 8x speed-up).

Note: You should convert your categorical features to int type before you construct Dataset.

How do I convert nominal data to int?!

If I follow the documentation, I get the following error:

ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in the following fields: Business, Segment Desc, Family Desc, Class Desc, Job Desc, Site Tag, City Desc, Employee Group, Gender, Marital Status, Award Desc, Shift Schedule


1 Answer


You can assign an integer number to every category (plus, to be safe, an "other" category for anything new, into which you may also want to group rare categories).
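A minimal sketch of that mapping in pandas, then handing the integer-coded columns to LightGBM; the DataFrame, column names, and `min_count` threshold below are made up for illustration:

```python
import pandas as pd
import lightgbm as lgb

# Toy stand-in for the attrition DataFrame; column names are illustrative.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "marital_status": ["Single", "Married", "Single", "Divorced", "Married", "Single"],
    "attrition": [1, 0, 0, 1, 0, 1],
})
cat_cols = ["gender", "marital_status"]

def build_mapping(series, min_count=1):
    """Map each sufficiently frequent category to 1, 2, 3, ...;
    0 is reserved for 'other' (rare or previously unseen categories)."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    return {cat: i + 1 for i, cat in enumerate(frequent)}

mappings = {c: build_mapping(df[c]) for c in cat_cols}
for c in cat_cols:
    # Unknown/rare values map to NaN and are bucketed into 0 ("other").
    df[c] = df[c].map(mappings[c]).fillna(0).astype(int)

# The columns are now int, so the Dataset constructor accepts them;
# mark them as categorical so LightGBM treats them as such.
train_set = lgb.Dataset(df[cat_cols], label=df["attrition"],
                        categorical_feature=cat_cols)
```

Equivalently, `df[c].astype("category").cat.codes` gives the same kind of integer coding in one line, though it has no explicit "other" bucket for categories that only appear at prediction time.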

There are already a lot of other answers out there on how one can represent categorical features via dimensionality reduction and ideas like target encoding, random effects, and embeddings. I would especially consider target encoding, or training a neural network with an embedding layer and then feeding the learned embeddings for the categorical features to LightGBM as features. If you have a cold-start problem (i.e. some categories will initially have no data), Bayesian target encoding can be helpful: e.g. if you are predicting a proportion (or its logit), give each category a Beta(0.5, 0.5) prior and provide LightGBM not just with the posterior mean or median, but also the inter-quartile range, or the 10th and 90th percentiles. That tells the model about the uncertainty around a new category (we used that idea for predicting drug approvals, where there are constantly new drug classes etc.).
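As a rough sketch of that Bayesian target-encoding idea for a binary attrition target, assuming a Beta(0.5, 0.5) prior per category (the column names and toy data are made up):

```python
import pandas as pd
from scipy import stats

# Toy data; "job_desc" and "attrition" are illustrative column names.
df = pd.DataFrame({
    "job_desc": ["Analyst", "Analyst", "Engineer", "Engineer", "Engineer", "Manager"],
    "attrition": [1, 0, 0, 0, 1, 1],
})

prior_a, prior_b = 0.5, 0.5  # Beta(0.5, 0.5) prior on the attrition rate
rows = []
for cat, grp in df.groupby("job_desc"):
    successes = grp["attrition"].sum()
    failures = len(grp) - successes
    post = stats.beta(prior_a + successes, prior_b + failures)  # posterior per category
    rows.append({
        "job_desc": cat,
        "te_mean": post.mean(),    # point estimate of the attrition rate
        "te_p10": post.ppf(0.10),  # lower percentile
        "te_p90": post.ppf(0.90),  # upper percentile (a wide spread = high uncertainty)
    })
encoding = pd.DataFrame(rows)

# A brand-new category falls back to the prior alone, i.e. maximum uncertainty.
prior = stats.beta(prior_a, prior_b)
fallback = {"te_mean": prior.mean(),
            "te_p10": prior.ppf(0.10),
            "te_p90": prior.ppf(0.90)}

# Merge back; LightGBM then sees numeric features carrying both the
# estimate and how uncertain it is for each category.
df = df.merge(encoding, on="job_desc", how="left")
```

In practice you would compute these statistics out-of-fold (or on the training data only) to avoid target leakage.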

I can only recommend, again, looking at what people tend to do in Kaggle competitions, where LightGBM is widely used and high-cardinality categorical data (e.g. users, products, shops, locations) is common. Besides browsing the forums for solutions to past competitions, there is Kaggle Competitions Grandmaster Thakur's book, the Kaggle Book, the book accompanying the fast.ai course, and the excellent "How to Win a Data Science Competition: Learn from Top Kagglers" course on coursera.org (as of May 2023 inaccessible unless you have already enrolled, due to the course's association with a Moscow university).

Björn