
I was reviewing an ML notebook in which part of the EDA looks at the cardinality of categorical variables. In the notebook as prepared there was no strange result, but what if an attribute has very high cardinality? For example, if a dataset of 10,000 rows has an attribute with a cardinality of 5,000, that is 50% of the size of the data. Searching Stack Overflow, different solutions are proposed for vectorising it:
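As a quick sketch of the check described above (column names and data here are made up for illustration), the cardinality ratio is just the number of distinct values divided by the number of rows:

```python
import pandas as pd

# Hypothetical 10,000-row dataset with one low- and one high-cardinality column.
df = pd.DataFrame({
    "city": [f"city_{i % 20}" for i in range(10_000)],        # 20 distinct values
    "user_id": [f"user_{i % 5_000}" for i in range(10_000)],  # 5,000 distinct values
})

# Cardinality ratio per column: distinct values / number of rows.
ratio = df.nunique() / len(df)
print(ratio)
# city       0.002
# user_id    0.500
```

A ratio near 0.5, as in the `user_id` column, is the situation described in the question.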

https://stackoverflow.com/questions/33043222/features-with-high-cardinality-how-to-vectorize-them

But in my opinion this attribute should be discarded because it is not useful for predicting anything. Is this a false assumption? Is there a rule?

1 Answer


In your example, many of these 5,000 values must occur only once, so, as you say, you cannot fit them. However, it might be that a few values occur hundreds of times, or even more, and those values might be predictive. This is not uncommon with text data.

chrishmorris