
I was reviewing an ML notebook in which part of the EDA looks at the cardinality of categorical variables. In the notebook as prepared there was no strange result, but what if an attribute has very high cardinality? For example, if a dataset of 10,000 rows has an attribute with a cardinality of 5,000, that is 50% of the size of the data. Searching Stack Overflow, different solutions are proposed for vectorising it:
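As a quick sketch of the check described above (column names and data here are made up for illustration), the cardinality ratio is just the number of distinct values divided by the number of rows:

```python
import pandas as pd

# Hypothetical 10,000-row dataset with one low- and one high-cardinality column.
df = pd.DataFrame({
    "city": [f"city_{i % 20}" for i in range(10_000)],        # 20 distinct values
    "user_id": [f"user_{i % 5_000}" for i in range(10_000)],  # 5,000 distinct values
})

# Cardinality ratio per column: distinct values / number of rows.
ratio = df.nunique() / len(df)
print(ratio)
# city       0.002
# user_id    0.500
```

A ratio near 0.5, as in the `user_id` column, is the situation described in the question.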

https://stackoverflow.com/questions/33043222/features-with-high-cardinality-how-to-vectorize-them

But in my opinion this attribute should be discarded because it is not useful for predicting anything. Is this a false assumption? Is there a rule?

1 Answer


In your example, many of these 5,000 values must occur only once, so, as you say, you cannot fit them. However, it might be that a few values occur hundreds of times, or even more, and those values might be predictive. This is not uncommon with text data.

chrishmorris