I have a reasonably sized dataset (>50k rows), and I'm looking for the best way to make use of some of the categorical columns. For the purpose of this question, let's say one of the categorical columns is zipcode. The premise is that, after feature engineering, I'll pass this data to a random forest regressor in sklearn, which does not handle categorical columns natively.
Let's say I have 500 unique zipcodes. I could one-hot encode all of them, or pick the top 100 and one-hot encode those (mapping the rest to "OTHER", for instance), but both approaches add a lot of dimensionality, which I want to avoid. Here's the new idea I have, and I want to verify it with the community.
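(For reference, the top-100 variant I'm describing would look roughly like this; the DataFrame name train and the column name "zipcode" are placeholders:)

import pandas as pd

# Keep the 100 most frequent zipcodes, collapse everything else to "OTHER"
top100 = train["zipcode"].value_counts().nlargest(100).index
reduced = train["zipcode"].where(train["zipcode"].isin(top100), other="OTHER")

# One-hot encode the reduced column: 101 new columns instead of 500
dummies = pd.get_dummies(reduced, prefix="zip")
train = pd.concat([train.drop(columns="zipcode"), dummies], axis=1)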
Let's say that after I do the train-test split, I take the train set and compute the average of the real labels within each zipcode group. So, for instance, this raw data:
zipcode    label
zip10001   3
zip10001   2
zip10001   4
zip10002   1
zip10002   2
zip10010   7
after the transform becomes:
zipcode    label   zipcode_avg
zip10001   3       3
zip10001   2       3
zip10001   4       3
zip10002   1       1.5
zip10002   2       1.5
zip10010   7       7
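In pandas, this per-group averaging on the train set could be done with something like (a sketch, again assuming a DataFrame named train):

# Broadcast each zipcode's mean label back onto its rows
train["zipcode_avg"] = train.groupby("zipcode")["label"].transform("mean")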
I would also build a dictionary of these per-zipcode averages:
dzipavg = {
"zip10001": 3,
"zip10002": 1.5,
"zip10010": 7
}
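This dict could come straight from the same groupby, e.g.:

dzipavg = train.groupby("zipcode")["label"].mean().to_dict()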
Instead of one-hot encoding the zipcode column, I'll then simply drop it. For the test set, I would map the zipcodes through the dict, test["zipcode_avg"] = test["zipcode"].map(dzipavg), and drop the zipcode column there as well.
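Putting the test-set side together (note that .map() yields NaN for any zipcode that appears only in the test set, so as one possible choice I fall back to the global train mean for those):

# Fallback for zipcodes unseen during training, since .map() gives NaN there
global_avg = train["label"].mean()
test["zipcode_avg"] = test["zipcode"].map(dzipavg).fillna(global_avg)

# Drop the raw categorical column from both sets
train = train.drop(columns="zipcode")
test = test.drop(columns="zipcode")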
Do you think this is a good idea? Are there any consequences I haven't foreseen? I don't think there's any data leakage here, since all of the transformations are based on the training data alone.