I have a reasonably sized dataset (>50k rows), and I'm looking for the best way to make use of some of the categorical columns. For the purpose of this question, let's say one of the categorical columns is zipcode. The premise is that, after feature engineering, I'll pass this data to a random forest regressor in sklearn, which does not handle categorical columns natively.
Let's say I have 500 unique zipcodes. I could one-hot encode all of them, or pick the top 100 and one-hot encode those (mapping the rest to "OTHER", for instance), but both approaches add a lot of dimensionality, which I want to avoid. Here's the new idea I have, and I want to verify it with the community.
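(For reference, the top-100 variant I'm describing would look roughly like this; the DataFrame name train and the column name "zipcode" are placeholders:)

import pandas as pd

# Keep the 100 most frequent zipcodes, collapse everything else to "OTHER"
top100 = train["zipcode"].value_counts().nlargest(100).index
reduced = train["zipcode"].where(train["zipcode"].isin(top100), other="OTHER")

# One-hot encode the reduced column: 101 new columns instead of 500
dummies = pd.get_dummies(reduced, prefix="zip")
train = pd.concat([train.drop(columns="zipcode"), dummies], axis=1)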
Let's say that after I do the train-test split, I take the train set and compute the average of the real labels within each zipcode group. So, for instance, this raw data:
zipcode    label
zip10001   3
zip10001   2
zip10001   4
zip10002   1
zip10002   2
zip10010   7
after the transform becomes:
zipcode    label   zipcode_avg
zip10001   3       3
zip10001   2       3
zip10001   4       3
zip10002   1       1.5
zip10002   2       1.5
zip10010   7       7
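In pandas, this per-group averaging on the train set could be done with something like (a sketch, again assuming a DataFrame named train):

# Broadcast each zipcode's mean label back onto its rows
train["zipcode_avg"] = train.groupby("zipcode")["label"].transform("mean")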
I would also build a dictionary of these per-zipcode averages:
dzipavg = {
"zip10001": 3,
"zip10002": 1.5,
"zip10010": 7
}
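This dict could come straight from the same groupby, e.g.:

dzipavg = train.groupby("zipcode")["label"].mean().to_dict()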
Instead of one-hot encoding the zipcode column, I'll then simply drop it. For the test set, I would map the zipcodes through the dict, test["zipcode_avg"] = test["zipcode"].map(dzipavg), and drop the zipcode column there as well.
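Putting the test-set side together (note that .map() yields NaN for any zipcode that appears only in the test set, so as one possible choice I fall back to the global train mean for those):

# Fallback for zipcodes unseen during training, since .map() gives NaN there
global_avg = train["label"].mean()
test["zipcode_avg"] = test["zipcode"].map(dzipavg).fillna(global_avg)

# Drop the raw categorical column from both sets
train = train.drop(columns="zipcode")
test = test.drop(columns="zipcode")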
Do you think this is a good idea? Are there any consequences I haven't foreseen? I don't think there's any data leakage here, since all of the transformations are based on the training data alone.