
In Applied Predictive Modeling, Kuhn and Johnson write:

Finally, these trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (Loh and Shih, 1997; Carolin et al., 2007; Loh, 2010). Loh and Shih (1997) remarked that “The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all.”

Kuhn, Max; Johnson, Kjell (2013-05-17). Applied Predictive Modeling (Kindle Locations 5241-5247). Springer New York. Kindle Edition.

They go on to describe some research into building unbiased trees, for example Loh's GUIDE model.

Staying as strictly as possible within the CART framework, is there anything I can do to minimize this selection bias? For example, perhaps clustering/grouping high-cardinality predictors is one strategy, but to what degree should one do the grouping? If I have a predictor with 30 levels, should I group to 10 levels? 15? 5?
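One illustrative way to do such grouping, purely as a sketch and not a recommendation from the book, is to keep only a factor's k most frequent levels, lump the rest into an "Other" category, and compare a few values of k (e.g., 5, 10, 15) by cross-validated error. The data frame df, the variables x, y, and other_pred, and the choice k = 10 below are all assumptions:

```r
library(rpart)

## Keep a factor's k most frequent levels; lump everything else into "Other".
collapse_levels <- function(f, k = 10) {
  f <- as.factor(f)
  keep <- names(sort(table(f), decreasing = TRUE))[seq_len(min(k, nlevels(f)))]
  factor(ifelse(as.character(f) %in% keep, as.character(f), "Other"))
}

## Hypothetical data frame `df` with a 30-level factor `x` and response `y`.
df$x_grouped <- collapse_levels(df$x, k = 10)
fit <- rpart(y ~ x_grouped + other_pred, data = df)
```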

dal233
  • Keep in mind that CART is not just biased toward factors with many levels, but potentially toward continuous variables too if your sample size is large. Is there a particular reason you want to stay within the CART framework? In addition to GUIDE, conditional inference trees are another option to avoid selection bias. – dmartin Jan 06 '16 at 01:45
  • My impression is that there is more off-the-shelf code written for CART, and in addition I want to keep things simple to explain. – dal233 Jan 06 '16 at 19:13
  • When I said "off-the-shelf code written for CART" I also meant the whole ecosystem around CART, for example rpart.plot. – dal233 Jan 07 '16 at 22:02
  • See ?ctree and you'll find that the party package has many of the same features that rpart does. Missing data is handled via surrogate splits as well. – dmartin Jan 07 '16 at 22:29
  • I tried to use party with an rpart object, but found rpart.plot a lot easier to use for plotting and presentation purposes. – dal233 Jan 09 '16 at 00:30
  • You can somewhat mitigate this problem by using random forests with feature subsampling (a rough sketch follows these comments). – rinspy Feb 02 '18 at 10:00
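A rough sketch of the random-forest suggestion in the last comment, assuming a data frame df with response y; ntree = 500 and mtry = 2 are arbitrary placeholders, not tuned values:

```r
library(randomForest)

set.seed(1)
## With a small mtry, a high-cardinality predictor is only a candidate at a
## fraction of the splits, which dilutes (but does not remove) the selection bias.
rf <- randomForest(y ~ ., data = df, ntree = 500, mtry = 2)
varImpPlot(rf)
```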

1 Answer


Based on your comments, I'd go with a conditional inference framework. The code is readily available in R via the ctree function in the party package. It has unbiased variable selection, and while the algorithm that decides when and how to make splits differs from CART's, the logic is essentially the same. Another benefit outlined by the authors (see the paper here) is that you don't have to worry so much about pruning the tree to avoid overfitting: the algorithm takes care of that by using permutation tests to determine whether a split is "statistically significant" or not.
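A minimal sketch of that suggestion, assuming a data frame df with a response y (both are placeholder names, not from the answer):

```r
library(party)

## mincriterion is 1 - alpha for the permutation test at each node, so a higher
## value makes splitting more conservative; this stands in for post-hoc pruning.
ct <- ctree(y ~ ., data = df,
            controls = ctree_control(mincriterion = 0.95))
plot(ct)
```

Raising mincriterion (e.g., to 0.99) grows a smaller, more conservative tree, which is what replaces the manual pruning step you would otherwise do with rpart.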

dmartin