
In Applied Predictive Modeling, Kuhn and Johnson write:

Finally, these trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (Loh and Shih, 1997; Carolin et al., 2007; Loh, 2010). Loh and Shih (1997) remarked that “The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all.”

Kuhn, Max; Johnson, Kjell (2013-05-17). Applied Predictive Modeling (Kindle Locations 5241-5247). Springer New York. Kindle Edition.

They go on to describe some research into building unbiased trees, for example Loh's GUIDE model.

Staying as strictly as possible within the CART framework, is there anything I can do to minimize this selection bias? For example, perhaps clustering/grouping high-cardinality predictors is one strategy, but to what degree should one do the grouping? If I have a predictor with 30 levels, should I group to 10 levels? 15? 5?
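One illustrative way to do such grouping, purely as a sketch and not a recommendation from the book, is to keep only a factor's k most frequent levels, lump the rest into an "Other" category, and compare a few values of k (e.g., 5, 10, 15) by cross-validated error. The data frame df, the variables x, y, and other_pred, and the choice k = 10 below are all assumptions:

```r
library(rpart)

## Keep a factor's k most frequent levels; lump everything else into "Other".
collapse_levels <- function(f, k = 10) {
  f <- as.factor(f)
  keep <- names(sort(table(f), decreasing = TRUE))[seq_len(min(k, nlevels(f)))]
  factor(ifelse(as.character(f) %in% keep, as.character(f), "Other"))
}

## Hypothetical data frame `df` with a 30-level factor `x` and response `y`.
df$x_grouped <- collapse_levels(df$x, k = 10)
fit <- rpart(y ~ x_grouped + other_pred, data = df)
```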

dal233
  • Keep in mind that CART is not just biased toward factors with many levels, but potentially toward continuous variables too if your sample size is large. Is there a particular reason you want to stay within the CART framework? In addition to GUIDE, conditional inference trees are another option to avoid selection bias. – dmartin Jan 06 '16 at 01:45
  • My impression is that there is more off-the-shelf code written for CART, and in addition I want to keep things simple to explain. – dal233 Jan 06 '16 at 19:13
  • When I said "off-the-shelf code written for CART" I also meant the whole ecosystem around CART, for example rpart.plot. – dal233 Jan 07 '16 at 22:02
  • See ?ctree and you'll find that the party package has many of the same features that rpart does. Missing data is handled via surrogate splits as well. – dmartin Jan 07 '16 at 22:29
  • I tried to use party with an rpart object, but found rpart.plot a lot easier to use for plotting and presentation purposes. – dal233 Jan 09 '16 at 00:30
  • You can somewhat mitigate this problem by using random forests with feature subsampling (a rough sketch follows these comments). – rinspy Feb 02 '18 at 10:00
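A rough sketch of the random-forest suggestion in the last comment, assuming a data frame df with response y; ntree = 500 and mtry = 2 are arbitrary placeholders, not tuned values:

```r
library(randomForest)

set.seed(1)
## With a small mtry, a high-cardinality predictor is only a candidate at a
## fraction of the splits, which dilutes (but does not remove) the selection bias.
rf <- randomForest(y ~ ., data = df, ntree = 500, mtry = 2)
varImpPlot(rf)
```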

1 Answer


Based on your comments, I'd go with a conditional inference framework. The code is readily available in R via the ctree function in the party package. It has unbiased variable selection, and while the algorithm that decides when and how to make splits differs from CART's, the logic is essentially the same. Another benefit outlined by the authors (see the paper here) is that you don't have to worry so much about pruning the tree to avoid overfitting: the algorithm takes care of that by using permutation tests to determine whether a split is "statistically significant" or not.
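A minimal sketch of that suggestion, assuming a data frame df with a response y (both are placeholder names, not from the answer):

```r
library(party)

## mincriterion is 1 - alpha for the permutation test at each node, so a higher
## value makes splitting more conservative; this stands in for post-hoc pruning.
ct <- ctree(y ~ ., data = df,
            controls = ctree_control(mincriterion = 0.95))
plot(ct)
```

Raising mincriterion (e.g., to 0.99) grows a smaller, more conservative tree, which is what replaces the manual pruning step you would otherwise do with rpart.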

dmartin