9

It is known that when constructing a decision tree, we split the input variable exhaustively and find the 'best' split by statistical test approach or Impurity function approach.

My question is when we use a continuous variable as the input variable (only a few duplicated values), the number of possible splits could be very large, to find the 'best' split will be time-consuming. How would data scientist deal with it?

I have read some materials that people would do a clustering of levels of the input to limit the possible splits. (example). However, they don't explain how it is done. What do we base on to cluster an univariate variable? Are there any resources for more details or anyone can explain in details?

Thanks!

pe-perry
  • 842
  • There is not one algorithm to train a random forest but many. For example ID3, C4.5, CART, CHAID or MARS. The answer to your question heavily depends on the used algorithm... – MaxBenChrist Feb 16 '16 at 08:53
  • @MaxBenChrist Would you mind picking one to two of them, e.g. CART to explain how the input variable are clustered? Thanks! – pe-perry Feb 16 '16 at 10:09
  • The algorithms would split by bins/intervals and find the point which give the most greedy results. – SmallChess Feb 06 '18 at 10:26

1 Answers1

11

The common method is to check only certain bins as splitting point / threshold. I think this is what the author of the presentation you posted is referring to. Lets say you have a continuous input random variable $X$ with the 10 samples

[1,3,4,6,2,5,18,10,-3,-5]

Probably you do not check every value of $X$ from the 10 observed values as a splitting point. Instead you would for example calculate just check the 20%, 40%, 60%, 80% quantile from your data. So you order your data

[-5,-3,1,2,3,4,5,6,10,18]

and "cluster" your data into bins

[-5,-3],[1,2],[3,4],[5,6],[10,18]

So then you would only have to check -1,2.5,4.5, and 8 as possible splitting point (you linearly interpolate between the bins)

The following paper is comparing three rules on how to choose the splitting points to test. I think it is what you are searching.

@article{chickeringefficient, title={Efficient Determination of Dynamic Split Points in a Decision Tree}, author={Chickering, David Maxwell and Meek, Christopher and Rounthwaite, Robert} }