16

When it comes to decision trees, can the predicted value lie outside the range of the training data?

For example, if the target variable in my training data set ranges from 0 to 100, when I build my model and apply it to new data, can the predicted values be -5? Or 150?

My understanding of decision tree regression is that it is still a rule-based left/right progression, and that the leaves at the bottom of the tree can never contain a value outside the range seen in the training set, so it should never be able to predict one. Is that correct?

user3788557
  • For a similar question about gradient-boosted trees, see https://stats.stackexchange.com/questions/304962/is-is-possible-for-a-gradient-boosting-regression-to-predict-values-outside-of-t – Adrian Aug 08 '19 at 04:57

2 Answers

19

You are completely right: classical decision trees cannot predict values outside the historically observed range. They will not extrapolate.

The same applies to random forests: a random forest prediction is an average of individual tree predictions, and an average of values within a range cannot fall outside that range.
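A minimal sketch illustrating this, assuming the rpart and randomForest packages (the simulated data and seed are purely illustrative):

```r
library(rpart)          # classical CART regression trees
library(randomForest)

set.seed(1)
train <- data.frame(x = runif(200, 0, 10))
train$y <- 10 * train$x + rnorm(200)   # target spans roughly 0-100

new <- data.frame(x = c(-5, 15))       # far outside the training range of x

tree <- rpart(y ~ x, data = train)
rf   <- randomForest(y ~ x, data = train)

predict(tree, new)   # predictions stay within [min(train$y), max(train$y)]
predict(rf, new)     # averages of in-range tree predictions: also in range
```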

Theoretically, you sometimes see discussions of somewhat more elaborate architectures (botanies?), where the leaves of the tree don't give a single value but contain a simple regression, e.g., regressing the dependent variable on a particular numerical independent variable. Navigating through the tree then gives you a rule set telling you which numerical IV to regress the DV on in which case. In such a setup, this "bottom level" regression can extrapolate to yield values not yet observed.

However, I don't think standard machine learning libraries offer this somewhat more complex structure (I recently searched the CRAN Task Views for R), although there really shouldn't be anything complex about it. You might be able to implement your own tree containing regressions in the leaves.
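As the comments below point out, such tools do exist in R; here is a minimal sketch using partykit's lmtree(), which fits linear models in the leaves (the simulated data are made up, and the formula uses partykit's y ~ regressor | partitioning variable syntax):

```r
library(partykit)

set.seed(1)
train <- data.frame(x = runif(200, 0, 10),
                    z = factor(sample(c("a", "b"), 200, replace = TRUE)))
train$y <- ifelse(train$z == "a", 5, 10) * train$x + rnorm(200)

# y ~ x | z: regress y on x within each leaf, partition on z
fit <- lmtree(y ~ x | z, data = train)

# the per-leaf linear model extrapolates beyond the observed range of y
predict(fit,
        newdata = data.frame(x = 15, z = factor("b", levels = c("a", "b"))),
        type = "response")
```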

Stephan Kolassa
  • I also assume that this applies to models such as K-nearest neighbors? – user3788557 Jan 13 '16 at 13:25
  • Yes. KNN looks at the $k$ nearest neighbors of your new data point (in terms of some distance measure), then takes the average of their DVs as a prediction. As an average of observations, the new prediction can never be outside the observations' range (see the sketch after these comments). – Stephan Kolassa Jan 13 '16 at 13:35
  • I have read a little about mobForest, which does support leaf regressions in R: http://stats.stackexchange.com/questions/48475/mobforest-r-package – Soren Havelund Welling Jan 14 '16 at 07:20
  • @SorenHavelundWelling: that does sound interesting. Thanks for the pointer! – Stephan Kolassa Jan 14 '16 at 07:22
  • One of the first algorithms to provide linear regression models in the leaves of a tree was Quinlan's M5, an approximation of which is available in M5P() in Weka (interfaced in R through RWeka). An unbiased algorithm for the problem, called GUIDE, was first suggested by Loh. Binaries for his standalone package are on his website. Finally, our model-based (MOB) recursive partitioning algorithm encompasses various such models. It is available in the R package partykit: mob() is the generic tool and lmtree() and glmtree() are its adaptation to trees with (generalized) linear models in the leaves. – Achim Zeileis Jan 14 '16 at 23:50
  • Just a heads up that mobForest is back on CRAN: https://cran.r-project.org/web/packages/mobForest/index.html – mkt Aug 09 '19 at 13:57
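A minimal sketch of the KNN point from the comments above, assuming caret's knnreg() (the simulated data and choice of k are illustrative):

```r
library(caret)

set.seed(1)
train <- data.frame(x = runif(200, 0, 10))
train$y <- 10 * train$x + rnorm(200)   # y roughly in 0-100

fit <- knnreg(y ~ x, data = train, k = 5)

# each prediction is the mean of 5 observed y values, so it can never
# leave [min(train$y), max(train$y)], however extreme x is
predict(fit, newdata = data.frame(x = c(-100, 100)))
```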
8

Also check out Cubist, available through the caret package (method = "cubist"). It builds linear regressions in the terminal nodes and can extrapolate predictions above and below the range of response values in the training data. The terminal-node predictions can also be adjusted using nearest neighbors, supplied as a hyperparameter, so it has the potential to provide very accurate cross-validated predictions.
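A minimal sketch, assuming the Cubist package directly (caret's method = "cubist" wraps it); the simulated data and neighbor setting are illustrative:

```r
library(Cubist)   # caret's method = "cubist" wraps this package

set.seed(1)
train_x <- data.frame(x = runif(200, 0, 10))
train_y <- 10 * train_x$x + rnorm(200)   # response roughly in 0-100

fit <- cubist(x = train_x, y = train_y)

# the per-node linear models extrapolate; 'neighbors' (0-9) optionally
# blends in the closest training cases at prediction time
predict(fit, newdata = data.frame(x = 15), neighbors = 0)
```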

Scott Worland