
Generally, machine learning methods make few statistical assumptions. A key assumption they do make, however, is that the data are IID (independent and identically distributed).

What are the implications of violating the independence-of-labels assumption for sequential gradient boosting, say a gradient-boosted decision tree (GBDT)?

In ensembles that rely exclusively on bagging (e.g., random forests), it is easy to see what happens: the variance reduction from bagging comes from averaging base learners whose errors are (approximately) uncorrelated, like white noise; the noise has zero expected value and diversifies away in the average. If the independence assumption is violated in a random forest, the out-of-bag (OOB) error will be unreasonably low, because correlated copies of each OOB sample leak into the bootstrap samples, and it will be inconsistent with the validation error. A toy simulation illustrating this is sketched below.
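
For instance, here is a minimal simulation of that effect (the setup and all parameters, such as `n_groups` and `copies`, are my own illustrative choices, not from any real dataset): each "independent" case is replicated several times with its label, so labels are strongly dependent within a group, and the OOB error comes out optimistic relative to a group-held-out error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_groups, copies = 200, 5  # each "independent" case appears 5 times

# One truly independent draw per group.
X_g = rng.normal(size=(n_groups, 3))
y_g = X_g @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=n_groups)

# Replicate each case (with tiny feature jitter): labels are now
# strongly dependent within a group.
X = np.repeat(X_g, copies, axis=0) + rng.normal(scale=0.01, size=(n_groups * copies, 3))
y = np.repeat(y_g, copies)
groups = np.repeat(np.arange(n_groups), copies)

# Group-aware split: all copies of a case stay on one side.
train, test = next(
    GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0).split(X, y, groups)
)

rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X[train], y[train])

# OOB MSE looks low because in-bag duplicates "leak" each OOB point's
# label; the group-held-out MSE reveals the true generalization error.
oob_mse = mean_squared_error(y[train], rf.oob_prediction_)
val_mse = mean_squared_error(y[test], rf.predict(X[test]))
print(f"OOB MSE: {oob_mse:.3f}  vs  group-held-out MSE: {val_mse:.3f}")
```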

Is there a similar thought experiment one can run to draw conclusions about what happens in GBDTs when the labels are not independent?
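
The closest empirical probe I can think of (again just a sketch of my own, reusing the `X`, `y`, `train`, and `test` arrays from the snippet above) is to track the error of a GBDT round by round on the training set versus the group-held-out set, to see whether the sequential fitting keeps "improving" by chasing the shared within-group noise:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

gbdt = GradientBoostingRegressor(
    n_estimators=500, learning_rate=0.1, max_depth=3, random_state=0
)
gbdt.fit(X[train], y[train])

# staged_predict yields predictions after each boosting round.
stages = zip(gbdt.staged_predict(X[train]), gbdt.staged_predict(X[test]))
for i, (p_tr, p_te) in enumerate(stages):
    if (i + 1) % 100 == 0:
        print(
            f"round {i + 1:4d}  "
            f"train MSE {mean_squared_error(y[train], p_tr):.3f}  "
            f"held-out MSE {mean_squared_error(y[test], p_te):.3f}"
        )
```

But this only shows the symptom; I am asking whether there is a clean argument, analogous to the bagging one, for why the sequential residual fitting behaves this way.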

Jonathan
