While what counts as "obvious" is a matter of perspective, I like to think of gradient boosting in the following manner:
Through the GBM we are learning "corrections", i.e. acquiring more experience. With every repetition of the modelling task (i.e. every iteration) we learn how to predict the mistakes we have made up to that point. When we then use all of our "experiences" (base learners) together, we get the best possible insight into the task at hand. We learn our model gradually.
We can view this mathematically as an ever-diminishing error obtained through a slightly modified backfitting algorithm: boosting can be presented as a generalised additive model (GAM) (see Hastie et al. 2009, Elements of Statistical Learning, Chapt. 10.2 "Boosting Fits an Additive Model" for more details).
Therefore we can say that in the $(J+1)$-th iteration of the algorithm we model the quantity $y^* = y - \sum_{j=1}^{J} \hat{f}_j(X)$, i.e. our error up to the $J$-th iteration; here $y$ is our data at hand and $\hat{f}_j$ is the base learner we fitted during the $j$-th iteration. In every iteration we therefore use the structure of the residuals (our errors) to update our model; how much of that structure we incorporate depends on our learning rate.
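To make this concrete, here is a minimal sketch of that residual-fitting loop for squared-error loss (where the residual is exactly the negative gradient), using shallow regression trees from scikit-learn as base learners. The toy data, tree depth, number of iterations and learning rate are placeholder choices of mine, not anything prescribed by the algorithm itself:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy data (placeholder): a noisy sine curve
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

n_iter = 100          # number of boosting iterations J
learning_rate = 0.1   # how much of each "correction" we incorporate

# Start from f_1 = E{y}; starting from 0 works as well (see the minor points below)
pred = np.full_like(y, y.mean())
base_learners = []

for j in range(n_iter):
    y_star = y - pred                          # y* = y - sum_{j<=J} f_j(X)
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, y_star)                        # learn the structure of our errors
    pred += learning_rate * tree.predict(X)    # incorporate a fraction of that structure
    base_learners.append(tree)

def predict(X_new):
    """Use all our accumulated "experiences" (base learners) together."""
    out = np.full(len(X_new), y.mean())
    for tree in base_learners:
        out += learning_rate * tree.predict(X_new)
    return out
```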
Minor points:
- we can assume that $\hat{f}_1 = 0$ or $\hat{f}_1 = E\{y\}$, since in either case the difference will be nominal after the first few dozen iterations.
- if the new $y$, $y^*$, is completely unstructured and there is nothing learnable, we will not update our fit in any meaningful way. This is in direct analogy with our view of learning a model gradually: if the residuals carry new information (e.g. we systematically over-estimate on a particular range of the explanatory variable $X_p$), we increase our knowledge of the matter; if they do not, our fit stays essentially where it is. Both points are easy to check numerically, as in the sketch below. :)
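The following sketch (again with placeholder data, tree depth and learning rate of my own choosing) compares the fits obtained from the two initialisations and shows that, when the target is pure noise, the boosted fit does no better out of sample than the constant mean:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_iter=200, lr=0.1, init=None):
    """Plain least-squares boosting with depth-2 trees (illustrative only)."""
    start = y.mean() if init is None else init
    pred = np.full(len(y), start, dtype=float)
    trees = []
    for _ in range(n_iter):
        tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)
        pred += lr * tree.predict(X)
        trees.append(tree)
    return lambda X_new: start + lr * sum(t.predict(X_new) for t in trees)

rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(500, 1))

# Point 1: starting from 0 vs. E{y} makes little difference after enough iterations
y = np.sin(X[:, 0]) + 3 + rng.normal(scale=0.3, size=500)
f_zero, f_mean = boost(X, y, init=0.0), boost(X, y)
print(np.max(np.abs(f_zero(X) - f_mean(X))))      # close relative to the scale of y

# Point 2: a pure-noise target has no learnable structure, so out of sample
# the boosted fit does no better than the constant mean
y_noise = rng.normal(size=500)
f_noise = boost(X, y_noise)
X_test, y_test = rng.uniform(0, 6, size=(500, 1)), rng.normal(size=500)
print(np.mean((y_test - f_noise(X_test)) ** 2),   # boosted fit
      np.mean((y_test - y_noise.mean()) ** 2))    # constant baseline
```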
I would suggest looking into Hastie et al. 2009, Elements of Statistical Learning, Chapt. 9 "Additive Models, Trees, and Related Methods", because it shows how an additive model works (Sects. 9.1 and 9.2 should be enough). After that, the extension to GBMs is straightforward.