While what counts as "obvious" is a matter of perspective, I like to think of gradient boosting in the following manner:
Through the GBM we are learning "corrections", i.e. acquiring more experience. With every repetition of the modelling task (i.e. every iteration) we learn how to predict the mistakes we have made up to that point. When we then use all of our "experiences" (base learners) together, we get the best possible insight into the task at hand. We learn our model gradually.
We can view this mathematically as an ever-diminishing error obtained through a slightly modified backfitting algorithm: boosting can be presented as a generalised additive model (GAM) (see Hastie et al. 2009, Elements of Statistical Learning, Chapt. 10.2 "Boosting Fits an Additive Model" for more details).
Therefore we can say that in the $(J+1)$-th iteration of the algorithm we model the quantity $y^* = y - \sum_{j=1}^{J} \hat{f}_j(X)$, i.e. our error up to the $J$-th iteration; here $y$ is our data at hand and $\hat{f}_j$ is the base learner we fitted during the $j$-th iteration. In every iteration we therefore use the structure of the residuals (our errors) to update our model; how much of that structure we incorporate depends on our learning rate.
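To make this concrete, here is a minimal sketch of that residual-fitting loop for squared-error loss (where the residual is exactly the negative gradient), using shallow regression trees from scikit-learn as base learners. The toy data, tree depth, number of iterations and learning rate are placeholder choices of mine, not anything prescribed by the algorithm itself:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy data (placeholder): a noisy sine curve
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

n_iter = 100          # number of boosting iterations J
learning_rate = 0.1   # how much of each "correction" we incorporate

# Start from f_1 = E{y}; starting from 0 works as well (see the minor points below)
pred = np.full_like(y, y.mean())
base_learners = []

for j in range(n_iter):
    y_star = y - pred                          # y* = y - sum_{j<=J} f_j(X)
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, y_star)                        # learn the structure of our errors
    pred += learning_rate * tree.predict(X)    # incorporate a fraction of that structure
    base_learners.append(tree)

def predict(X_new):
    """Use all our accumulated "experiences" (base learners) together."""
    out = np.full(len(X_new), y.mean())
    for tree in base_learners:
        out += learning_rate * tree.predict(X_new)
    return out
```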
Minor points:
- we can assume that $\hat{f}_1 = 0$ or $\hat{f}_1 = E\{y\}$, since in either case the difference will be nominal after the first few dozen iterations.
- if the new $y$, $y^*$, is completely unstructured and there is nothing learnable, we will not update our fit in any meaningful way. This is in direct analogy with our view of learning a model gradually: if the residuals carry new information (e.g. we systematically over-estimate on a particular range of the explanatory variable $X_p$), we increase our knowledge of the matter; if they do not, our fit stays essentially where it is. Both points are easy to check numerically, as in the sketch below. :)
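The following sketch (again with placeholder data, tree depth and learning rate of my own choosing) compares the fits obtained from the two initialisations and shows that, when the target is pure noise, the boosted fit does no better out of sample than the constant mean:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_iter=200, lr=0.1, init=None):
    """Plain least-squares boosting with depth-2 trees (illustrative only)."""
    start = y.mean() if init is None else init
    pred = np.full(len(y), start, dtype=float)
    trees = []
    for _ in range(n_iter):
        tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)
        pred += lr * tree.predict(X)
        trees.append(tree)
    return lambda X_new: start + lr * sum(t.predict(X_new) for t in trees)

rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(500, 1))

# Point 1: starting from 0 vs. E{y} makes little difference after enough iterations
y = np.sin(X[:, 0]) + 3 + rng.normal(scale=0.3, size=500)
f_zero, f_mean = boost(X, y, init=0.0), boost(X, y)
print(np.max(np.abs(f_zero(X) - f_mean(X))))      # close relative to the scale of y

# Point 2: a pure-noise target has no learnable structure, so out of sample
# the boosted fit does no better than the constant mean
y_noise = rng.normal(size=500)
f_noise = boost(X, y_noise)
X_test, y_test = rng.uniform(0, 6, size=(500, 1)), rng.normal(size=500)
print(np.mean((y_test - f_noise(X_test)) ** 2),   # boosted fit
      np.mean((y_test - y_noise.mean()) ** 2))    # constant baseline
```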
I would suggest looking into Hastie et al. 2009, Elements of Statistical Learning, Chapt. 9 "Additive Models, Trees, and Related Methods", because it shows how an additive model works (Sects. 9.1 and 9.2 should be enough). After that, the extension to GBMs is straightforward.