
For the input training set $\{(x_i, y_i)\}_{i=1}^{n}$, if the loss function is $L(y, f(x))$, then we initialize the model $F_0$ by finding the constant $\gamma$ that minimizes the total loss: $$ F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L(y_i, \gamma) $$

which means that for every $x$ we define a model that always predicts the same constant value $\gamma$.
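For example, with squared-error loss $L(y, \gamma) = \frac{1}{2}(y - \gamma)^2$, setting the derivative of the sum to zero gives

$$\frac{\partial}{\partial \gamma} \sum_{i=1}^{n} \frac{1}{2}(y_i - \gamma)^2 = -\sum_{i=1}^{n}(y_i - \gamma) = 0 \quad \Longrightarrow \quad \gamma = \frac{1}{n}\sum_{i=1}^{n} y_i,$$

i.e. the optimal constant is just the sample mean of the response.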

So, in the 1st iteration, how are we able to calculate the derivative of the loss function with respect to the previous model's output (which is the constant $\gamma$), given that derivatives with respect to a constant are not defined?

Can anyone explain what I'm understanding wrong here?

1 Answer


I think your understanding is mostly fine. :)

I alluded to what happens during the first boosting iteration in the earlier question: Why we fit xᵢ vs errorᵢ in Gradient Boosting. The first iteration is considered to start either from $0$ or from the mean value of the response variable; some packages (e.g. LightGBM) even go as far as to provide a `boost_from_average` option. That being said, the derivatives themselves are not taken with respect to a constant, because they are defined through the residuals of the loss function at a particular point. Simply put, the gradient for the $i$-th point at the $m$-th iteration is $g_{im} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]$, i.e. we differentiate the loss $L$ with respect to $f$, which is obviously not constant (even during the first iteration).
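Here is a minimal numpy sketch of that first iteration, assuming squared-error loss (the toy data and variable names are mine, purely for illustration):

```python
import numpy as np

# Toy response; the features are irrelevant for the initial constant model.
y = np.array([1.0, 2.0, 4.0, 7.0])

# F_0 is the constant that minimizes squared-error loss: the mean.
F0 = y.mean()                      # 3.5
f = np.full_like(y, F0)            # current predictions f(x_i), all equal to F0

# L(y_i, f(x_i)) = (y_i - f(x_i))^2 / 2.
# The derivative is taken with respect to the prediction f(x_i) -- a
# variable -- and only afterwards *evaluated* at the constant F0.
g = f - y                          # dL/df(x_i), evaluated at f(x_i) = F0
pseudo_residuals = -g              # the targets the next tree is fit to

print(pseudo_residuals)            # [-2.5 -1.5  0.5  3.5]
```

The point is the order of operations: differentiate first (wrt. the function value), evaluate at the constant second, so nothing ill-defined ever happens.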

I appreciate that when Hastie et al. (2009) (Sect. 10.9, "Boosting Trees") say: "A constant $\gamma_j$ is assigned to each such region and the predictive rule is $x \in R_j \rightarrow f(x) = \gamma_j$", where the $R_j$ are disjoint regions of the space of all joint predictor variable values, it might seem that we have a "constant loss", but that is not the case. "Constant" here refers to the fact that trees perform recursive space partitioning and assign a constant value to each of their leaf nodes (i.e. the predictions from a single tree).
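To make that concrete, here is a small sketch using scikit-learn (my choice of library for illustration; none of this is from Hastie et al.): a single shallow tree produces only a few distinct prediction values, one constant $\gamma_j$ per leaf region $R_j$.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(50, 1)), axis=0)
y = np.sin(X).ravel()

# A depth-2 tree partitions the input space into at most 4 disjoint
# regions R_j and predicts a single constant gamma_j inside each one.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
preds = tree.predict(X)

# Only a handful of distinct prediction values appear -- the leaf
# constants gamma_j -- even though y itself varies smoothly.
print(np.unique(preds))
```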

usεr11852
  • Okay! So, it would have been constant ($\gamma$) if the function $F_0(x)$ were for predicting the output y; but since $F_0(x)$ is for predicting the errors, and not the output, $F_0(x)$ will not give a constant. Right? – Saurabh Verma Sep 14 '19 at 11:46
  • Yes, you are correct, $F_0(x)$ is not going to be a constant. – usεr11852 Sep 14 '19 at 12:19