
From the scikit-learn documentation page for GBR (here), the `score` method uses $R^2$ as its evaluation metric. As far as I know, $R^2$ is mainly used for linear regression. Why is it used even for a gradient boosting regressor, which should be a non-linear model?

Moreover, I've used the model to fit some price fluctuations. Although the resulting $R^2$ can be negative, the MAPE is actually below 10%, and the graph shows fairly accurate predictions of the data. The latter two pieces of evidence suggest my model is working. I wonder whether $R^2$ is simply not a good fit for GBR, or whether there are other reasons for these contradictory results.

Thanks.

P.S. This can be considered a sister post to this.

  • The practical reason is that scikit-learn uses $R^2$ as the default score function for all regressors. That doesn't make it the best function for the job. – Tim Dec 31 '22 at 18:36
  • @Tim, thanks for the experienced comment :) – Student Jan 01 '23 at 03:32

1 Answer


Depending on how you define $R^2$, maximizing it can be equivalent to minimizing the sum of squared residuals. While this minimization is famous from OLS linear regression, nothing about it is special to linear models, and it is fine to minimize the sum of squared residuals in other regression models. It might even be fair to say it is common to minimize the sum of squared residuals for regression problems in general.

A typical way to define $R^2$, and the one your software uses, is as follows.

$$ R^2=1-\dfrac{ \sum_{i=1}^n\left( y_i-\hat y_i \right)^2 }{ \sum_{i=1}^n\left( y_i-\bar y \right)^2 } $$

Note that maximizing this value is equivalent to minimizing the numerator, which is exactly the sum of squared residuals. (The denominator is a constant that depends on the data, not the model. You get the same denominator whether you fit a linear model, a gradient boosting model, or a deep learning neural network.)
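
For concreteness, here is a minimal Python sketch (with toy numbers invented for illustration) verifying that $R^2$ computed directly from the definition above matches scikit-learn's `r2_score`, which is what the `score` method of its regressors reports:

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy data, invented purely for illustration.
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.3, 2.9, 6.4, 4.6])

# R^2 from the definition: 1 - SS_res / SS_tot.
ss_res = np.sum((y_true - y_pred) ** 2)          # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # fixed by the data, not the model
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                  # same value both ways
print(r2_score(y_true, y_pred))
```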

An advantage of $R^2$ is that it gives context to the sum of squared errors by comparing it to the performance of a baseline “must beat” model (despite my qualms with the exact software function, which I discuss in this link). A drawback of $R^2$ is that it lacks the natural units given by the squared errors, which can be interpreted. In your case, the scikit-learn developers evidently value the former in their default model evaluation.
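
To make the baseline comparison concrete, here is a minimal sketch along the lines of the comments below. All data and parameters are invented for illustration: a "price-like" target with a large mean and small fluctuations, which is the kind of setting where MAPE can look good (even for the trivial model that always predicts $\bar y$) while $R^2$ stays unimpressive. Exact numbers will vary with the data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "price" data: large mean, weak signal, modest noise.
X = rng.normal(size=(500, 3))
y = 100 + 0.5 * X[:, 0] + rng.normal(scale=2.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)
baseline = np.full_like(y_test, y_train.mean())  # always predict the training mean

print("model    R^2 :", r2_score(y_test, pred))
print("model    MAPE:", mean_absolute_percentage_error(y_test, pred))
print("baseline MAPE:", mean_absolute_percentage_error(y_test, baseline))
```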

  • Thanks, and how about the contradiction that 'the resulting $R^2$ can be negative, yet the MAPE is actually below 10%, and the graph shows pretty accurate predictions of the data'? – Student Dec 31 '22 at 02:36
  • @Student What kind of MAPE do you get if you predict $\bar y$ every time? – Dave Dec 31 '22 at 02:51
  • So what you suggest is that whenever using MAPE, it's advisable to always compare it to the MAPE obtained by predicting only $\bar y$? – Student Dec 31 '22 at 03:48
  • It always makes sense to compare to some kind of baseline model. @Student – Dave Dec 31 '22 at 04:13