
What is the appropriate statistic to measure goodness-of-fit for a Boosted Regression Tree (or Gradient Boosting Regression) model with a continuous response? How can I calculate the coefficient of determination (R²) on the train and test data? If I calculate R² as below, how can I calculate the intercept-only model?

$R^{2} = 1 - L_{1}/L_{0}$, where $L_{1}$ and $L_{0}$ are the log likelihoods of the model under consideration and an intercept-only model, respectively (see http://www.stata-journal.com/sjpdf.html?articlenum=st0087).

I'm using the package "dismo" in R, so if anyone has a solution in R that would be great.

Example with binary data just to show the procedure:

library(dismo)

data(Anguilla_train)

angaus.tc5.lr005 <- gbm.step(data = Anguilla_train,
                             gbm.x = 3:13,
                             gbm.y = 2,
                             family = "bernoulli",
                             tree.complexity = 5,
                             learning.rate = 0.005,
                             bag.fraction = 0.5,
                             keep.fold.models = TRUE,
                             keep.fold.vector = TRUE,
                             keep.fold.fit = TRUE)

Thank you in advance!

1 Answer


Comment: Goodness of fit is easy to get wrong. Getting it right usually involves the statistician asking variations of "but why do you want it?" about twenty times, until enough understanding of the intended use is extracted from the asker to nail down what the measure of goodness should be.

Your questions:

  1. What is the appropriate statistic to measure goodness-of-fit for GBMs with a continuous response?
  2. How can I calculate R² on the train and test data?
  3. If I calculate the [pseudo-]R² from the Stata reference, how can I calculate the intercept-only model?

Analysis: When I run this code (a very slight variation on yours):

require(pacman)
p_load(dismo,gbm)
data(Anguilla_train)
angaus.tc5.lr005 <- dismo::gbm.step(data=Anguilla_train, 
                                 gbm.x = 3:13, 
                                 gbm.y = 2, 
                                 family = "bernoulli", 
                                 tree.complexity = 5, 
                                 learning.rate = 0.005, 
                                 bag.fraction = 0.5 , 
                                 keep.fold.models = TRUE, 
                                 keep.fold.vector = TRUE, 
                                 keep.fold.fit = TRUE)

it outputs several decent measures of goodness of fit. Any one of them could work; someone who understands the business need, technical need, and application can help you pick one. The "ModelMetrics" library also has plenty of options that work here (see the sketch after the output below).

fitting final gbm model with a fixed number of 1250 trees for Angaus

mean total deviance = 1.006
mean residual deviance = 0.455

estimated cv deviance = 0.687 ; se = 0.023

training data correlation = 0.785
cv correlation = 0.574 ; se = 0.021

training data AUC score = 0.961
cv AUC score = 0.87 ; se = 0.011

elapsed time - 0.31 minutes
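
Two quick follow-ups on that output. First, the printed deviances already give you a deviance-based pseudo-R² of the kind in your Stata reference, because the "mean total deviance" plays the role of the intercept-only model. Second, here is a sketch of pulling a few named metrics from ModelMetrics; it assumes that package is installed and uses the usual dismo idiom of predicting with the number of trees gbm.step selected (n.trees = ...$gbm.call$best.trees).

# deviance-based pseudo-R^2 straight from the printed output:
# 1 - residual deviance / total deviance, where the total (null) deviance
# is the intercept-only benchmark
1 - 0.455 / 1.006   # roughly 0.55

p_load(ModelMetrics)

# predicted probabilities on the training data
pred <- predict(angaus.tc5.lr005, Anguilla_train,
                n.trees = angaus.tc5.lr005$gbm.call$best.trees,
                type = "response")
obs <- Anguilla_train[, 2]   # observed 0/1 Angaus response

ModelMetrics::auc(obs, pred)      # area under the ROC curve
ModelMetrics::logLoss(obs, pred)  # Bernoulli log loss
ModelMetrics::rmse(obs, pred)     # root mean squared error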

R-squared relates the residual variance to the raw variance of the response; it is one minus the fraction of variance unexplained (FVU): $$R^{2}=1-\frac{SS_{res}}{SS_{tot}} = 1 - FVU.$$

This is how you calculate it for a gbm with a continuous response on the training data; for test performance you would predict on the test data and apply the same formula.
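
Note that new_model below is not defined anywhere above. I am assuming it is a BRT fit to a continuous column of Anguilla_train; the call below (predicting column 3, a continuous variable, from the remaining predictors with a gaussian loss) is an illustrative guess at such a fit, not the original model:

# assumed continuous-response fit; the original new_model was not shown
new_model <- dismo::gbm.step(data = Anguilla_train,
                             gbm.x = 4:13,
                             gbm.y = 3,
                             family = "gaussian",
                             tree.complexity = 5,
                             learning.rate = 0.005,
                             bag.fraction = 0.5)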

We compute it for the continuous version using this code:

y_new <- Anguilla_train[, 3]

# residual variance of the predictions vs. raw variance of the response
num <- var(predict(new_model, Anguilla_train,
                   n.trees = new_model$gbm.call$best.trees) - y_new)
den <- var(y_new)

R2 <- 1 - (num / den)
print(R2)

The results were:

> R2 <- 1-(num/den)
> print(R2)
[1] 0.9657491

As I understand it, the intercept-only model corresponds to the raw variance: the best constant predictor is the mean, and the variance of the response around its mean is exactly the denominator used above.
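
You can check that in R with an intercept-only linear model (plain base R, nothing gbm-specific): its single fitted coefficient is the mean of the response, and its residual variance is the raw variance used as the denominator above.

# the intercept-only (null) model fits one constant: the mean of the response
null_model <- lm(y_new ~ 1)

coef(null_model)            # the intercept is mean(y_new)
var(residuals(null_model))  # identical to var(y_new), the denominator above
var(y_new)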

EngrStudent