
What is the appropriate statistic to measure goodness-of-fit for a Boosted Regression Tree (or Gradient Boosting Regression) model with a continuous response? How can I calculate the coefficient of determination (R²) on the train and test data? If I calculate R² as below, how can I calculate the intercept-only model?

$R^{2} = 1 - L_{1}/L_{0}$, where $L_{1}$ and $L_{0}$ are the log likelihoods of the model under consideration and an intercept-only model, respectively (see http://www.stata-journal.com/sjpdf.html?articlenum=st0087).

I'm using the package "dismo" in R, so if anyone has a solution in R that would be great.

Example with binary data just to show the procedure:

library(dismo)

data(Anguilla_train)

angaus.tc5.lr005 <- gbm.step(data = Anguilla_train,
                             gbm.x = 3:13,
                             gbm.y = 2,
                             family = "bernoulli",
                             tree.complexity = 5,
                             learning.rate = 0.005,
                             bag.fraction = 0.5,
                             keep.fold.models = TRUE,
                             keep.fold.vector = TRUE,
                             keep.fold.fit = TRUE)

Thank you in advance!

1 Answer


Comment: Goodness of fit is easy to get wrong. Getting it right usually involves the statistician asking variations of "but why do you want it?" about twenty times, until enough understanding of the intended use is extracted from the asker to nail down what the measure of goodness should be.

Your questions:

  1. What is the appropriate statistic to measure goodness-of-fit for GBMs with a continuous response?
  2. How can I calculate R² on the train and test data?
  3. If I calculate the [pseudo-]R² from the Stata reference, how can I calculate the intercept-only model?

Analysis: When I run this code (a very slight variation on yours):

require(pacman)
p_load(dismo,gbm)
data(Anguilla_train)
angaus.tc5.lr005 <- dismo::gbm.step(data=Anguilla_train, 
                                 gbm.x = 3:13, 
                                 gbm.y = 2, 
                                 family = "bernoulli", 
                                 tree.complexity = 5, 
                                 learning.rate = 0.005, 
                                 bag.fraction = 0.5 , 
                                 keep.fold.models = TRUE, 
                                 keep.fold.vector = TRUE, 
                                 keep.fold.fit = TRUE)

it outputs several decent measures of goodness of fit. Any one of them could work; someone who understands the business need, technical need, and application can help you pick one. The "ModelMetrics" library also has plenty of options that work here (see the sketch after the output below).

fitting final gbm model with a fixed number of 1250 trees for Angaus

mean total deviance = 1.006
mean residual deviance = 0.455

estimated cv deviance = 0.687 ; se = 0.023

training data correlation = 0.785
cv correlation = 0.574 ; se = 0.021

training data AUC score = 0.961
cv AUC score = 0.87 ; se = 0.011

elapsed time - 0.31 minutes
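
Two quick follow-ups on that output. First, the printed deviances already give you a deviance-based pseudo-R² of the kind in your Stata reference, because the "mean total deviance" plays the role of the intercept-only model. Second, here is a sketch of pulling a few named metrics from ModelMetrics; it assumes that package is installed and uses the usual dismo idiom of predicting with the number of trees gbm.step selected (n.trees = ...$gbm.call$best.trees).

# deviance-based pseudo-R^2 straight from the printed output:
# 1 - residual deviance / total deviance, where the total (null) deviance
# is the intercept-only benchmark
1 - 0.455 / 1.006   # roughly 0.55

p_load(ModelMetrics)

# predicted probabilities on the training data
pred <- predict(angaus.tc5.lr005, Anguilla_train,
                n.trees = angaus.tc5.lr005$gbm.call$best.trees,
                type = "response")
obs <- Anguilla_train[, 2]   # observed 0/1 Angaus response

ModelMetrics::auc(obs, pred)      # area under the ROC curve
ModelMetrics::logLoss(obs, pred)  # Bernoulli log loss
ModelMetrics::rmse(obs, pred)     # root mean squared error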

R-squared relates the residual variance to the raw variance of the response; it is one minus the fraction of variance unexplained (FVU): $$R^{2}=1-\frac{SS_{res}}{SS_{tot}} = 1 - FVU.$$

This is how you calculate it for a gbm with a continuous response on the training data; for test performance you would predict on the test data and apply the same formula.
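
Note that new_model below is not defined anywhere above. I am assuming it is a BRT fit to a continuous column of Anguilla_train; the call below (predicting column 3, a continuous variable, from the remaining predictors with a gaussian loss) is an illustrative guess at such a fit, not the original model:

# assumed continuous-response fit; the original new_model was not shown
new_model <- dismo::gbm.step(data = Anguilla_train,
                             gbm.x = 4:13,
                             gbm.y = 3,
                             family = "gaussian",
                             tree.complexity = 5,
                             learning.rate = 0.005,
                             bag.fraction = 0.5)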

We compute it for the continuous version using this code:

y_new <- Anguilla_train[, 3]

# residual variance of the predictions vs. raw variance of the response
num <- var(predict(new_model, Anguilla_train,
                   n.trees = new_model$gbm.call$best.trees) - y_new)
den <- var(y_new)

R2 <- 1 - (num / den)
print(R2)

The results were:

> R2 <- 1-(num/den)
> print(R2)
[1] 0.9657491

As I understand it, the intercept-only model corresponds to the raw variance: the best constant predictor is the mean, and the variance of the response around its mean is exactly the denominator used above.
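
You can check that in R with an intercept-only linear model (plain base R, nothing gbm-specific): its single fitted coefficient is the mean of the response, and its residual variance is the raw variance used as the denominator above.

# the intercept-only (null) model fits one constant: the mean of the response
null_model <- lm(y_new ~ 1)

coef(null_model)            # the intercept is mean(y_new)
var(residuals(null_model))  # identical to var(y_new), the denominator above
var(y_new)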

EngrStudent