I cannot put my company's data online, but I can provide a reproducible example here.
We're modelling insurance claim frequency with a Poisson distribution, using exposure as an offset.
In this example, we want to model the number of claims, Claims ($y_i$), with exposure Holders ($e_i$).
In a traditional GLM, we can directly model $y_i$ and put $e_i$ in the offset term. This option is not available in xgboost, so the alternative is to model the rate $\frac{y_i}{e_i}$ and put $e_i$ in the weight term (reference).
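To make the two parameterizations concrete, here is a minimal sketch on the MASS Insurance data (the object names X, glm.fit, dtrain.rate and bst.rate are only for illustration, and model.matrix stands in for whatever encoding you prefer):

library(MASS)
library(xgboost)
data(Insurance)
# illustrative design matrix for the three predictors (drop the intercept column)
X <- model.matrix(~ District + Group + Age, data = Insurance)[, -1]
# traditional GLM: model the counts directly, exposure enters as a log offset
glm.fit <- glm(Claims ~ District + Group + Age + offset(log(Holders)),
               family = poisson, data = Insurance)
# xgboost alternative: model the rate Claims/Holders and pass Holders as case weights
dtrain.rate <- xgb.DMatrix(X,
                           label  = Insurance$Claims / Insurance$Holders,
                           weight = Insurance$Holders)
bst.rate <- xgboost(data = dtrain.rate, objective = 'count:poisson', nrounds = 50)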
When I do that with a lot of iterations, the results are coherent ($\sum y_i = \sum \hat{y_i}$), but this is not the case when nrounds = 5. I think the equation $\sum y_i = \sum \hat{y_i}$ should be satisfied after the very first iteration.
The following code is an extreme example for the sake of reproducibility. In my real case, I performed cross-validation on the training set (optimizing MAE) and obtained nrounds = 1200, with training MAE close to testing MAE; a rough sketch of that tuning step is shown below. I then re-ran xgboost on the whole data set with 1200 iterations and found that $\sum y_i \ne \sum \hat{y_i}$ by a large margin. This doesn't make sense, or am I missing something?
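For reference, the tuning step in my real case looked roughly like the sketch below (it reuses the xgbMatrix built in the code further down; the fold count, nrounds ceiling and early-stopping window are illustrative, not the exact settings):

cv <- xgb.cv(params = list(objective = 'count:poisson', eval_metric = 'mae'),
             data = xgbMatrix, nrounds = 1500, nfold = 5,
             early_stopping_rounds = 50, verbose = FALSE)
best.n <- cv$best_iteration  # on my real data this came out around 1200
bst.full <- xgboost(data = xgbMatrix, objective = 'count:poisson', nrounds = best.n)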
So my questions are:
- Am I correctly specifying the parameters for a Poisson regression with an offset in xgboost?
- Why is the bias so high in the first iterations?
- Why, after tuning nrounds using xgb.cv, do I still have a high bias?
Here is a graph plotting the ratio $\frac{\sum \hat{y_i}}{\sum y_i}$ by nrounds (a sketch for recomputing this ratio is given after the code below):
Code edited following the comment from @JonnyLomond:
library(MASS)
library(caret)
library(xgboost)
library(dplyr)
#-------- load data --------#
data(Insurance)
#-------- data preparation --------#
# small adjustments
Insurance$rate = with(Insurance, Claims / Holders)
temp <- dplyr::select(Insurance, District, Group, Age, rate)
temp2 = dummyVars(rate ~ ., data = temp, fullRank = TRUE) %>% predict(temp)
# create xgb matrix
xgbMatrix <- xgb.DMatrix(as.matrix(temp2),
                         label = Insurance$Claims)
# exposure enters as an offset on the log scale via base_margin
setinfo(xgbMatrix, "base_margin", log(Insurance$Holders))
#-------------------------------------------#
# First model with small nround
#-------------------------------------------#
bst.1 = xgboost(data = xgbMatrix,
                objective = 'count:poisson',
                nrounds = 5)
pred.1 = predict(bst.1, xgbMatrix)
sum(Insurance$Claims) #3151
sum(pred.1) #12650.8 fails
#-------------------------------------------#
# Second model with more iterations
#-------------------------------------------#
bst.2 = xgboost(data = xgbMatrix,
                objective = 'count:poisson',
                nrounds = 100)
pred.2 = predict(bst.2, xgbMatrix)
sum(Insurance$Claims) #3151
sum(pred.2) #same
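To recompute the ratio shown in the graph above, here is a minimal sketch using the 100-round model (ntreelimit caps the prediction at the first k trees):

#-------------------------------------------#
# Ratio sum(pred)/sum(obs) by number of rounds (illustrative)
#-------------------------------------------#
ratio <- sapply(1:100, function(k) {
  pred.k <- predict(bst.2, xgbMatrix, ntreelimit = k)
  sum(pred.k) / sum(Insurance$Claims)
})
plot(1:100, ratio, type = 'l', xlab = 'nrounds',
     ylab = 'sum(predicted) / sum(observed)')
abline(h = 1, lty = 2)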

Comments:
- "… nrounds = 1 to be equivalent to fitting a tree on the data, so sum of observations should be equal to sum of predictions. Am I missing something?" – Metariat Jan 25 '18 at 11:10
- "… nrounds equal to ~1200 (training and testing MAE inside the training set are almost equal). But when I fit xgboost with 1200 iterations and check out the sum of the predictions, it is not equal to the sum of observations. And the MAE in the real test set is actually very high compared to the MAE in the CV test set." – Metariat Jan 26 '18 at 09:04
- "… nrounds = 1. Might want to reword the question and title with that in mind, since convergence has little to do with it." – jbowman Jan 26 '18 at 17:04