
In this StackOverflow post I asked whether there was something wrong with my syntax when training an XGBoost model (in R) with the native pseudo-Huber loss reg:pseudohubererror, since neither training nor test error improves (both remain constant). There doesn't seem to be a syntax error, since custom objectives such as log-cosh loss show the same effect.

I am interested in understanding why it doesn't work. Training with absolute loss is fairly popular due to its insensitivity to outliers, and could therefore do better than squared loss - and it is a native argument, so it must be good for something, right? Does it have something to do with the fact that XGBoost requires both a gradient and a Hessian? In what context (data type), if at all, would it work?

So far I haven't been able to find any example where XGBoost with Huber loss is used in a concrete learning problem.


Here's the code from the post above for reference:

Code:

    library(xgboost)

    # Simulate a simple linear-regression problem
    n = 1000
    X = cbind(runif(n, 10, 20), runif(n, 0, 10))
    y = X %*% c(2, 3) + rnorm(n, 0, 1)

    # All but the last observation for training, the last one for testing
    train = xgb.DMatrix(data  = X[-n, ],
                        label = y[-n])

    test = xgb.DMatrix(data  = t(as.matrix(X[n, ])),
                       label = y[n])

    watchlist = list(train = train, test = test)

    xbg_test = xgb.train(data = train,
                         objective = "reg:pseudohubererror",
                         eval_metric = "mae",
                         watchlist = watchlist,
                         gamma = 1,
                         eta = 0.01,
                         nrounds = 10000,
                         early_stopping_rounds = 100)

Result:

    [1] train-mae:44.372692 test-mae:33.085709 
    Multiple eval metrics are present. Will use test_mae for early stopping.
    Will train until test_mae hasn't improved in 100 rounds.
    [2] train-mae:44.372692 test-mae:33.085709 
    [3] train-mae:44.372688 test-mae:33.085709 
    [4] train-mae:44.372688 test-mae:33.085709 
    [5] train-mae:44.372688 test-mae:33.085709 
    [6] train-mae:44.372688 test-mae:33.085709 
    [7] train-mae:44.372688 test-mae:33.085709 
    [8] train-mae:44.372688 test-mae:33.085709 
    [9] train-mae:44.372688 test-mae:33.085709 
    [10] train-mae:44.372692 test-mae:33.085709 

PaulG
  • Have you tried implementing mean squared error as a custom objective function? – wdkrnls Aug 27 '21 at 22:30
  • @wdkrnls: Yes I have, to check if the rmse corresponds to the native function - and it did as far as I remember. I have tried multiple custom objective functions and they all improved, except the Huber loss. – PaulG Aug 29 '21 at 08:33
  • That's good. I thought in the other question you had tried to use log_cosh and didn't have much luck with that either. I tried that one myself and couldn't get it to work, so you beat me there. – wdkrnls Aug 30 '21 at 15:04
  • @wdkrnls: Sorry, I meant that Huber loss and related functions such as log-cosh did not improve (log-cosh and, I believe, two other Huber-loss implementations were the only custom ones I tried). Other custom losses (not Huber-related) I used, like RMSE, MSE, earth-mover-distance-based losses etc., behaved as expected. – PaulG Aug 31 '21 at 16:41
  • +1 fun question. A pseudohubererror objective is slightly more involved than it might seem at first and can require some additional tuning; please see my answer below for more details. – usεr11852 Nov 30 '23 at 01:59

1 Answer


There is no definitive answer to this, but I would note one major and one minor point:

  1. The major point is that an XGBoost booster starts with a base_score, the initial prediction score for all instances. Given an adequate number of boosting iterations, it has a relatively small effect. That said, on a hard problem where the initial prediction is far from any reasonable starting point, the whole method can get stuck. I would suggest trying different base scores. In the example given, setting base_score = 45.0 ($45$ being a round number close to the training set's median response value here) puts the learner on a reasonable learning path (see the sketch below for the full call).

This makes our learning path look a bit like this:

    [1] train-mae:8.072581  test-mae:6.321724 
    Multiple eval metrics are present. Will use test_mae for early stopping.
    Will train until test_mae hasn't improved in 100 rounds.
    [2] train-mae:7.651685  test-mae:5.641270
    [3] train-mae:7.228202  test-mae:5.145817
    [4] train-mae:6.848772  test-mae:4.616982
    (...)
    [423] train-mae:0.571731  test-mae:1.097401
    [424] train-mae:0.571609  test-mae:1.097115
    Stopping. Best iteration:
    [324] train-mae:0.589210  test-mae:1.096233
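
For reference, a minimal sketch of the call behind this run, building directly on the question's code (the object name is mine; base_score is simply passed alongside the other booster parameters):

    # Same data, DMatrix and watchlist objects as in the question's code;
    # only the booster call changes: base_score sets the initial prediction.
    xgb_base = xgb.train(data = train,
                         objective = "reg:pseudohubererror",
                         eval_metric = "mae",
                         watchlist = watchlist,
                         base_score = 45.0,
                         gamma = 1,
                         eta = 0.01,
                         nrounds = 10000,
                         early_stopping_rounds = 100)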

  2. The minor point is that the pseudo-Huber loss function itself is parametrised by $\delta$, which is what XGBoost refers to as huber_slope. Our objective function approximates a straight line with slope $\delta$ for large residual values but, importantly, it also approximates $\frac{a^{2}}{2}$ for small residual values $a$. So while, yes, $\delta = 1$ makes our function look "a lot" like MAE for large residual values, it is the small residuals that actually inform our gradient step, and the coefficient $\frac{1}{2}$ can be a very large value there, leading our learner to overshoot. On its own this parameter is not as impactful as base_score, but it can help us reach lower values. In the example given, after setting base_score = 45.0 we can also set huber_slope = 0.1 and thus get even more competitive MAE values (see the sketch further below for the full call).
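
For reference, the pseudo-Huber loss of a residual $a$ is commonly written as (I am assuming this is the parametrisation behind huber_slope)

$$L_{\delta}(a) = \delta^{2}\left(\sqrt{1 + \left(\tfrac{a}{\delta}\right)^{2}} - 1\right),$$

which indeed behaves like $\frac{a^{2}}{2}$ for $|a| \ll \delta$ and like a straight line of slope $\delta$ for $|a| \gg \delta$.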

With base_score = 45.0 and huber_slope = 0.1, the learning path now looks a bit like this:

    [1] train-mae:8.406398  test-mae:7.042973 
    Multiple eval metrics are present. Will use test_mae for early stopping.
    Will train until test_mae hasn't improved in 100 rounds.
    [2] train-mae:8.313732  test-mae:6.996540
    [3] train-mae:8.238347  test-mae:6.948215
    [4] train-mae:8.171287  test-mae:6.907307
    (...)
    [263] train-mae:1.274389  test-mae:0.244793
    [264] train-mae:1.270874  test-mae:0.244984
    Stopping. Best iteration:
    [164] train-mae:2.070399  test-mae:0.018089

(Notice that the initial boosting rounds now have a higher test-mae as well, since our large residuals/errors are less influential in those early rounds than before.)
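
Again only as a sketch building on the question's code, this second run corresponds to adding huber_slope to the previous call (it is passed the same way as the other booster parameters; the object name is illustrative):

    # As before, but with a smaller pseudo-Huber slope: huber_slope sets the
    # delta discussed above, so gradients from large residuals are capped at 0.1.
    xgb_slope = xgb.train(data = train,
                          objective = "reg:pseudohubererror",
                          eval_metric = "mae",
                          watchlist = watchlist,
                          base_score = 45.0,
                          huber_slope = 0.1,
                          gamma = 1,
                          eta = 0.01,
                          nrounds = 10000,
                          early_stopping_rounds = 100)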

As a final comment, the best test-mae when using the standard squared error loss (objective = "reg:squarederror") is 1.623917, so in terms of MAE we indeed do better in both runs when using objective = "reg:pseudohubererror".

usεr11852