
In this StackOverflow post I asked whether there was something wrong with my syntax when training an XGBoost model (in R) with the native pseudo-Huber loss reg:pseudohubererror, since neither training nor test error improves (both remain constant). There doesn't seem to be a syntax error, since custom objectives such as log-cosh loss show the same effect.

I am interested in understanding why it doesn't work. Training with absolute loss is fairly popular due to its insensitivity to outliers, and could therefore do better than squared loss - and it is a native argument, so it must be good for something, right? Does it have something to do with the fact that XGBoost requires both a gradient and a Hessian? In what context (data type), if at all, would it work?

So far I haven't been able to find any example where XGBoost with Huber loss is used in a concrete learning problem.


Here's the code from the post above for reference:

Code:

    library(xgboost)

    # Simulate a simple linear-regression problem
    n = 1000
    X = cbind(runif(n, 10, 20), runif(n, 0, 10))
    y = X %*% c(2, 3) + rnorm(n, 0, 1)

    # All but the last observation for training, the last one for testing
    train = xgb.DMatrix(data  = X[-n, ],
                        label = y[-n])

    test = xgb.DMatrix(data  = t(as.matrix(X[n, ])),
                       label = y[n])

    watchlist = list(train = train, test = test)

    xbg_test = xgb.train(data = train,
                         objective = "reg:pseudohubererror",
                         eval_metric = "mae",
                         watchlist = watchlist,
                         gamma = 1,
                         eta = 0.01,
                         nrounds = 10000,
                         early_stopping_rounds = 100)

Result:

    [1] train-mae:44.372692 test-mae:33.085709 
    Multiple eval metrics are present. Will use test_mae for early stopping.
    Will train until test_mae hasn't improved in 100 rounds.
    [2] train-mae:44.372692 test-mae:33.085709 
    [3] train-mae:44.372688 test-mae:33.085709 
    [4] train-mae:44.372688 test-mae:33.085709 
    [5] train-mae:44.372688 test-mae:33.085709 
    [6] train-mae:44.372688 test-mae:33.085709 
    [7] train-mae:44.372688 test-mae:33.085709 
    [8] train-mae:44.372688 test-mae:33.085709 
    [9] train-mae:44.372688 test-mae:33.085709 
    [10] train-mae:44.372692 test-mae:33.085709 

PaulG
  • Have you tried implementing mean squared error as a custom objective function? – wdkrnls Aug 27 '21 at 22:30
  • @wdkrnls: Yes I have, to check if the rmse corresponds to the native function - and it did as far as I remember. I have tried multiple custom objective functions and they all improved, except the Huber loss. – PaulG Aug 29 '21 at 08:33
  • That's good. I thought in the other question you had tried to use log_cosh and didn't have much luck with that either. I tried that one myself and couldn't get it to work, so you beat me there. – wdkrnls Aug 30 '21 at 15:04
  • @wdkrnls: Sorry, I meant that Huber loss and related functions such as log-cosh did not improve (log-cosh and, I believe, two other Huber-loss implementations were the only custom ones I tried). Other custom losses (not Huber-related) I used, like RMSE, MSE, earth-mover-distance-based losses etc., behaved as expected. – PaulG Aug 31 '21 at 16:41
  • +1 fun question. A pseudohubererror objective is slightly more involved than it might seem at first and can require some additional tuning; please see my answer below for more details. – usεr11852 Nov 30 '23 at 01:59

1 Answer


There is no definitive answer to this, but I would note one major and one minor point:

  1. The major point is that an XGBoost booster starts with a base_score, the initial prediction score for all instances. Given an adequate number of boosting iterations, it has a relatively small effect. That said, on a hard problem where the initial prediction is far from any reasonable starting point, the whole method can get stuck. I would suggest trying different base scores. In the example given, setting base_score = 45.0 ($45$ being a round number close to the training set's median response value here) puts the learner on a reasonable learning path (see the sketch below for the full call).

This makes our learning path look a bit like this:

    [1] train-mae:8.072581  test-mae:6.321724 
    Multiple eval metrics are present. Will use test_mae for early stopping.
    Will train until test_mae hasn't improved in 100 rounds.
    [2] train-mae:7.651685  test-mae:5.641270
    [3] train-mae:7.228202  test-mae:5.145817
    [4] train-mae:6.848772  test-mae:4.616982
    (...)
    [423] train-mae:0.571731  test-mae:1.097401
    [424] train-mae:0.571609  test-mae:1.097115
    Stopping. Best iteration:
    [324] train-mae:0.589210  test-mae:1.096233
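
For reference, a minimal sketch of the call behind this run, building directly on the question's code (the object name is mine; base_score is simply passed alongside the other booster parameters):

    # Same data, DMatrix and watchlist objects as in the question's code;
    # only the booster call changes: base_score sets the initial prediction.
    xgb_base = xgb.train(data = train,
                         objective = "reg:pseudohubererror",
                         eval_metric = "mae",
                         watchlist = watchlist,
                         base_score = 45.0,
                         gamma = 1,
                         eta = 0.01,
                         nrounds = 10000,
                         early_stopping_rounds = 100)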

  2. The minor point is that the pseudo-Huber loss function itself is parametrised by $\delta$, which is what XGBoost refers to as huber_slope. Our objective function approximates a straight line with slope $\delta$ for large residual values but, importantly, it also approximates $\frac{a^{2}}{2}$ for small residual values $a$. So while, yes, $\delta = 1$ makes our function look "a lot" like MAE for large residual values, it is the small residuals that actually inform our gradient step, and the coefficient $\frac{1}{2}$ can be a very large value there, leading our learner to overshoot. On its own this parameter is not as impactful as base_score, but it can help us reach lower values. In the example given, after setting base_score = 45.0 we can also set huber_slope = 0.1 and thus get even more competitive MAE values (see the sketch further below for the full call).
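
For reference, the pseudo-Huber loss of a residual $a$ is commonly written as (I am assuming this is the parametrisation behind huber_slope)

$$L_{\delta}(a) = \delta^{2}\left(\sqrt{1 + \left(\tfrac{a}{\delta}\right)^{2}} - 1\right),$$

which indeed behaves like $\frac{a^{2}}{2}$ for $|a| \ll \delta$ and like a straight line of slope $\delta$ for $|a| \gg \delta$.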

With base_score = 45.0 and huber_slope = 0.1, the learning path now looks a bit like this:

    [1] train-mae:8.406398  test-mae:7.042973 
    Multiple eval metrics are present. Will use test_mae for early stopping.
    Will train until test_mae hasn't improved in 100 rounds.
    [2] train-mae:8.313732  test-mae:6.996540
    [3] train-mae:8.238347  test-mae:6.948215
    [4] train-mae:8.171287  test-mae:6.907307
    (...)
    [263] train-mae:1.274389  test-mae:0.244793
    [264] train-mae:1.270874  test-mae:0.244984
    Stopping. Best iteration:
    [164] train-mae:2.070399  test-mae:0.018089

(Notice that the initial boosting rounds now have a higher test-mae as well, since our large residuals/errors are less influential in those early rounds than before.)
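
Again only as a sketch building on the question's code, this second run corresponds to adding huber_slope to the previous call (it is passed the same way as the other booster parameters; the object name is illustrative):

    # As before, but with a smaller pseudo-Huber slope: huber_slope sets the
    # delta discussed above, so gradients from large residuals are capped at 0.1.
    xgb_slope = xgb.train(data = train,
                          objective = "reg:pseudohubererror",
                          eval_metric = "mae",
                          watchlist = watchlist,
                          base_score = 45.0,
                          huber_slope = 0.1,
                          gamma = 1,
                          eta = 0.01,
                          nrounds = 10000,
                          early_stopping_rounds = 100)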

As a final comment, the best test-mae when using the standard squared error loss (objective = "reg:squarederror") is 1.623917, so in terms of MAE we indeed do better in both runs when using objective = "reg:pseudohubererror".

usεr11852