
This is scikit-learn GradientBoosting's binomial deviance loss function:

    def __call__(self, y, pred, sample_weight=None):
        """Compute the deviance (= 2 * negative log-likelihood). """
        # logaddexp(0, v) == log(1.0 + exp(v))
        pred = pred.ravel()
        if sample_weight is None:
            return -2.0 * np.mean((y * pred) - np.logaddexp(0.0, pred))
        else:
            return (-2.0 / sample_weight.sum() *
                    np.sum(sample_weight * ((y * pred) - np.logaddexp(0.0, pred))))

This loss function is not symmetric between class 0 and class 1. Can anyone explain how this is considered OK?

For example, with no sample weight, the loss function for class 1 is

-2 * (pred - log(1 + exp(pred)))

vs. for class 0

-2 * (-log(1 + exp(pred)))

The plots of these two are not similar in terms of cost. Can anyone help me understand?
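
To make the comparison concrete, here is a small sketch that just evaluates the two expressions above at a few values of pred (the helper names are only for illustration, they are not sklearn code):

    import numpy as np

    # Illustrative helpers: the two per-class expressions above,
    # for a single observation.
    def loss_class1(pred):
        # y = 1: -2 * (pred - log(1 + exp(pred)))
        return -2.0 * (pred - np.logaddexp(0.0, pred))

    def loss_class0(pred):
        # y = 0: -2 * (-log(1 + exp(pred)))
        return 2.0 * np.logaddexp(0.0, pred)

    for pred in [-2.0, -1.0, 0.0, 1.0, 2.0]:
        print(pred, loss_class1(pred), loss_class0(pred))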

Kumaran
1 Answer

There are two observations needed to understand this implementation.

The first is that pred is not a probability; it is a log odds.
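
You can check this directly on a fitted model. The sketch below (using a small synthetic dataset, nothing from the question) shows that pushing the raw score from decision_function through the sigmoid should reproduce predict_proba for the positive class:

    import numpy as np
    from scipy.special import expit
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    # Small synthetic binary problem, just for illustration.
    X, y = make_classification(n_samples=200, random_state=0)
    clf = GradientBoostingClassifier(random_state=0).fit(X, y)

    raw = clf.decision_function(X)        # the raw score ("pred"): a log odds
    prob = clf.predict_proba(X)[:, 1]     # probability of the positive class

    print(np.allclose(expit(raw), prob))  # True: sigmoid(log odds) == probability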

The second is a standard algebraic manipulation of the binomial deviance that goes like this. Let $P$ be the log odds, what sklearn calls pred. Then the definition of the binomial deviance of an observation is (up to a factor of $-2$)

$$y \log(p) + (1-y) \log(1 - p) = \log(1 - p) + y \log \left( \frac{p}{1-p} \right)$$

Now observe that $p = \frac{e^{P}}{1 + e^{P}}$ and $1-p = \frac{1}{1 + e^{P}}$ (a quick check is to sum them in your head, you'll get $1$). So

$$\log(1-p) = \log \left( \frac{1}{1 + e^{P}} \right) = - \log(1 + e^{P}) $$

and

$$ \log \left( \frac{p}{1-p} \right) = \log ( e^{P} ) = P $$

So altogether, the binomial deviance equals

$$y P - \log( 1 + e^{P} )$$

which is the expression sklearn is using.
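
If you want a numerical sanity check of the algebra (again just a sketch, not sklearn code), the two forms agree for random labels and log odds:

    import numpy as np
    from scipy.special import expit

    rng = np.random.default_rng(0)
    P = rng.normal(size=1000)             # log odds ("pred")
    y = rng.integers(0, 2, size=1000)     # 0/1 labels
    p = expit(P)                          # p = exp(P) / (1 + exp(P))

    lhs = y * np.log(p) + (1 - y) * np.log1p(-p)   # textbook form
    rhs = y * P - np.logaddexp(0.0, P)             # sklearn's form

    print(np.allclose(lhs, rhs))          # True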

Matthew Drury
  • Thank you. If I replace pred with the log odds, the loss function is uniform for both classes. – Kumaran Jun 21 '15 at 06:11
  • This same question came up for me recently. I was looking at https://gradientboostedmodels.googlecode.com/git/gbm/inst/doc/gbm.pdf page 10 where the gradient of the deviance is listed. But it seems like the gradient they show is for the log-lik not the negative log-lik. Is this correct - it seems to match your explanation here? – B_Miner Mar 29 '16 at 17:48
  • @B_Miner the link is broken – Fenil Jun 30 '18 at 11:43
  • Are you sure pred is not the predicted score (instead of the log odds of the predicted score)? Otherwise, sounds confusing that scikit-learn did not name that variable log_odds... – Tanguy Jul 22 '21 at 21:41