
In problems where one of $3+$ categories can be observed and we predict the probability of each category being observed, it is known that the Brier score is a strictly proper scoring rule that is uniquely optimized in expected value by the true probability values [1, 2, 3, 4]. The machine learning community often refers to these problems with $3+$ categories as "multiclass" problems.
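To make the multiclass claim concrete, here is a small sketch (the function name is my own) of the expected Brier score of a forecast $q$ when the true class probabilities are $p$, using $E[(q_k - Y_k)^2] = q_k^2 - 2 q_k p_k + p_k$ for a one-hot outcome $Y$; the expectation is smallest exactly at $q = p$:

```python
# Sketch: expected Brier score of a probability forecast q when the
# one-hot outcome Y has true class probabilities p. For each class k,
# E[(q_k - Y_k)^2] = q_k^2 - 2*q_k*p_k + p_k, since E[Y_k] = E[Y_k^2] = p_k.
def expected_brier(q, p):
    return sum(qk**2 - 2 * qk * pk + pk for qk, pk in zip(q, p))

p = [0.2, 0.5, 0.3]                         # true class probabilities
at_truth = expected_brier(p, p)             # minimum expected score
perturbed = expected_brier([0.3, 0.4, 0.3], p)
assert at_truth < perturbed                 # truthful forecast is strictly better
```

Any forecast other than $p$ itself gives a strictly larger expectation, which is what "strictly proper" means here.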

In contrast, "multi-label" problems allow for all, none, or any combination of categorical outcomes to be observed, and we model the probability of each individual outcome, possibly with relationships between the outcomes (e.g., if there is a horse in a photo, there probably isn't an airplane).
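For concreteness, one natural way to write a multi-label Brier score (this is my own framing, and perhaps not the only convention) scores each label's probability separately against its 0/1 outcome and averages the per-label squared errors:

```python
# Sketch of a per-label ("averaged") multi-label Brier score:
# q[i] is the forecast probability that label i is present,
# y[i] is 1 if label i was observed and 0 otherwise.
def multilabel_brier(q, y):
    return sum((qi - yi) ** 2 for qi, yi in zip(q, y)) / len(q)

# Photo containing a horse (label 0) and a person (label 1),
# but no airplane (label 2):
score = multilabel_brier([0.9, 0.8, 0.1], [1, 1, 0])
```

Whether a definition along these lines remains strictly proper, especially when the labels are dependent, is what I am asking about.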

For a multi-label outcome, is the Brier score still a strictly proper scoring rule?

Dave
  • I'm sure I'll post the same question about log loss (which I would link here). If an answer here wants to address both, that would not be unappreciated (especially if there is an interesting reason why the answers differ). – Dave Nov 20 '23 at 15:11
  • "it is known that the Brier score is a strictly proper scoring rule that is uniquely maximized in expected value by the true probability values". Reading the wikipedia definition here gives me a largely different impression. Suppose my sample is $x = [1,1,1,0,0,0]$ drawn iid from Bernoulli 0.5 RV. The Brier score is "uniquely maximized" if it is exactly $Q = [1,1,1,0,0,0]$, not $[0.5, 0.5, 0.5, 0.5, 0.5, 0.5]$ or "the true probability values" as you say. – AdamO Nov 20 '23 at 15:16
  • @AdamO I have included links discussing why Brier score is a strictly proper scoring rule. – Dave Nov 20 '23 at 15:23
  • I have no concern at all that Brier Score is a proper scoring rule. In fact, Dr. Lumley's response in Link 3 perfectly backs up why I believe your interpretation of the optimality condition is not correct - the best score is not the actual probabilities, it's just the result itself. If you accept that, it is trivial (in my opinion) to assert that the Brier score is still strictly proper in the multiclass setting. – AdamO Nov 20 '23 at 16:25
  • @AdamO What would you say defines a strictly proper scoring rule? What I'm saying seems to come straight from the Gneiting/Raftery 2007 JASA paper. – Dave Nov 20 '23 at 16:33
  • A strictly proper scoring rule is one where only the sample itself provides the optimal scoring value. – AdamO Nov 20 '23 at 16:36
  • @AdamO That seems to deviate from Gneiting and Raftery, even just the abstract. Do you have a way to reconcile your definition with theirs? – Dave Nov 20 '23 at 16:41
  • I think my answer to this question explains how quadratic loss/Brier score is (strictly) proper in the multi-label setting. Maybe you mean some specific situation which was not addressed there? – picky_porpoise Nov 21 '23 at 08:09
  • @picky_porpoise It bothers me to have to transform to a multi-class problem on the power set of possible labels. Perhaps that argument is valid, though. I need to think about it more. I think I can buy it when the labels are independent. If the labels are not independent, then I have reservations. – Dave Nov 22 '23 at 19:54
  • @Dave To define a scoring function for this setting, you necessarily have to specify a set of possible forecasts. I doubt that one can find a much simpler set than the two options mentioned there (and on Wikipedia). – picky_porpoise Nov 22 '23 at 20:15
  • @picky_porpoise The trouble that I have with doing multi-class classification on the power set is that, if the prediction is a huge probability of ${A, B}$ when the true category is ${B}$, that seems to penalize the model immensely. However, the model did a good job to predict that $B$ would be present, just a bad job of predicting that $A$ would be present. – Dave Nov 22 '23 at 20:22
  • @Dave Maybe specify in your question then, that you are looking for a way to define a proper multi-label Brier score which doesn't have this feature. – picky_porpoise Nov 23 '23 at 13:09

0 Answers