
I recently stumbled over a generalisation of $F_1$ score to cases where the model predicts probabilities: $$ F_1 = 2 \frac{\sum y_i \hat{p}_i}{\sum y_i^2 + \sum \hat{p}_i^2} $$ where $y_i \in \{ 0, 1 \}$ are the true class memberships and $\hat{p}_i \in [0, 1]$ the predicted class probabilities.
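In code (a minimal NumPy sketch of how I read the formula; the function name `soft_f1` is mine), this would be:

    import numpy as np

    def soft_f1(y, p_hat):
        # y:     true class memberships in {0, 1}
        # p_hat: predicted class probabilities in [0, 1]
        y = np.asarray(y, dtype=float)
        p_hat = np.asarray(p_hat, dtype=float)
        return 2 * np.sum(y * p_hat) / (np.sum(y**2) + np.sum(p_hat**2))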

The $F_1$ score is usually expressed either in terms of precision and recall or in terms of true and false positives and negatives. However, with elementary algebra it can be re-expressed as: $$ F_1 = 2 \frac{\sum y_i \hat{y}_i}{\sum y_i + \sum \hat{y}_i} $$ with $\hat{y}_i \in \{ 0, 1 \}$ being the predicted class memberships.
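As a quick numerical sanity check (taking scikit-learn's `f1_score` as the reference implementation), the sum form above coincides with the usual definition for binary predictions:

    import numpy as np
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=1000)      # true class memberships
    y_hat = rng.integers(0, 2, size=1000)  # binary predictions

    sum_form = 2 * np.sum(y * y_hat) / (np.sum(y) + np.sum(y_hat))
    print(sum_form, f1_score(y, y_hat))    # the two values agree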

I actually have two questions:

  1. Simply substituting $\hat{p}_i$ for $\hat{y}_i$ seems straightforward, but is it legitimate?

  2. Where do the squares in the first formula come from? Is it just because, for binary vectors, $\sum \hat{y}_i = \sum \hat{y}_i^2$, or is there a deeper theoretical justification?

I observed that, with squares in the denominator, the generalised $F_1$ score reaches its maximum when $\hat{p}$ is not too far from the true $p$ (in a way approximating a proper scoring rule):

[Plot: generalised $F_1$ with squares in the denominator]

while for non-squared values the maximum is reached when $\hat{p}$ is either zero or one:

[Plot: generalised $F_1$ score, non-squared denominator]

(similar to the Brier score vs. absolute loss). Such a non-squared score offers no advantage over the original, binary $F_1$ score. So, are the squares in the denominator just a hack to make the generalised $F_1$ more useful in practice?
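For reference, here is a simplified sketch of the kind of simulation behind the plots above (not the actual code; labels are drawn as Bernoulli($p$) and every observation gets the same predicted probability $q$):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 10_000, 0.3                     # sample size and true positive rate
    y = (rng.random(n) < p).astype(float)  # labels drawn as Bernoulli(p)

    qs = np.linspace(0.01, 1.0, 100)       # same predicted probability q for all observations
    squared     = [2 * np.sum(y * q) / (np.sum(y**2) + n * q**2) for q in qs]
    non_squared = [2 * np.sum(y * q) / (np.sum(y)    + n * q)    for q in qs]

    print(qs[np.argmax(squared)])      # interior maximum (around sqrt(p) in this setup)
    print(qs[np.argmax(non_squared)])  # maximum pushed to the boundary, q = 1

In this simplified setting the squared version peaks at an interior value of $q$, whereas the non-squared one keeps increasing all the way to $q = 1$.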

Igor F.
  • Do you have a reference for this version of the F1 score? – dipetkov Oct 28 '22 at 09:04
  • @dipetkov It came up in a paper I recently reviewed, something about image segmentation using deep neural networks. I cannot share it, but, even if I could, it wouldn't help, as it provides no further details. – Igor F. Oct 28 '22 at 09:26
  • This seems relevant to question #1 (and probably should be cited by the paper you are reviewing?): https://aclanthology.org/2020.eval4nlp-1.9/ – dipetkov Oct 28 '22 at 10:10
  • If it appeared in a paper you reviewed, I assume you asked the authors to explain their proposed measure, and/or point to where they found it in the literature? Also, if (if!) the point was to approximate a proper scoring rule, why not use a proper scoring rule in the first place? It may well be that the authors of that paper tried exactly that, without knowing of the concept of a proper scoring rule. – Stephan Kolassa Oct 28 '22 at 14:16
  • @StephanKolassa Perhaps I should have, but I didn't, because 1) the measure did not appear obviously wrong to me (it still doesn't, but I'd like to know whether it's theoretically justified or just a clever hack) and 2) it was just a minor point in a paper focused on a different topic. I agree that whether a community should use different scores, and whether it is the reviewer's job to insist on that, are valid questions, but I think they should be discussed on Academia, not Cross Validated. – Igor F. Oct 28 '22 at 14:40
  • @dipetkov Thanks for the link. On my side, I see that a Wikipedia article (https://en.wikipedia.org/wiki/Sørensen-Dice_coefficient) mentions a metric $s_v$, which looks a lot like $F_1$ with squares in the denominator. Unfortunately, it gives no reference. – Igor F. Oct 28 '22 at 14:56
  • Two more possibly related references: sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification and Any soft version for precision/recall?. So far it seems that none of these variants are widely used. Is it because proper scoring rules are better anyway? So there is no point in trying to "fix" accuracy, precision, recall and F1? – dipetkov Oct 31 '22 at 11:20

0 Answers