3

There's some discussion of what the F-measure means. I understand that the $\beta$ parameter determines the weight of recall in the combined score. In particular, one answer states that "for good models using the $F_{\beta}$ implies you consider false negatives $\beta^2$ times more costly than false positives." $\beta < 1$ lends more weight to precision, while $\beta > 1$ favors recall ($\beta \to 0$ considers only precision, $\beta \to +\infty$ only recall).

If you want to weight precision or recall more heavily than the other, how do you decide on $\beta$? I'm a bit unclear on the math behind the F-measure: does $\beta = 0.5$ mean that precision is weighted twice as much as recall?
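To make the question concrete, here is a small sketch of my own (assuming the standard definition $F_\beta = (1+\beta^2)PR/(\beta^2 P + R)$; the precision/recall values are made up):

```python
# Sketch: how beta shifts the precision/recall trade-off in F_beta.
# Standard definition: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).

def f_beta(precision, recall, beta):
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Two hypothetical models with mirrored strengths.
for beta in (0.5, 1.0, 2.0):
    high_p = f_beta(precision=0.9, recall=0.5, beta=beta)  # precision-heavy
    high_r = f_beta(precision=0.5, recall=0.9, beta=beta)  # recall-heavy
    print(f"beta={beta}: precision-heavy={high_p:.3f}, recall-heavy={high_r:.3f}")

# beta=0.5 favors the precision-heavy model, beta=2 the recall-heavy one,
# and beta=1 scores both identically (F1 is symmetric in P and R).
```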

Carl
  • 13,084
skeller88
  • 289
  • 1
    From the $\beta^2$ factor, $\beta=0.5$ would suggest that precision is weighted 4 times as much as recall, at least according to the one answer cited. – Carl Feb 29 '20 at 02:14
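For reference (standard algebra, added here for clarity), the $\beta^2$ weighting follows from writing $F_\beta$ as a weighted harmonic mean of precision $P$ and recall $R$:

$$
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}
= \left(\frac{1}{1+\beta^2}\cdot\frac{1}{P} + \frac{\beta^2}{1+\beta^2}\cdot\frac{1}{R}\right)^{-1},
$$

so recall carries $\beta^2$ times the weight of precision; with $\beta = 0.5$, precision is weighted $1/\beta^2 = 4$ times as heavily as recall, as the comment above notes.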

1 Answer

4

Don't use F scores at all. Every criticism of accuracy collected at Why is accuracy not the best measure for assessing classification models? applies completely equally to precision, recall and all F scores. Instead, use proper scoring rules.
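As a concrete illustration of this recommendation (not part of the original answer), here is a minimal sketch of the Brier score, one common proper scoring rule; it evaluates predicted probabilities directly instead of thresholded class labels:

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 outcomes.
    Lower is better; minimized in expectation by the true probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return np.mean((p_pred - y_true) ** 2)

y = [1, 0, 1, 1, 0]
p_calibrated = [0.9, 0.2, 0.8, 0.7, 0.1]     # honest probabilities
p_overconfident = [1.0, 0.0, 1.0, 1.0, 1.0]  # hard 0/1 "predictions"

print(brier_score(y, p_calibrated))     # ~0.038
print(brier_score(y, p_overconfident))  # 0.2 -- punished for the wrong hard call
```

Unlike accuracy or F-scores, this rewards well-calibrated probabilities and requires no classification threshold.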

Stephan Kolassa
  • 123,354
  • I have a sneaking suspicion that there's an additional concern with F-scores because they combine conditional probabilities with very different conditions. Would you care to have a look at the last part of https://stats.stackexchange.com/a/99921/4598 and let me know what you think of it? – cbeleites unhappy with SX Aug 16 '22 at 12:21
  • I completely share your concerns about proper scoring rules, but it is possible to derive quadratic errors analogous to Brier's score that capture certain aspects of predictive behaviour, like sensitivity, specificity or the predictive values (and, in consequence, also an analogue of the F-measures). They have the continuous behaviour and favorable variance properties of Brier's score/MSE, but are not themselves proper scoring rules, since they focus on subgroups of cases. They can be combined, though (e.g. into Brier's score; a sketch follows after these comments). Paper: http://arxiv.org/abs/1301.0264 – cbeleites unhappy with SX Aug 16 '22 at 13:05
  • ... I thus feel the improper-scoring aspect could be repaired. One may argue that these metrics belong more to the decision aspect than to the probability density prediction. Still, I think it may be sensible (even outside the application scenario we discuss in the paper) to have metrics at hand for the decision stage that are closely related to what was used before. – cbeleites unhappy with SX Aug 16 '22 at 13:13
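A minimal sketch of the subgroup idea from the comments above (my reading of the decomposition; the helper names are hypothetical, see the linked paper for the actual definitions):

```python
import numpy as np

# Squared error restricted to the truly positive / truly negative subgroups
# gives continuous analogues of (1 - sensitivity) and (1 - specificity).
y = np.array([1, 0, 1, 1, 0, 0])              # reference labels
p = np.array([0.8, 0.3, 0.6, 0.9, 0.2, 0.4])  # predicted probabilities

sq_err = (p - y) ** 2

brier_positives = sq_err[y == 1].mean()  # sensitivity-analogue (hypothetical name)
brier_negatives = sq_err[y == 0].mean()  # specificity-analogue (hypothetical name)

# Neither subgroup score is proper on its own (each conditions on a subgroup),
# but prevalence-weighting recombines them into the overall Brier score.
prevalence = (y == 1).mean()
overall = prevalence * brier_positives + (1 - prevalence) * brier_negatives
assert np.isclose(overall, sq_err.mean())
print(brier_positives, brier_negatives, overall)
```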