
Is there a "soft" version of the ye-olde precision and recall metrics? Precision (and recall) are defined in terms of binary decisions, i.e.

precision = sum(marked_as_positive * is_positive) / sum(marked_as_positive)

Where marked_as_positive equals 0 or 1. Has anyone encountered a version that uses probabilities instead of binary decisions, i.e.

sum(P(is_positive) * is_positive) / sum(P(is_positive))

Where P(is_positive) is between 0 and 1 and represents the probability that a given sample is positive, as assigned by some classifier?
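
For concreteness, here's a toy sketch in Python/NumPy of what I mean (the arrays are made-up data); the first quantity is the usual hard precision, the second is the probability-weighted version I'm asking about:

    import numpy as np

    # Toy data: binary ground truth and a classifier's outputs.
    is_positive        = np.array([1, 0, 1, 1, 0, 1])              # true labels
    marked_as_positive = np.array([1, 1, 1, 0, 0, 1])              # hard decisions
    p_is_positive      = np.array([0.9, 0.6, 0.8, 0.4, 0.1, 0.7])  # predicted probabilities

    # Ordinary ("hard") precision from binary decisions:
    hard_precision = (marked_as_positive * is_positive).sum() / marked_as_positive.sum()

    # The "soft" version I'm asking about, weighting by predicted probability:
    soft_precision = (p_is_positive * is_positive).sum() / p_is_positive.sum()

    print(hard_precision, soft_precision)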

I'm aware of log loss, AUC and similar "soft" metrics, but for some reason I've never encountered the one above - which makes me suspect that there's something very wrong with using it.

r0u1i
  • Look up precision-recall curves. – Jeffrey Girard Oct 09 '16 at 15:11
  • It does seem like a weird metric. I don't see the appeal of it. – Kodiologist Oct 09 '16 at 19:05
  • @kodiologist - it's like precision but for tasks where you are going to weigh examples by their probability. Example: given a set of subscribers predict how many (and not specifically who) will churn in the next year. – r0u1i Oct 09 '16 at 19:56
  • @r0u1i But you don't have the true probabilities; you only have your predicted probabilities, which you're comparing to binary observations. It doesn't make sense to me to weight observations by a prediction about them. The standard tool for comparing probabilistic predictions to discrete outcomes is proper scoring rules. – Kodiologist Oct 09 '16 at 21:27

1 Answer


There are different scenarios in which such "partial class memberships" are sensible (in different ways), both for the prediction [which is quite straightforward] and for the reference.

Remote sensing discusses the "problem of mixed pixels", where the partial memberships are not probabilities but fractions as in fuzzy sets: the true classes are mixed because of the low [spatial] resolution. For literature, see e.g. the references in the paper linked below.

In my PhD I looked into "soft" figures of merit from a chemometric perspective, for a medical application where both kinds of partial membership occur: probabilities (the reference diagnosis is not entirely certain about the class) and mixtures of pure classes (several classes of cells occur in the measurement volume). In that context I found it better to sort out optimism/pessimism (as in optimistic/pessimistic bias) in the derived figures of merit: the uncertainty in the reference, whether probability or mixture, translates into a range for the figure of merit that is in accordance with the observed reference and prediction labels.

I find it convenient to state figures of merit as worst case / expected case [under certain assumptions] / best case. Since the medical application was about advising surgeons which brain tissue to cut out, the worst-case performance of such a diagnostic tool is the one to focus on, whereas I found that the remote sensing literature tended to use the optimistic best-case numbers.
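
As a minimal sketch (Python/NumPy, not taken from the paper) of how such a range can arise for precision: suppose the predictions are hard, but the reference membership of each case is only known to lie in an interval - the interval representation, the expected membership and all the numbers below are a made-up illustration, not the definitions used in the paper.

    import numpy as np

    # Hypothetical illustration (not the paper's definitions): hard predictions,
    # and a reference whose class membership is only known up to an interval
    # [ref_lo, ref_hi] around an expected membership ref_mid.
    predicted_positive = np.array([1, 1, 1, 0, 1, 0])
    ref_lo  = np.array([0.9, 0.0, 0.6, 0.2, 1.0, 0.0])   # lowest membership consistent with the reference
    ref_mid = np.array([0.95, 0.3, 0.7, 0.3, 1.0, 0.1])  # expected membership
    ref_hi  = np.array([1.0, 0.5, 0.9, 0.4, 1.0, 0.2])   # highest membership consistent with the reference

    mask = predicted_positive == 1
    n_pred_pos = mask.sum()

    # The reference uncertainty translates into a range for precision:
    precision_worst    = ref_lo[mask].sum()  / n_pred_pos
    precision_expected = ref_mid[mask].sum() / n_pred_pos
    precision_best     = ref_hi[mask].sum()  / n_pred_pos

    print(precision_worst, precision_expected, precision_best)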


Your multiplication for the AND operator (precision = the fraction of cases that are predicted positive AND are positive according to the reference, out of all cases predicted positive) yields the expected precision above.


BTW: the expectation can be expressed in a way that is closely related to Brier's score (a proper scoring rule), and to typical regression-error figures of merit.
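
For reference, here is a minimal sketch (with made-up numbers) of the Brier score itself: it is simply the mean squared difference between the predicted probabilities and the 0/1 reference labels, which is also why it resembles the usual regression-error figures of merit.

    import numpy as np

    # Made-up predicted probabilities and binary reference labels.
    p_is_positive = np.array([0.9, 0.2, 0.7, 0.4, 0.95])
    is_positive   = np.array([1,   0,   1,   0,   1])

    # Brier score: a proper scoring rule; lower is better, 0 is perfect.
    brier = np.mean((p_is_positive - is_positive) ** 2)
    print(brier)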

  • Thanks a lot! In your above-mentioned paper I found definitions for sensitivity, specificity, PPV and NPV. However, I couldn't find a definition of precision. Can you help me there? – bonanza Aug 17 '17 at 08:12
  • @bonanza: recall = sensitivity, precision = positive predictive value (PPV) (there are lots of synonyms for those figures of merit; different fields use different sets of terms). – cbeleites unhappy with SX Aug 19 '17 at 12:54