
We have two models outputting estimates of class probabilities. Combined with a probability cutoff / threshold, these yield classification decisions: if the estimated probability of class 1 is above the threshold, the assigned label is class 1; otherwise, it is class 0. We want to compare the models in terms of their estimated expected loss for a given threshold. The loss function $L(\hat{Y},Y)$ is given by
$$\begin{aligned} L(0,0)&=0, \\ L(0,1)&=a, \\ L(1,0)&=b, \\ L(1,1)&=0 \end{aligned}$$
with $a,b>0$.

The estimated losses on a test subsample are not available. However, we have the ROC curves on the test subsample for each model. We do not have the data behind the two ROC curves, but we observe visually that the first ROC curve entirely dominates the second one, i.e. the two curves never cross (only touch in the bottom left and top right corners).
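
To make the setup concrete, here is a minimal R sketch of the decision rule and the resulting estimated expected loss on a labeled sample; the vectors p_hat and y and the numbers below are purely illustrative:

a = 1; b = 2                           # illustrative loss values: a = L(0,1), b = L(1,0)
threshold = 0.5
p_hat = c(0.1, 0.4, 0.6, 0.9)          # a model's estimated P(Y = 1) on the test subsample
y     = c(0,   1,   0,   1)            # true labels
y_hat = as.integer(p_hat > threshold)  # assign class 1 if the estimated probability exceeds the threshold
mean(a * (y_hat == 0 & y == 1) + b * (y_hat == 1 & y == 0))  # estimated expected loss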

Question: Is that sufficient to conclude that the first model has lower estimated expected loss than the second one (for a given threshold)? If not, could you offer a counterexample?

Related question: "Is a pair of threshold-specific points on two ROC curves sufficient to rank classifiers by expected loss?".

Richard Hardy
  • Do you consider differences between the ROC curves beyond the curves themselves? (e.g. discrepancies might arise because ROC curves are estimates and the error of the estimates can differ. Then one curve might be dominating, but due to a difference in the error of the estimate the expected loss could turn out different.) Or are you considering the ROC curves as given and all other things equal? – Sextus Empiricus Sep 08 '22 at 17:13
  • @SextusEmpiricus, at this point I have set the effects of estimation imprecision aside. Yet I acknowledge its presence in writing "Is that sufficient to conclude that the first model has lower *estimated* expected loss..." I also specify repeatedly that I am considering estimates, not the true values. – Richard Hardy Sep 09 '22 at 06:26
  • My problem/confusion is that expectation values involve some randomness. It is unclear to me which types of randomness you take into consideration. The most obvious randomness is that one can consider the number of false future predictions of a group of positive and negative cases as being binomial distributed, and that binomial distribution is characterized by the false positive rate and false negative rate. So your question is only about that source of randomness and not about randomness in estimates of the false positive rates and false negative rates? – Sextus Empiricus Sep 09 '22 at 07:13
  • Also, we do not consider classifiers that have dependencies/correlations among predictions? – Sextus Empiricus Sep 09 '22 at 07:16
  • Are you considering more complicated loss functions e.g. where two times a false negative is not $2 \times a$ but instead some non-linear function? – Sextus Empiricus Sep 09 '22 at 07:18
  • @SextusEmpiricus, I am only considering the case where two times a false negative is $2\times a$. Regarding randomness, I am happy to set it aside and assume ROC has been estimated perfectly. The core of my question concerns ROC vs. loss functions; randomness caused by sampling from a population is a nuisance here. – Richard Hardy Sep 09 '22 at 07:57
  • I believe I now know enough to formulate answers to both your questions. – Sextus Empiricus Sep 09 '22 at 08:58
  • @SextusEmpiricus, excellent news! – Richard Hardy Sep 09 '22 at 09:12

2 Answers


We do not have the data behind the two ROC curves, but we observe visually that the first ROC curve entirely dominates the second one, i.e. the two curves never cross (only touch in the bottom left and top right corners).

Question: Is that sufficient to conclude that the first model has lower estimated expected loss than the second one (for a given threshold)? If not, could you offer a counterexample?

Yes, it is sufficient, provided we also assume that the threshold is chosen so that the expected loss is optimized.

(E.g., no peculiar procedure that picks a bad threshold, in which case the dominant curve might not be used optimally.)


Let's denote the probabilities of $Y = 1$ and $Y = 0$ by $p_Y$ and $q_Y = 1 - p_Y$, respectively.

Let's use $f_{TP}$ for the true positive rate (sensitivity) and $f_{FP}$ for the false positive rate (1 − specificity).

Then, for the given loss function, the expected loss can be written as a linear function of the true positive rate and the false positive rate.

$$\begin{aligned} E[L(\hat{Y},Y)] &= P(\hat{Y} = 0, Y = 0) \cdot 0 + P(\hat{Y} = 0, Y = 1) \cdot a + P(\hat{Y} = 1, Y = 0) \cdot b + P(\hat{Y} = 1, Y = 1) \cdot 0 \\ &= p_{Y} (1-f_{TP})\, a + q_{Y} f_{FP}\, b \\ &= - p_{Y} f_{TP}\, a + q_{Y} f_{FP}\, b + p_{Y} a \\ &= -a' f_{TP} + b' f_{FP} + c', \end{aligned}$$

where $a' = a\, p_{Y}$ and $b' = b\, q_{Y}$ are the losses re-weighted by the class probabilities, and $c' = p_{Y}\, a$ is a constant.
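
As a quick numerical sanity check of this identity in R (the particular values of $a$, $b$, $p_Y$, $f_{TP}$ and $f_{FP}$ below are arbitrary):

a = 1; b = 2; p_Y = 0.3; q_Y = 1 - p_Y
f_TP = 0.8; f_FP = 0.25
direct = p_Y * (1 - f_TP) * a + q_Y * f_FP * b            # from the joint probabilities
linear = -(a * p_Y) * f_TP + (b * q_Y) * f_FP + p_Y * a   # -a' f_TP + b' f_FP + c'
all.equal(direct, linear)                                  # TRUE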

When the true positive rate is higher and/or the false positive rate is lower, the expected loss is lower. Thus, take the point on the dominated ROC curve that minimizes $E[L(\hat{Y},Y)]$; since the other ROC curve dominates it, that curve contains a point with at least the same true positive rate and at most the same false positive rate, and hence with an expected loss that is at most as large. The optimum over the dominating curve can then only be equal or lower still.
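
In symbols, writing $(1)$ for a point on the dominated curve and $(2)$ for a point on the dominating curve with $f_{TP}^{(2)} \ge f_{TP}^{(1)}$ and $f_{FP}^{(2)} \le f_{FP}^{(1)}$, the monotonicity step is just

$$-a' f_{TP}^{(2)} + b' f_{FP}^{(2)} + c' \;\le\; -a' f_{TP}^{(1)} + b' f_{FP}^{(1)} + c',$$

since $a', b' > 0$; taking the minimum over all points of each curve preserves the inequality.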

Example

Let's consider two classifiers, each based on a feature that follows a normal distribution with standard deviation 1 within each of the two classes $Y=1$ and $Y=0$. For the dominant classifier the difference between the class means is 1.5; for the non-dominant classifier it is 1. With equal class probabilities and the cost $\text{cost} = - \left(1/\sqrt{e}\right) f_{TP} + f_{FP}$, this looks as follows:

[Figure: comparing costs for two ROC curves of which one is dominant]

The plot shows the two ROC curves in thick lines. The diagonal lines are iso-curves for combinations of false positive rates and true positive rates with the same cost.

Point A is the optimal point for the non-dominant classifier. The arrows to points C1 and C2 show that the dominant classifier has two points that are at least as good (i.e. have no higher cost) as the optimal point of the non-dominant classifier. Point B is the optimum for the dominant classifier and is at least as good as C1 and C2. Since the cost at B is equal to or lower than the costs at C1 and C2, and the costs at C1 and C2 are lower than the cost at A, the optimum of the dominant classifier must have a cost equal to or lower than the optimum of the non-dominant classifier.


In more complex situations, e.g. with noisy (estimated) ROC curves, the conclusion may fail. For example, a ROC curve might dominate only because of overfitting, which can lead to a bad choice of the threshold.


R code for the plot:

xs = seq(-5, 5, 0.01)
f = dnorm(1) / dnorm(0)   # slope factor of the iso-cost lines, equal to 1/sqrt(e)

# plot the two ROC curves
# (the x-axis is the false positive rate, i.e. 1 - specificity)
plot(pnorm(xs, 0.5), pnorm(xs, -0.5), type = "l", lwd = 2,
     xlab = "false positive rate (1 - specificity)", ylab = "sensitivity",
     main = "comparing costs for two ROC curves \n of which one is dominant")
lines(pnorm(xs, 0.75), pnorm(xs, -0.75), lwd = 2)

# plot the iso-cost lines
for (cost in seq(-1.65, 1.65, 0.15)) {
  lines(xs, 1 - (cost - xs) / f, lty = 2)
  xt = 0.075 + cost * 0.7
  text(xt, 1 - (cost - xt) / f + 0.05, paste0("cost = ", cost), srt = 55, cex = 0.7)
}

# the two arrows
shape::Arrows(pnorm(-0.5, 0.5), pnorm(-0.5, -0.5),
              pnorm(-0.5, 0.5), pnorm(qnorm(pnorm(-0.5, 0.5), 0.75), -0.75),
              arr.adj = 1, arr.type = "curved")
shape::Arrows(pnorm(-0.5, 0.5), pnorm(-0.5, -0.5),
              pnorm(qnorm(pnorm(-0.5, -0.5), -0.75), 0.75), pnorm(-0.5, -0.5),
              arr.adj = 1, arr.type = "curved")

# points C1 and C2
points(pnorm(-0.5, 0.5), pnorm(qnorm(pnorm(-0.5, 0.5), 0.75), -0.75), pch = 21, bg = 0)
points(pnorm(qnorm(pnorm(-0.5, -0.5), -0.75), 0.75), pnorm(-0.5, -0.5), pch = 21, bg = 0)
text(pnorm(-0.5, 0.5), pnorm(qnorm(pnorm(-0.5, 0.5), 0.75), -0.75), "C1",
     pos = 4, font = 2, col = rgb(0.3, 0.3, 0.3))
text(pnorm(qnorm(pnorm(-0.5, -0.5), -0.75), 0.75), pnorm(-0.5, -0.5), "C2",
     pos = 2, font = 2, col = rgb(0.3, 0.3, 0.3))

# points A and B
points(pnorm(-0.5, 0.5), pnorm(-0.5, -0.5), pch = 21, bg = 0)
text(pnorm(-0.5, 0.5), pnorm(-0.5, -0.5), "A", pos = 4, font = 2)
points(pnorm(-0.33, 0.75), pnorm(-0.33, -0.75), pch = 21, bg = 0)
text(pnorm(-0.33, 0.75), pnorm(-0.33, -0.75), "B", pos = 2, font = 2)
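
As a follow-up check (not part of the plot), the costs at the four marked points can be computed directly in R, using the same cost as the iso-lines, $f_{FP} + f\,(1 - f_{TP})$, which differs from the cost in the text only by an additive constant; the optimum of the dominant classifier is taken at the exact threshold $-1/3$ (the plot rounds this to $-0.33$):

f = dnorm(1) / dnorm(0)                        # slope factor, 1/sqrt(e)
cost = function(fpr, tpr) fpr + f * (1 - tpr)  # iso-line cost

A  = cost(pnorm(-0.5, 0.5),  pnorm(-0.5, -0.5))
C1 = cost(pnorm(-0.5, 0.5),  pnorm(qnorm(pnorm(-0.5, 0.5), 0.75), -0.75))
C2 = cost(pnorm(qnorm(pnorm(-0.5, -0.5), -0.75), 0.75), pnorm(-0.5, -0.5))
B  = cost(pnorm(-1/3, 0.75), pnorm(-1/3, -0.75))

round(c(A = A, C1 = C1, C2 = C2, B = B), 3)    # B has the lowest cost, A the highest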

  • I guess July is a slow month, so if I wanted to make this thread more visible, perhaps I should have waited with the bounty until late August or so. But this only occurred to me once I had put it, so there it is. Thank you for your great answer! – Richard Hardy Jul 14 '23 at 17:17

ROC curves are incompatible with optimal decision making because each point on the curve conditions on the future to predict the past. See https://www.fharrell.com/post/mlconfusion. For useful measures of predictive performance see https://fharrell.com/post/addvalue.

Expected loss is computed from measures that respect the direction of information flow, e.g., predicting the future from the past.

Frank Harrell
  • Thank you. I am still looking for a (counter)example where the ranking by ROC curves conflicts with the ranking by expected loss. Any hints on how to construct one would be appreciated. – Richard Hardy Nov 17 '21 at 13:19
  • Your statement cannot be true in general because the loss function is completely separate from the data and can be quite nonlinear and even discontinuous. – Frank Harrell Nov 17 '21 at 13:26
  • If that is another argument in addition to what you have in the answer, it should make constructing a counterexample even simpler. Given how many capable users we have on the network, my hopes are increasing. On the technical side, however, the loss function has only two arguments, the binary predicted label $\hat{Y}$ and the binary true label $Y$, and only three possible values, $0,a,b$. I wonder if we can squeeze any complicated patterns out of this 4-bit to 2-bit mapping. Also, I do not quite understand in what sense the expected loss respects information flow. – Richard Hardy Nov 17 '21 at 14:27
  • No that's not the correct definition of a loss function. A loss function, also called a disutility or cost function, takes not only those things into account but also takes into account consequences of all the possible decisions. Although you can speak of things like "mean squared error loss" it is the full loss function that is used in decision making. – Frank Harrell Nov 17 '21 at 16:51
  • As far as I can tell, I am doing exactly that. I have explicitly specified $L(\hat{Y},Y)$ alongside both its input domain ($\{0,1\}\times\{0,1\}$) and its output domain ($\{0,a,b\}$). I have not mentioned mean squared error loss anywhere. I am of course talking about evaluation loss (costs of decisions at different states of nature), not fitting loss (what is optimized in parameter estimation). – Richard Hardy Nov 17 '21 at 17:09
  • I may be missing something basic. What's an example of an L( , ) that represents non-data consequences of decisions? – Frank Harrell Nov 17 '21 at 19:09
  • The example is given in my post by listing all possible combinations of the function's arguments and the corresponding values (losses). So the function is defined by explicit enumeration. – Richard Hardy Nov 18 '21 at 06:02
  • Thanks. My bet is that there exist parameters of the loss function for which a dominating ROC curve is not sufficient. But I can't point to a proof; looking forward to more info. – Frank Harrell Nov 18 '21 at 13:39
  • My guess is the same as your bet. Thank you for the discussion! – Richard Hardy Nov 18 '21 at 13:47