
We have two models outputting estimates of class probabilities. Combined with a probability cutoff / threshold, these yield classification decisions: if the estimated probability of class 1 is above the threshold, the assigned label is class 1; otherwise, it is class 0. We want to compare the models in terms of their estimated expected loss for a given threshold. The loss function $L(\hat{Y},Y)$ is given by \begin{aligned} L(0,0)&=0, \\ L(0,1)&=a, \\ L(1,0)&=b, \\ L(1,1)&=0 \end{aligned} with $a,b>0$. The estimated losses on a test subsample are not available. However, we have the ROC curves on the test subsample for each model. From each curve, we can obtain the point that corresponds to the optimal threshold as derived from the loss function. (By optimal threshold I mean the one that minimizes the estimated expected loss. For the general loss function specified above, it is $\frac{b}{a+b}$, assuming the estimated probabilities are reasonably well calibrated.)
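For reference, the stated threshold follows from a standard decision-theoretic argument: if $q$ denotes the (calibrated) estimated probability that $Y=1$, the conditional expected losses of the two possible decisions are $$\mathbb{E}[L \mid \hat{Y}=1] = b(1-q), \qquad \mathbb{E}[L \mid \hat{Y}=0] = aq,$$ so predicting class 1 is optimal exactly when $b(1-q) < aq$, i.e. when $q > \frac{b}{a+b}$.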

Question: Does this pair of points on the ROC curves contain sufficient information to conclude which classification algorithm has the lower estimated expected loss? If not, could you offer a counterexample?

Related question: "Are non-crossing ROC curves sufficient to rank classifiers by expected loss?".

Richard Hardy
  • 67,272
  • For general loss functions? If so, what conditions do they fulfill? Or can you give any other information on the loss function you are working with? – Stephan Kolassa Oct 30 '21 at 18:10
    @StephanKolassa, what we know about the loss function is $L(0,0)=L(1,1)=0, L(0,1)=a, L(1,0)=b$ with $a,b>0$. The optimal threshold is $\frac{b}{a+b}$. – Richard Hardy Oct 30 '21 at 18:47
  • So the loss is calculated based on the thresholded $\hat{y}$, which can only be 0 or 1, nothing in between? – rep_ho Nov 05 '21 at 10:18
  • @rep_ho, the loss is based on a classification that uses the optimal threshold $\frac{b}{a+b}$ which can be neither 0 nor 1 by construction ($a,b>0$). $\hat{Y}$ is either 0 or 1 as it is a class label. (Not sure if I answered your question.) – Richard Hardy Nov 05 '21 at 10:31
  • But each algorithm does output a continuous prediction that is used to construct the ROC curves, right? – rep_ho Nov 05 '21 at 10:35
  • @rep_ho, yes, the algorithm outputs the estimated probabilities, not the class labels. The labels are decided only subsequently, and the decision is based on the optimal threshold. Basically, I wonder if I can rank a pair of algorithms just by looking at their ROC curves when my ranking criterion is expected loss. Imagine that the two ROC curves do not cross, so one algorithm seems superior to the other for every possible threshold. Is it also superior in terms of expected loss, or does that not follow from non-crossing ROC curves? – Richard Hardy Nov 05 '21 at 10:36
  • Richard, how do you conclude that the threshold should be b/(a+b)? Isn't the threshold also dependent on the false positive rate, the false negative rate, and the distribution of false and true among the population? – Sextus Empiricus Sep 08 '22 at 16:45
  • This question seems to be about "we use a ROC curve and a loss function, which is based on false positive and false negative rates, to determine an expected loss", and the problem is "can we use the obtained expected loss to rank classifiers". The problems that may arise are: (1) comparing the loss distributions of two classifiers only by means of the expectation might not be the desired measure (e.g. one classifier might have a lower average loss but a larger variance of the loss); (2) the estimation of the expected loss is incorrect; (3) errors in the estimates are not included in the analysis. – Sextus Empiricus Sep 08 '22 at 18:14

2 Answers

3

I think the answer is no as the expected loss depends on $P(Y = 1)$, and this information isn't given by a ROC curve.

Let's say you have a binary random variable $Y$ with $p = P(Y = 1)$, and denote by $\hat{Y}_t$ a classifier depending on a threshold (or more generally a parameter) $t$.

The expected loss of classifier $\hat{Y}_t$ is $$ \begin{array}{ccl} L(\hat{Y}_t) &= &a P(\hat{Y}_t = 0 \cap Y = 1) + b P(\hat{Y}_t = 1 \cap Y = 0)\\ & = & a \cdot p\cdot P(\hat{Y}_t = 0 \mid Y = 1) + b \cdot (1-p)\cdot P(\hat{Y}_t = 1 \mid Y = 0). \end{array}$$

The ROC curve only gives you the conditional probabilities $P(\hat{Y}_t = 1 \mid Y = 0)$ and $P(\hat{Y}_t = 0 \mid Y = 1)$ as a function of $t$, but they don't give you $p$.
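To make the dependence on $p$ concrete, the two conditional error rates from a single ROC point can be combined with a prevalence $p$ into the expected loss; a minimal sketch (the function name and all parameter values are illustrative):

```python
# Expected loss a*P(Yhat=0, Y=1) + b*P(Yhat=1, Y=0) from one ROC point,
# where the point gives FPR = P(Yhat=1 | Y=0) and TPR = P(Yhat=1 | Y=1).
# The prevalence p = P(Y=1) must be supplied separately: the ROC curve
# does not contain it.

def expected_loss(fpr, tpr, p, a, b):
    fnr = 1.0 - tpr                           # P(Yhat=0 | Y=1)
    return a * p * fnr + b * (1.0 - p) * fpr  # weighted by class probabilities
```

With an asymmetric point such as $(0.1, 0.6)$ and $a=b=1$, the same ROC point yields a loss of $0.13$ at $p=0.1$ but $0.37$ at $p=0.9$, illustrating that the point alone does not pin down the loss.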

Consider for example two (very bad) classifiers $\hat{Y}_t$ and $\hat{Z}_t$ with $$ \begin{array}{ccl|ccl} P(\hat{Y}_t = 1 \mid Y = 0) &=& 1/2 & P(\hat{Z}_t = 1 \mid Y = 0) & = & t\\ P(\hat{Y}_t = 0 \mid Y = 1) &=& 1 - t & P(\hat{Z}_t = 0 \mid Y = 1) & = & 1/2 \end{array} $$ so the ROC curve of $\hat{Y}_t$ is the vertical segment at false positive rate $1/2$, and that of $\hat{Z}_t$ is the horizontal segment at true positive rate $1/2$.

The expected loss of $\hat{Y}_t$ is $$L(\hat{Y}_t) = \frac{1}{2}(1-p)b + (1-t)a p$$ and the expected loss of $\hat{Z}_t$ is $$L(\hat{Z}_t) = t b (1-p) + \frac{1}{2} a p.$$ In particular, at their respective loss-minimizing parameters ($t=1$ for $\hat{Y}_t$, $t=0$ for $\hat{Z}_t$), the losses are $\frac{1}{2}(1-p)b$ and $\frac{1}{2}ap$, so there is no way to tell which classifier is better without knowing $p$.
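The counterexample can be checked numerically under one reading of the setup: each classifier is evaluated at its own loss-minimizing parameter $t$, and the ranking flips with the prevalence $p$. The loss values $a=b=1$ and the grid of $t$ values are illustrative assumptions:

```python
# Expected losses of the two toy classifiers as functions of t and p.
def loss_Y(t, p, a=1.0, b=1.0):
    return 0.5 * (1 - p) * b + (1 - t) * a * p  # FPR = 1/2, FNR = 1 - t

def loss_Z(t, p, a=1.0, b=1.0):
    return t * b * (1 - p) + 0.5 * a * p        # FPR = t,   FNR = 1/2

ts = [i / 100 for i in range(101)]              # grid over the parameter t

def best(loss, p):
    """Minimum expected loss over the parameter grid."""
    return min(loss(t, p) for t in ts)

print(best(loss_Y, 0.3), best(loss_Z, 0.3))     # low prevalence: Z_t wins
print(best(loss_Y, 0.7), best(loss_Z, 0.7))     # high prevalence: Y_t wins
```

At $p=0.3$ the minima are $0.35$ for $\hat{Y}_t$ and $0.15$ for $\hat{Z}_t$; at $p=0.7$ the values swap.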

However, if one ROC curve dominates another, meaning it lies above it everywhere, then you know that for any probability $p$ and any losses $a$ and $b$, the dominating classifier will have lower expected loss than the other (this follows directly from the expression of the expected loss).

Indeed, if the ROC curve of $\hat{Y}_t$ is above the ROC curve of $\hat{Z}_t$, then for each $t$, the ROC point of $\hat{Y}_t$ is to the left of and/or above the ROC point of $\hat{Z}_t$; this implies that $$P(\hat{Y}_t = 1 \mid Y = 0) \leq P(\hat{Z}_t = 1 \mid Y = 0)$$ and $$P(\hat{Y}_t = 1 \mid Y = 1) \geq P(\hat{Z}_t = 1 \mid Y = 1), $$ and thus $$P(\hat{Y}_t = 0 \mid Y = 1) \leq P(\hat{Z}_t = 0 \mid Y = 1).$$

Then, for any $a \geq 0$, $b\geq 0$ and $0\leq p \leq 1$, $$\begin{array}{ccl} L(\hat{Y}_t) & = & P(\hat{Y}_t = 0 \mid Y = 1) \cdot p \cdot a + P(\hat{Y}_t = 1 \mid Y = 0) \cdot (1-p) \cdot b \\ & \leq &P(\hat{Z}_t = 0 \mid Y = 1) \cdot p \cdot a + P(\hat{Z}_t = 1 \mid Y = 0) \cdot (1-p) \cdot b \\ & = & L(\hat{Z}_t) \end{array}. $$

I hope this helps.

Pohoua
  • 2,548
  • Comments are not for extended discussion; this conversation has been moved to chat. – Sycorax Nov 11 '21 at 14:42
  • While we did not really get to the bottom of it, your answer was helpful, so you are getting the bounty :) – Richard Hardy Nov 16 '21 at 10:15
  • I have decided to split the thread, as the two questions in it were distinct enough so that we got one of them answered but not the other one. I have linked the new thread by updating my post. You may consider splitting off your answer to the second question from this answer and posting it in the new thread instead. Thank you. – Richard Hardy Nov 17 '21 at 04:18
1

Given a threshold* $t$, model 1 has lower estimated expected loss than model 2 if the corresponding ROC point of model 1 dominates** the ROC point of model 2. Here is why.

Let the confusion matrix corresponding to a particular threshold $t$ be $$ \text{Conf}_t=\begin{pmatrix} j_t & k_t\\ l_t & m_t \end{pmatrix} $$ with predicted classes in rows (row 1 ~ class 0, row 2 ~ class 1) and actual classes in columns (column 1 ~ class 0, column 2 ~ class 1). Concretely, $$ \text{Conf}_t=\begin{pmatrix} \#\{\hat Y=0 \cap Y=0\}_t & \#\{\hat Y=0\cap Y=1\}_t \\ \#\{\hat Y=1\cap Y=0\}_t & \#\{\hat Y=1\cap Y=1\}_t \end{pmatrix} $$ with $\#$ counting the number of elements that satisfy the condition. We will later add a subscript 1 for model 1 and 2 for model 2.

For any given sample, the number of actual zeros $j_t+l_t$ and the number of actual ones $k_t+m_t$ are fixed at $r$ and $s$, respectively: \begin{aligned} j_t+l_t &= r \quad \text{and} \\ k_t+m_t &= s. \end{aligned} We will make use of the latter equality in a subsequent step. Let us also define the sample size $$ n:=j_t+k_t+l_t+m_t. $$

The estimated expected loss of a model is \begin{aligned} \hat{\mathbb{E}}(L) &= \frac{1}{n}\big[ak_{t}+bl_{t}\big] \\ &= \frac{1}{n}\big[a(s-m_{t})+bl_{t}\big]. \end{aligned} Explicitly, the estimated expected losses of models 1 and 2 are \begin{aligned} \hat{\mathbb{E}}(L_1) &= \frac{1}{n}\big[a(s-m_{1t})+bl_{1t}\big] \quad \text{and} \\ \hat{\mathbb{E}}(L_2) &= \frac{1}{n}\big[a(s-m_{2t})+bl_{2t}\big]. \end{aligned}
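The estimated expected loss is straightforward to compute from the confusion-matrix counts; a minimal sketch following the notation above (the function name and example counts are illustrative):

```python
# Estimated expected loss (a*k + b*l)/n from confusion-matrix counts:
#   j = #(Yhat=0, Y=0), k = #(Yhat=0, Y=1),
#   l = #(Yhat=1, Y=0), m = #(Yhat=1, Y=1).

def estimated_expected_loss(j, k, l, m, a, b):
    n = j + k + l + m          # sample size
    return (a * k + b * l) / n # average loss over the sample
```

For instance, with counts $(j,k,l,m)=(40,10,5,45)$ and $a=1$, $b=2$, the estimate is $(10+10)/100=0.2$.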

The ROC points (specific to the threshold $t$) of models 1 and 2 have coordinates $(l_{1t},m_{1t})$ and $(l_{2t},m_{2t})$, respectively (these are false positive and true positive counts; dividing by the fixed $r$ and $s$ would give the usual false positive and true positive rates). If the former point dominates the latter, we have $l_{1t}\leq l_{2t}$ and $m_{1t}\geq m_{2t}$, with at least one of the two inequalities strict.

What does this imply regarding $\hat{\mathbb{E}}(L_1)$ vs. $\hat{\mathbb{E}}(L_2)$? Since $a,b>0$, the inequalities $m_{1t}\geq m_{2t}$ and $l_{1t}\leq l_{2t}$ give $a(s-m_{1t})\leq a(s-m_{2t})$ and $bl_{1t}\leq bl_{2t}$, with at least one inequality strict, so $\hat{\mathbb{E}}(L_1)<\hat{\mathbb{E}}(L_2)$. Thus if the ROC point of model 1 dominates the ROC point of model 2, model 1 has lower estimated expected loss than model 2.
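The dominance argument can be checked numerically; a sketch assuming a hypothetical test sample with $n=100$ and $s=50$ actual positives (all counts and loss parameters are illustrative):

```python
# Point (l1, m1) dominates (l2, m2) if it has no more false positives,
# no fewer true positives, and at least one inequality is strict.
def dominates(l1, m1, l2, m2):
    return l1 <= l2 and m1 >= m2 and (l1 < l2 or m1 > m2)

# Estimated expected loss (a*k + b*l)/n, using k = s - m.
def est_loss(l, m, s, n, a, b):
    return (a * (s - m) + b * l) / n

# Hypothetical counts: model 1 at (5, 45) dominates model 2 at (10, 40),
# so its estimated expected loss is lower for every choice of a, b > 0.
assert dominates(5, 45, 10, 40)
for a, b in [(1, 1), (1, 5), (5, 1)]:
    assert est_loss(5, 45, 50, 100, a, b) < est_loss(10, 40, 50, 100, a, b)
```

Note that the conclusion holds for all $a,b>0$ simultaneously, which is exactly what the inequality chain above establishes.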

*The relevant threshold would be the optimal one.
**{ Is above and to the left } OR { is above and not to the right } OR { is to the left and not below }.

Richard Hardy