Why is ROC AUC equivalent to the probability that two randomly-selected samples are correctly ranked?

Question

I found there are two ways to understand what AUC stands for but I couldn't get why these two interpretations are equivalent mathematically.

In the first interpretation, AUC is the area under the ROC curve. Picking points from 0 to 1 as threshold and calculate sensitivity and specificity accordingly. When we plot them against each other, we get ROC curve.

The second one is that the AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example, i.e. P(score(x+)>score(x−)). (from What does AUC stand for and what is it?)

Artem Sobolev · Accepted Answer · 2022-06-28T07:04:20.113

It's easy to see once you obtained a closed-form formula for AUC.

Since we have finite number of samples $\{(x_i, y_i)\}_{i=1}^N$, we'll have finite number of points on the ROC curve. We do linear interpolation in between.

First, some definitions. Suppose we'd like to evaluate an algorithm $A(x)$ that outputs a probability of $x$ lying in the positive class $+1$. Let's define $N_+$ as the number of samples in the positive class $+1$ and $N_-$ as the number of samples in the negative class $-1$. Now, for a threshold $\tau$ let's define False-Positive-Rate (FPR, aka 1-specificity) and True-Positive-Rate (TPR, aka sensitivity):

$$ \text{TPR}(\tau) = \frac{\sum_{i=1}^N [y_i = +1] [A(x_i) \ge \tau]}{N_+} \quad \text{and} \quad \text{FPR}(\tau) = \frac{\sum_{i=1}^N [y_i = -1] [A(x_i) \ge \tau]}{N_-} $$

(where $[\text{boolean expression}]$ is 1 if expression is true, and 0 otherwise). Then, ROC curve is built from points of the form $(\text{FPR}(\tau), \text{TPR}(\tau))$ for different values of $\tau$. Moreover, it's easy to see that if we order our samples $x_{(i)}$ (note the parentheses) according to the algorithm's output $A(x_i)$, then neither $\text{TPR}$ nor $\text{FPR}$ changes for $\tau$ between consecutive samples $A(x_{(i)}) < \tau < A(x_{(i+1)})$. So it's enough to evaluate FPR and TPR only for $\tau \in \{A(x_{(1)}), \dots, A(x_{(N)})\}$. For $k^{\text{th}}$ point we have

$$ \text{TPR}_k = \frac{\sum_{i=k}^N [y_{(i)} = +1]}{N_+} \quad \text{and} \quad \text{FPR}_k = \frac{\sum_{i=k}^N [y_{(i)} = -1]}{N_-} $$

(Note both sequences are non-increasing in $k$). These sequences define x and y coordinates of points on the ROC curve. Next, we linearly interpolate these points to get the curve itself and calculate area under the curve (Using a formula for area of a trapezoid):

$$ \begin{align*} \text{AUC} &= \sum_{k=1}^{N-1} \frac{\text{TPR}_{k+1} + \text{TPR}_{k}}{2} (\text{FPR}_{k} - \text{FPR}_{k+1}) \\ &= \sum_{k=1}^{N-1} \frac{\sum_{i=k+1}^N [y_{(i)} = +1] + \tfrac{1}{2} [y_{(k)} = +1]}{N_+} \frac{[y_{(k)} = -1]}{N_-} \\ &= \frac{1}{N_+ N_-} \sum_{k=1}^{N-1} \sum_{i=k+1}^N [y_{(i)} = +1] [y_{(k)} = -1] = \frac{1}{N_+ N_-} \sum_{k < i} [y_{(k)} < y_{(i)}] \end{align*} $$

Here I used the fact that $[y = -1] [y = +1] = 0$ for any $y$.

So there you have it: AUC is proportional to the number of correctly ordered pairs, which is proportional to the probability of random pair of samples being ranked according to their labels.

EDIT (6 years later): Since for $a, b \in \{-1, +1\}$ we have $[a < b] = 1$ only when $a = -1$ and $b = +1$, it's easy to see that $$ \frac{1}{N_+ N_-} \sum_{k < i} [y_{(k)} < y_{(i)}] = \frac{1}{N_+ N_-} \sum_{\substack{k < i \\ y_{(i)} = 1 \\ y_{(k)} = -1}} [y_{(k)} < y_{(i)}] $$

In essence, we form all possible negative-positive pairs and see what fraction of them is correctly ordered according to our algorithm $A$, that is, $A($positive sample$)\; > A($negative sample$)$.

Thanks @Barmaley.exe. This is really helpful. Finally connect the dots. A following up question though, is how AUC is equivalent to Mann Whitney U test. — Felicia.H, Jan 15 '16 at 19:48
I don't have an answer to that, but either way it'd not be a good idea to answer another (mostly unrelated) question here. I suggest you create a new thread. — Artem Sobolev, Jan 15 '16 at 20:09
This is all covered in Hanley and McNeil 1982: https://pubs.rsna.org/doi/10.1148/radiology.143.1.7063747 (Wilcoxon-Mann-Whitney equivalence etc.). — Frank Harrell, Jan 13 '19 at 13:17
Can you please clarify how did the numerator before the last step is rearranged? where did $1/2[y_(k) = +1]$ go? — christopher, May 11 '20 at 20:51
Hi @ArtemSobolev and thank you for your input. I did not yet carefully think about the derivation but one question though: Is the set ${A(x_{(1)}), \dots, A(x_{(N)})}$ enough? I mean when $\tau \in A(x_{(1)})$ everything is classified as positive and when $\tau \in A(x_{(N)})$ only the $(x_{(N)}, y_{(N)})$ sample is classified as positive. Should we still try another threshold like $A(x_{(N)})+c$ where $c$ is e.g. a very small number? Wouldn't this threshold be the case where everything is classified as negative? Wouldn't we now cover all the $(TPR(\tau), FPR(\tau))$ cases? — jjepsuomi, Nov 04 '20 at 16:40
To add to the previous, would it be more correct to say instead of: "...it's enough to evaluate FPR and TPR only for $\tau \in {A(x_{(1)}), \dots, A(x_{(N)})}$..." that "...it's enough to evaluate FPR and TPR only for $\tau \in {A(x_{(1)}), \dots, A(x_{(N)})}$ which cover all the TPR,FPR cases except the one where all points are classified as negative." — jjepsuomi, Nov 04 '20 at 16:59
This is a clear answer compared to the answers in the question linked here. But I wonder whether the last part $\sum_{(k < i)} [y_{(k)} < y_{(i)}]$ is really explains what we want here. Since $y_{(k)} < y_{(i)}$ for all $k < i$ based on our ordering. I think some conditioning needs to be there. Specifically $y_{(k)}$ has to be a true negative sample and $y_{(i)}$ has to be a true positive sample. — user2939212, Jun 27 '22 at 00:02
@user2939212, I think they are the same thing. Then $y_{(k)}=1$ then $y_{(k)} < y_{(i)}$ is always false, since both $1 < -1$ and $1 < 1$ is false. Same when $y_{(i)} = -1$. Hence the cases when $y_{(i)}$ is not a true positive or $y_{(k)}$ is not a true negative only contribute zeroes to the sum. — Artem Sobolev, Jun 28 '22 at 06:45
@user2939212, that said, having $N_+ N_-$ as normalizers indeed hints us that we should be enumerating a cartesian product of two sets, one of size $N_+$ and the other of size $N_-$, which is exactly what you're suggesting in the last sentence. I've added a paragraph on that. — Artem Sobolev, Jun 28 '22 at 07:04

Why is ROC AUC equivalent to the probability that two randomly-selected samples are correctly ranked?

1 Answers1

Linked

Related