
Given a model (returning scores between 0.0 and 1.0), and

  • 1000 positive input examples
  • 100000 negative input examples

I'd like to calculate the AUPRC.

Calling the model is costly, so I don't want to invoke it on all 101000 (1000+100000) samples. (~2k samples would be ok.)

The "normal" approach would be to evenly subsample from both sets to get 20+2000 samples and just use them.

However, my idea is to do the following (uneven subsampling) instead:

  • Keep all 1000 positive examples, but subsample the negative ones to 1000.
  • Get the scores from the model for these 1000+1000 samples.
  • Repeat the array of the 1000 negative scores 100 times to blow the scores back up to 1000+100000 (restoring the original positives-to-negatives ratio).
  • Use these "faked" 1000+100000 scores to calculate the AUPRC (see the sketch after this list).
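For concreteness, a minimal sketch of this procedure (the arrays positives and negatives and the scoring function model_predict are placeholders for my actual setup; scikit-learn is used for the AUPRC):

    import numpy as np
    from sklearn.metrics import average_precision_score

    rng = np.random.default_rng(0)

    # keep all 1000 positives, down-sample the 100000 negatives to 1000
    negatives_subset = rng.choice(negatives, size=1000, replace=False)

    pos_scores = model_predict(positives)         # 1000 model invocations
    neg_scores = model_predict(negatives_subset)  # 1000 model invocations

    # repeat the negative scores 100 times to restore the 1:100 class ratio
    neg_scores_repeated = np.tile(neg_scores, 100)

    y_true = np.concatenate([np.ones(len(pos_scores)), np.zeros(len(neg_scores_repeated))])
    y_score = np.concatenate([pos_scores, neg_scores_repeated])
    auprc_estimate = average_precision_score(y_true, y_score)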

My experiment (simulation) shows that the AUPRC is more accurate (closer to the one computed on the full dataset) when using the "1000+1000" approach (uneven subsampling with re-upsampling) compared to using 20+2000 (even subsampling).

So my question is: Is my uneven-subsampling-with-re-upsampling approach a generally valid method for cost reduction? (Or was I just "lucky" in my simulations because of the chosen distribution, for example?)

  • How do you figure that you reduce the number of times you invoke the model? – Dave Jul 28 '23 at 20:14
  • I would be somewhat more comfortable if you used the 1000 negative examples to estimate their distribution and then resample from it, so as to avoid issues with ties. – usεr11852 Jul 28 '23 at 23:25
  • @Dave I've just edited my question to better explain this. – Tobias Hermann Jul 30 '23 at 07:42
  • @usεr11852 Thanks for the suggestion! If I understand correctly, estimating the distribution would require prior knowledge about the scores, i.e., which curve (distribution type) to fit. Right? – Tobias Hermann Jul 30 '23 at 07:45
  • I wouldn't suggest using a parametric distribution (e.g. Gaussian). Given there are about 1000 random negative points, one could estimate the eCDF and then generate new, synthetic negative scores directly from it. No "prior knowledge about the scores" would be needed. Maybe stratifying the selection of the negative instances to estimate scores for will be beneficial (so their feature distribution is representative), but with 1000 points we should be pretty OK. – usεr11852 Jul 30 '23 at 21:31
  • @usεr11852 Ah, thanks for the explanation! I think your method can be used in any case, i.e., no matter if one uses the full dataset, an evenly subsampled one, or an unevenly subsampled one. It always increases the "resolution" for the integral. – Tobias Hermann Jul 31 '23 at 07:23
  • Yes. But let us not overlook that we do want a synthetic population where the proportion of positives to negatives is the same as the one in the original population. (And also, what is defined as "even subsampling" seems prone to sampling variability because the size is small.) – usεr11852 Jul 31 '23 at 09:08
  • @usεr11852 Thanks. I tried to implement the approach you suggested in my simulation, but so far the results don't look good. Maybe I'm sampling incorrectly from the eCDF? – Tobias Hermann Jul 31 '23 at 10:19
  • evaluate(np.arange(0.0, 0.1, 0.0001): why this 0.1, shouldn't it be 1.0? – usεr11852 Jul 31 '23 at 12:20
  • @usεr11852 Lol, of course, what a dull typo. Thanks a lot for spotting it! In the fixed version the CDF way at least wins sometimes. But most of the time, it does not. – Tobias Hermann Jul 31 '23 at 13:52
  • @usεr11852 In another minimal example, I realized my way of sampling from the CDF was still wrong, and found out how to do it correctly. It's very slow, but when porting this to our example, the result looks ok, i.e., your CDF approach is slightly better (unfortunately only by a very small difference) than my "repeat" approach in most iterations. Thank you very much for this interesting journey. I learned a lot. :) – Tobias Hermann Aug 01 '23 at 15:14
  • @usεr11852 why does the ECDF need to be computed / why not just directly do the one-liner np.random.choice(negatives_subset_100, replace=True, size=100_000)? – chicxulub Aug 01 '23 at 23:32
  • @chicxulub: Primarily avoidance of ties as these can create artefacts (e.g. sawtooth shapes) in the PR curve. – usεr11852 Aug 01 '23 at 23:47
  • @TobiasHermann "learned a lot." you win the Internet. :) – usεr11852 Aug 02 '23 at 22:20

1 Answer


Is it valid to calculate the AUPRC of a model by undersampling one class and then re-oversampling its results?

Yes, but only if you're careful in the implementation. Throwing re-sampled scores into scikit-learn's average_precision_score gives overestimates of roughly 25% relative error for your data. (Overestimation is demonstrated in the first version of this answer, which is based on this older version of the simulation code. I haven't completely ruled out that scikit-learn's AUPRC estimate on the whole sample is biased. Either way, you need to be careful in the implementation.) These issues are well-known enough that there's a paper proposing to throw out AUPRC in favor of a different metric:

Flach, Peter, and Meelis Kull. "Precision-recall-gain curves: PR analysis done right." Advances in neural information processing systems 28 (2015).

Back to your question. Empirically, the following re-sampling-based estimators are approximately unbiased:

  • $\text{repeat}$: down-sample the more prevalent class, compute predictions, then repeat them back to the original size

  • $\text{up-sample}$: down-sample the more prevalent class, compute predictions, then randomly up-sample them back to the original size (from @usεr11852's comment)

  • $\text{re-weigh}$: down-sample the more prevalent class, compute predictions, then re-weigh them in the computation of AUPRC (derived in the next section).

My implementations of the pure down-sampling estimator—

  • $\text{down-sample}$: down-sample the dataset uniformly at random, paying no attention to the class

—appear to be slightly biased, but I don't know why.

Derivation of the $\text{re-weigh}$ estimator

Repeating data can be codified as re-weighing in the AUPRC scoring function. So only the scoring function needs to be modified, not the data. Perhaps this fact saves you some compute for super large-scale evaluations.
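For illustration, re-weighing can be expressed with scikit-learn's sample_weight argument: giving each down-sampled negative a weight of 100 is equivalent to physically repeating its score 100 times. (The scores below are stand-ins drawn from Beta distributions; also note that the simulations later in this answer compute the re-weighed AUPRC manually rather than through scikit-learn.)

    import numpy as np
    from sklearn.metrics import average_precision_score

    rng = np.random.default_rng(0)

    # stand-in scores for 1000 positives and 1000 down-sampled negatives
    pos_scores = rng.beta(3.0, 2.0, 1000)
    neg_scores = rng.beta(2.0, 3.0, 1000)

    # repeat-based estimate: blow the negatives back up to 100000 copies
    auprc_repeat = average_precision_score(
        np.concatenate([np.ones(1000), np.zeros(100_000)]),
        np.concatenate([pos_scores, np.tile(neg_scores, 100)]))

    # weight-based estimate: same scores, each negative counted 100 times
    auprc_weighted = average_precision_score(
        np.concatenate([np.ones(1000), np.zeros(1000)]),
        np.concatenate([pos_scores, neg_scores]),
        sample_weight=np.concatenate([np.ones(1000), np.full(1000, 100.0)]))

    assert np.isclose(auprc_repeat, auprc_weighted)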

Let $P$ represent the universe where data are not re-sampled. Let $Q$ represent the universe where we re-sample. In your re-sampling scheme, $P(Y = 1) = 1\text{k} / (1\text{k} + 100\text{k})$ and $Q(Y = 1) = 1/2$.

Given a threshold $t \in (0, 1)$, call $p_P(t)$ the model's precision on $P$ at $t$, call $r_P(t)$ the recall, and $f_P(t)$ the false positive rate, i.e.,

$$ \begin{align*} p_P(t) &= P(Y = 1 \: | \: \hat{Y} > t) \\ r_P(t) &= P(\hat{Y} > t \: | \: Y = 1) \\ f_P(t) &= P(\hat{Y} > t \: | \: Y = 0). \end{align*} $$

The AUPRC score we'd like to estimate is that on $P$:

$$ \begin{align*} \text{AUPRC} &= \int_{t} p_P(t) dr_P(t) \\ &\approx \sum_{i=2}^{n} p_P(t_i) (r_P(t_i) - r_P(t_{i-1})). \\ \end{align*} $$

In the sum above, $\{t_i\}_{i=1}^{n}$ is a grid of decreasing thresholds over $(0, 1)$, which implies that $r_P(t_i)$ monotonically increases.

The game is to estimate each $p_P(t_i)$ and $r_P(t_i)$ using metrics from $Q$—the universe where it's less costly to compute model metrics. The trick is to realize that $Q$ and $P$ are equivalent when they're conditioned on the class ($Y = 0$ or $Y = 1$), because the class frequency is the only difference between $Q$ and $P$. This fact implies $r_P(t) = r_Q(t)$ and $f_P(t) = f_Q(t)$. The precision calculation requires some manipulation:

$$ \begin{align*} p_P(t) &= \frac{r_P(t) P(Y=1)}{P(\hat{Y} > t)} && \text{Bayes' rule} \\ &= \frac{r_Q(t) P(Y=1)}{P(\hat{Y} > t)} && \text{class-conditional} \end{align*} $$

where

$$ \begin{align*} P(\hat{Y} > t) &= r_P(t) P(Y=1) + f_P(t) (1 - P(Y=1)) && \text{total probability} \\ &= r_Q(t) P(Y=1) + f_Q(t) (1 - P(Y=1)) && \text{class-conditional}. \end{align*} $$

Plugging it back in:

$$ \begin{align*} p_P(t) = \frac{r_Q(t) P(Y=1)}{r_Q(t) P(Y=1) + f_Q(t) (1 - P(Y=1))} \end{align*} $$

This proves that $P(Y=1)$, $r_Q(t)$, and $f_Q(t)$ are sufficient to calculate $\text{AUPRC}$.
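Here's a minimal sketch of that computation (my illustration, separate from the linked simulation code): pos_scores and neg_scores are numpy arrays of model scores for the kept positives and the down-sampled negatives, and prior_pos is the known $P(Y=1)$ of the full dataset, e.g. $1\text{k} / 101\text{k}$.

    import numpy as np

    def reweighed_auprc(pos_scores, neg_scores, prior_pos, n_grid=1000):
        """Estimate the AUPRC at the original class balance from down-sampled predictions."""
        thresholds = np.linspace(1.0, 0.0, n_grid)  # decreasing grid, so recall increases
        recall = (pos_scores[None, :] > thresholds[:, None]).mean(axis=1)  # r_Q(t)
        fpr = (neg_scores[None, :] > thresholds[:, None]).mean(axis=1)     # f_Q(t)
        denom = recall * prior_pos + fpr * (1.0 - prior_pos)               # P(Y_hat > t)
        precision = np.divide(recall * prior_pos, denom,
                              out=np.ones_like(denom), where=denom > 0)    # p_P(t)
        # Riemann sum over the threshold grid, matching the AUPRC approximation above
        return float(np.sum(precision[1:] * np.diff(recall)))

    # e.g. reweighed_auprc(pos_scores, neg_scores, prior_pos=1_000 / 101_000)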

Simulations

Code for this simulation is available here.

Let's empirically evaluate the AUPRC estimators on the Beta-distributed predictions from your simulation:

[figure: histogram of the Beta-distributed prediction scores]

Repeating the estimator definitions (for easier reference):

  • $\text{down-sample}$: down-sample the dataset randomly

  • $\text{repeat}$: down-sample the more prevalent class, compute predictions, then repeat them back to the original size

  • $\text{up-sample}$: down-sample the more prevalent class, compute predictions, then randomly up-sample them back to the original size (from @usεr11852's comment)

    • as suggested in the linked comment, the implementation of the ECDF inverse includes linear interpolation. See the sample method in the code; a sketch of this step is shown right after this list
  • $\text{re-weigh}$: down-sample the more prevalent class, compute predictions, then re-weigh them in the computation of AUPRC.
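A rough sketch of that ECDF-based up-sampling step (my own illustration of the idea, not the exact sample method from the linked code):

    import numpy as np

    def upsample_from_ecdf(scores, n_new, rng=None):
        """Draw n_new synthetic scores by inverting the ECDF of `scores`,
        interpolating linearly between order statistics to avoid exact ties."""
        rng = np.random.default_rng() if rng is None else rng
        sorted_scores = np.sort(np.asarray(scores, dtype=float))
        # plotting positions of the order statistics on (0, 1)
        probs = (np.arange(1, sorted_scores.size + 1) - 0.5) / sorted_scores.size
        u = rng.uniform(size=n_new)
        return np.interp(u, probs, sorted_scores)  # inverse ECDF with linear interpolation

    # e.g. blow 1000 down-sampled negative scores back up to 100000 synthetic ones:
    # synthetic_negatives = upsample_from_ecdf(neg_scores, 100_000)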

The simulation computes the signed difference between the estimator's observed AUPRC and the "true" AUPRC—that on the full dataset. Each estimator uses the same number of model() calls, i.e., the model compute is held constant.

[figure: distribution of the signed AUPRC error for each estimator]

Here are summary statistics:

       error down-sample  error repeat  error up-sample  error re-weigh
count            500.000       500.000          500.000         500.000
mean              -0.021         0.009           -0.002           0.009
std                0.041         0.027            0.020           0.027
min               -0.094        -0.042           -0.047          -0.042
25%               -0.050        -0.010           -0.015          -0.010
50%               -0.031         0.004           -0.005           0.004
75%                0.001         0.024            0.009           0.024
max                0.220         0.113            0.066           0.113

Notes:

  • $\text{up-sample}$ has the lowest bias and variance. It gets bonus points b/c you can repeatedly up-sample, which allows you to be transparent about the variance from up-sampling. Though there's no cheap way to convey the important source of variation—that from the initial down-sampling.

  • $\text{down-sample}$ has significantly higher variance than the rest because it only looks at a handful of the less prevalent class. It's also weird b/c the direction of bias depends on the implementation. The first version of my simulation demonstrates overestimation, as does your simulation. Both used scikit-learn's implementation.

chicxulub

  • Thanks a lot! What an amazing answer! I'm surprised that they're all overestimators. At least from the down-sample approach, I would have expected a mean error of zero (even if I expected the standard deviation to be very high). It would be interesting to learn why. – Tobias Hermann Aug 03 '23 at 13:21
  • To understand why, I'm trying to replicate the result of down-sample being an overestimator. However, in my simplified experiment, it's an underestimator! Any idea what's going on here? – Tobias Hermann Aug 03 '23 at 14:03
  • In your experiment, negatives was accidentally copied instead of positives in the line positives_subset = np.random.choice(negatives, 10, replace=False). Fixing the line gives mean 0.05 and stdev 0.09. Still, I'd like to dig in to what exactly is causing overestimation. I was initially pretty certain that down-sample would not be biased. – chicxulub Aug 03 '23 at 20:08
  • Oh, damn copy-paste mistake! Thanks a lot for spotting it, and sorry for the confusion. Now I get to the same conclusion as you do. Looking forward to your further results! – Tobias Hermann Aug 04 '23 at 06:46
  • Or more generally phrased: the more samples we use, the lower the mean AUPRC. Shall I create a separate question for this problem? – Tobias Hermann Aug 04 '23 at 08:47
  • Hmm, I think it might actually be a problem w/ the way sklearn does things. I manually coded up the $\text{re-weigh}$ estimator (instead of using roc_curve) and it now has low bias (0.008) and lower stdev (0.026). I'll see if I can debug the other estimators. – chicxulub Aug 05 '23 at 05:45
  • Interesting! Thanks a lot! I wonder if it's possible to fix "down-sample" from this experiment in a similar way, i.e., without knowing the actual value for frac_positive_observed. – Tobias Hermann Aug 07 '23 at 06:25
  • @TobiasHermann I'm taking a while to get down to the root cause of the issue in sklearn. In the meantime, I think you might wanna look into this paper: Precision-Recall-Gain Curves: PR Analysis Done Right. The authors show that AUPRC is difficult to estimate (as I'm learning the hard way lol), and propose an alternative metric. – chicxulub Aug 07 '23 at 09:53
  • Also to clarify, we always know frac_positive_observed since it's just the fraction of positives in the full dataset. In your problem, e.g., it's 1k / (1k + 100k). – chicxulub Aug 07 '23 at 12:09
  • @TobiasHermann see the updated answer. All estimators except $\text{down-sample}$ have been largely debugged. Given the difficulty in debugging $\text{down-sample}$, and the results of your most recent simulation, opening a new question is a good idea. It's almost definitely due to thresholding on small data. But I can't pinpoint why it causes this level of bias, and I can't come up w/ a universally better alternative. Some visualizations should help solve that mystery. Finally, sorry for providing you w/ an initially incorrect answer! I'm never gonna trust thresholded data again :-) – chicxulub Aug 07 '23 at 12:16
  • Great answer (+1), just debugging sklearn implementation must have been a mess. – usεr11852 Aug 07 '23 at 13:07
  • @chicxulub Wow, amazing work!!! Do you consider contributing to scikit-learn? Looks like your implementation is an improvement over the existing one. – Tobias Hermann Aug 08 '23 at 08:57
  • I've just opened a new question for the general situation, in which we just take different sample sizes and don't know frac_positive_observed. – Tobias Hermann Aug 08 '23 at 08:58