
I want to estimate the recall of a binary classifier. I have a dataset of ~1B examples but I don't know the ground truth; the only thing I know is that positives are extremely rare. I can randomly sample, say, 1,000 examples and assign ground truth manually (a very expensive and time-consuming process). Given how rare the positives are, it is very likely that I would need to label a very big sample just to get a few positives and estimate metrics such as precision and recall.
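
For a rough sense of scale (the prevalence below is just a guess for illustration, since the true rate is exactly what I don't know):

```python
# Back-of-the-envelope check; the prevalence is purely an assumed figure.
prevalence = 1e-5      # assumed rate of positives (unknown in practice)
sample_size = 1_000    # examples I can afford to label manually

expected_positives = prevalence * sample_size
print(f"Expected positives in the sample: {expected_positives:.2f}")
# With prevalence 1e-5 this is ~0.01, i.e. on average ~100,000 labels
# would be needed just to see a single positive.
```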

Is there a way to estimate recall in such cases where positives are extremely rare and ground truth labeling of the entire dataset is infeasible (e.g. using smart sampling methods)?

Thank you!

Ricky
  • (1) Once you have your manually assigned labels and your classifier, why can't you calculate FN? (2) Recall (and precision etc.) all suffer from the same issues as accuracy, especially in "unbalanced" situations. Given that you seem to be in a high-stakes environment, I strongly recommend you invest some time in reconsidering your evaluation metric. See also here. ... – Stephan Kolassa Nov 28 '22 at 09:48
  • ... (3) That said, is your question really one of how to determine data points to label manually (and expensively) to get the most precise estimate of recall (or a better evaluation metric)? – Stephan Kolassa Nov 28 '22 at 09:49
  • Thank you for your answers. (1) My bad, I did not explain it well: given how rare the events (positives) are, I'm obliged to do biased sampling to increase my probability of catching them. Such a biased sample won't give me useful information on the actual classifier performance. (2) I'm looking into it, thanks! (3) My question is, as you pointed out, "how to determine data points to label manually to get the most precise estimate of recall". – Ricky Nov 28 '22 at 12:18
  • I have edited the question, I hope now it's clearer. – Ricky Nov 28 '22 at 20:15
  • As I think @StephanKolassa is hinting, the problem is perhaps your evaluation metric. Biased sampling is quite typical in the medical field for similar reasons. Assuming you have a probability output, it is relatively easy to reweight the estimate. – seanv507 Nov 28 '22 at 20:44
  • @seanv507: I agree in general, but in cases where we have no prior knowledge we might have a biased sample with no way to de-bias it. For example, we might expect older men to exhibit a particular trait more than younger women, but without some baseline we cannot really make any real assessment about it; is it 3x more likely? 6x? 18x? Especially if we start sampling our "high-chance" instances and we do not find positive cases, what happens then? – usεr11852 Jan 05 '23 at 14:08

1 Answer


Ultimately this is not a precision/recall calculation question but rather a sample labelling/data quality one. In that respect, there are no silver bullets: if we are unwilling to invest the time, we cannot expect to get the appropriate answer. I would suggest starting with a literature review to construct a loose prior about the proportion $p$ of positives. Then we go ahead with random sampling and update the posterior as we label more examples.
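
As a minimal sketch of such a Bayesian update (the prior parameters and the labelling counts below are made-up assumptions for illustration only):

```python
from scipy import stats

# Loose Beta prior on the prevalence p, e.g. informed by a literature review
# suggesting positives are roughly on the order of 1 in 10,000 (an assumption).
prior_a, prior_b = 1, 10_000

# Hypothetical labelling round: 2 positives found among 5,000 randomly sampled labels.
positives, labelled = 2, 5_000

# Conjugate Beta-Binomial update: Beta(a + positives, b + negatives).
posterior = stats.beta(prior_a + positives, prior_b + (labelled - positives))

print("Posterior mean of p:  ", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```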

As already mentioned, precision and recall might not be the best metrics to employ. In an imbalanced learning setting, assigning relevant misclassification costs can be very helpful. This can help you choose the threshold for the actual label assignment, given (well-calibrated) probabilistic estimates.
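
For instance, with calibrated probabilities one can pick the cost-minimising threshold directly; a small sketch (the cost values are placeholders, in practice they come from the domain):

```python
# Choose the label-assignment threshold that minimises expected cost,
# assuming the classifier outputs well-calibrated probabilities.
cost_fn = 50.0   # assumed cost of missing a true positive (false negative)
cost_fp = 1.0    # assumed cost of flagging a true negative (false positive)

# Flag a case as positive when the expected cost of not flagging it
# (cost_fn * p) exceeds the expected cost of flagging it (cost_fp * (1 - p)).
threshold = cost_fp / (cost_fp + cost_fn)
print(f"Flag as positive when P(positive) > {threshold:.4f}")
```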

Regarding the "smart sampling methods" commentary: while counter-intuitive, my suggestion would be to avoid them for now. Stick with simple random sampling (SRS) in order to have an unbiased baseline.

It is tempting to start thinking about active learning or other domain-adaptation methods applied to sample-bias correction. These methods can be great if we are indeed trying to find positives and/or to distinguish positive from negative examples (in short, we would try to sample near our classifier's decision boundary). They would nevertheless give us a biased view of the true baseline rate, with no way of de-biasing it unless we already have access to some unbiased dataset. That is not "ML's fault": even the more classical statistical methods for de-biasing convenience samples/observational data (e.g. importance sampling or post-stratification) need some known population/reference distribution to work with. If that is unavailable, their applicability is moot. Therefore, start with SRS (or some relatively straightforward sampling strategy) to build that reference distribution, and only then do something fancier.
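
To illustrate what even a small SRS baseline buys you (the counts below are invented for the sketch): once a handful of true positives have been confirmed in the random sample, recall can be estimated among them, and an exact binomial interval makes the (typically large) uncertainty explicit.

```python
from scipy import stats

# Hypothetical outcome of labelling a simple random sample:
# among the sampled examples confirmed as true positives,
# how many did the classifier actually flag?
true_positives_found = 8     # manually confirmed positives in the SRS
flagged_by_classifier = 6    # of those, how many the classifier predicted positive

recall_hat = flagged_by_classifier / true_positives_found

# Clopper-Pearson (exact) 95% interval for a binomial proportion;
# with so few positives the interval is wide, which is the honest answer.
alpha = 0.05
lower = stats.beta.ppf(alpha / 2, flagged_by_classifier,
                       true_positives_found - flagged_by_classifier + 1)
upper = stats.beta.ppf(1 - alpha / 2, flagged_by_classifier + 1,
                       true_positives_found - flagged_by_classifier)
print(f"Estimated recall: {recall_hat:.2f}, 95% CI: ({lower:.2f}, {upper:.2f})")
```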

usεr11852