
In Section 6.3.1 of the paper "No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems", it is mentioned that the algorithm proposed by the paper has better-than-random precision. What does better-than-random precision actually mean? Could you please help me understand this concept with some example?

Thank you!

  • This was a surprisingly fun paper! :) (+1) (And surprisingly the precision is actually quite a crucial point for this to work!) – usεr11852 Nov 22 '22 at 03:43

1 Answer


The authors mean that the Precision they achieve is better than the baseline Precision we would get by labelling points at random. For example, suppose we have 1000 examples, 200 positive and 800 negative. If we pick 150 points at random and label them all as positive, we expect ~30 of them to be truly positive, giving a Precision of ~20% (=$\frac{\sim30}{150}$). This CV.SE thread on What is "baseline" in precision recall curve explores this further. (Likewise, Recall-at-random is directly related to the size of the subsample we choose to label as positive: if we randomly label 50% of our whole sample as positive, we will find ~50% of our positive instances, so our Recall will be ~50% too.)
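A quick simulation makes this baseline concrete. This is a minimal sketch using the illustrative counts from the example above (1000 examples, 200 positives, 150 randomly flagged points), not anything taken from the paper:

```python
import numpy as np

# Simulate the "precision at random" baseline: flag 150 of 1000 points at
# random and measure what fraction of the flagged points are truly positive.
rng = np.random.default_rng(0)

n_total, n_positive, n_flagged = 1000, 200, 150
y_true = np.zeros(n_total, dtype=bool)
y_true[:n_positive] = True  # 200 true positives, 800 true negatives

precisions = []
for _ in range(10_000):
    flagged = rng.choice(n_total, size=n_flagged, replace=False)
    precisions.append(y_true[flagged].mean())  # precision of this random labelling

print(f"Mean precision of a random labeller: {np.mean(precisions):.3f}")  # ~0.20 = 200/1000
```

The mean comes out at roughly 0.20, i.e. the base rate of positives; "better-than-random precision" just means beating that number.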

For this paper, the authors report the precision metric because the clustering has to detect clusters with a high enough proportion of the poorly-performing subclasses for the second stage of their procedure, i.e. the model trained using these clusters as groups for grouped distributionally robust optimization (GDRO), to substantially improve performance on each of those subclasses. If the clusters had low precision, GDRO would have very noisy group labels to work with.
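To illustrate the kind of check this implies, here is a hypothetical sketch that measures per-cluster precision with respect to a rare, poorly-performing subclass; the `subclass` and `cluster` arrays are random stand-ins, not outputs of the paper's actual pipeline:

```python
import numpy as np

# Hypothetical check: does any cluster concentrate the rare subclass with
# better-than-random precision? The random baseline is simply the base rate.
rng = np.random.default_rng(1)

n = 1000
subclass = rng.random(n) < 0.05          # 5% of points belong to the rare subclass
cluster = rng.integers(0, 10, size=n)    # some cluster assignment over 10 clusters

base_rate = subclass.mean()              # precision a random grouping would achieve
for k in range(10):
    members = cluster == k
    precision_k = subclass[members].mean() if members.any() else 0.0
    verdict = "better than random" if precision_k > base_rate else "no better than random"
    print(f"cluster {k}: precision {precision_k:.2f} vs baseline {base_rate:.2f} ({verdict})")
```

With random clusters, every cluster hovers around the base rate; a useful clustering for the GDRO stage is one where some cluster clearly exceeds it.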
