5

I need to understand how a particular statistical challenge has been formally recognised, how it is commonly described in the literature, and what the best academic resources discussing it are. Here's the problem: suppose you need to detect a rare phenomenon, and in classifying the whole population (no sampling needed) you want to consider the effects of different hypothetical rates of classification error. To illustrate, assume the following hypothetical reality:

  • total population (N) = 1 million
  • rare phenomenon of interest is present in 1 in 1,000 of the population (P) and absent in the remaining 999,000 (A)

A 100% accurate classification would therefore yield 1,000 true positives and 999,000 true negatives. But we cannot assume our classification is perfect. Let's assume false negatives are relatively few (certainly having a negligible effect on the estimate of A). The real problem is the effect that even a tiny false-positive rate has on the estimate of P: if just 1 in 1,000 of the A observations are misclassified, the estimate of P is roughly doubled.
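To make the arithmetic concrete, here is a small sketch (the 5% false-negative rate is an arbitrary assumption for illustration). The correction at the end is the Rogan-Gladen estimator (Rogan & Gladen, 1978), one standard formalisation of prevalence estimation under imperfect classification:

```python
N = 1_000_000        # total population
true_P = 1_000       # truly positive (1 in 1,000)
true_A = N - true_P  # truly negative

fpr = 1 / 1_000      # hypothetical false-positive rate among A
fnr = 0.05           # hypothetical (small) false-negative rate among P

# Expected observed positives: surviving true positives plus misclassified negatives
observed_pos = true_P * (1 - fnr) + true_A * fpr
print(observed_pos)  # ~1949: the naive estimate of P is roughly doubled

# Rogan-Gladen correction: recover true prevalence from apparent prevalence,
# given assumed sensitivity and specificity
sens = 1 - fnr
spec = 1 - fpr
apparent = observed_pos / N
corrected = (apparent + spec - 1) / (sens + spec - 1)
print(round(corrected * N))  # 1000: the true count is recovered
```

Of course the correction only holds to the extent that the assumed sensitivity and specificity are right, which is exactly the circularity discussed in the comments below.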

Grateful for any insight the community can offer on how this has been formalised and where I can find more information. It feels like the domain of Bayesian statistics, in which I'm not very expert.

geotheory
  • You need to start by asking if the misclassification costs are unequal, i.e. are false-positive errors more costly than false-negative ones, and if so by how much. It may be that assigning all patterns to the majority class is the optimal solution in terms of minimising the expected cost, in which case just do that, but it also may allow meaningful classification. See https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance . However, even in that case you can still estimate tendencies as @FrankHarrell suggests. – Dikran Marsupial Nov 11 '23 at 23:03
  • The body of the question seems inconsistent with the title. Do you just want to estimate the prevalence of the phenomenon in the population, or do you want to identify the cases of that phenomenon in the population? – Dikran Marsupial Nov 11 '23 at 23:06
  • I should clarify the context here is establishing confidence in existing classifications in our data based on the way it has been collected. Let's say we are dealing with self-reported survey data. – geotheory Nov 11 '23 at 23:13
  • I have probably framed the question a bit misleadingly - apologies for that. I have tried to formulate the question in general terms, but that might have caused confusion. – geotheory Nov 11 '23 at 23:16
  • what do yo mean by "establishing confidence"? – Dikran Marsupial Nov 11 '23 at 23:17
  • Let's say we have a survey of self-reported classifications along the lines of the example in the question. If 1 in 1,000 of the majority group have misunderstood or misreported (a plausible scenario), then our estimate for the minority population is totally unreliable. So by "establishing confidence" I mean evaluating how trustworthy our data is for the purpose of population estimation of P. – geotheory Nov 11 '23 at 23:22
  • Ah, I see. I take it you don't have any ground truth classifications for any of the subjects? If that is the case, I don't think you can do much without making assumptions about the misclassification rates of the "classification", which would lead to a circular argument as that is essentially what you are trying to estimate. – Dikran Marsupial Nov 11 '23 at 23:28
  • Exactly, we only have the data. But it seems to me we should still be able to make some assessment of the reliability of the minority subgroup's population estimate, based on plausible hypothetical rates of misclassification and the ostensible size difference between the majority and minority subgroups that the data is telling us. It seems to me this has to be a well-established problem in statistics, which is what I'm trying to pin down. – geotheory Nov 11 '23 at 23:37
  • I'm not sure how that would establish confidence in the classification, as it would just be looking at the effect of your confidence in the classifier ("plausible hypothetical rates") on the population probability. Of course you could estimate them using a simulation study (estimate the true numbers of positives and negatives from the assumed misclassification rates), but it wouldn't be meaningfully establishing confidence AFAICS. – Dikran Marsupial Nov 11 '23 at 23:44
  • I understand. Fortunately I'm not trying to pin down the truth. I'm just trying to establish if this problem (probable effect of misclassification on pop estimation for small minority subgroups) has been formally recognised. But I might repost this as a new question framed around survey data. – geotheory Nov 12 '23 at 00:45
  • The problem with the situation you describe in comments sounds a bit like assessing the reliability and validity of your survey questions (e.g., does the way the question is phrased bias the way respondents answer?). If this is what you're talking about, this is different from assessing errors due to randomness. Ideally, this is something that should have been assessed at the design stage of the survey (I know, not very useful to say that now the data have been collected!). – J-J-J Nov 12 '23 at 13:26
  • This article might help you: https://www.hindawi.com/journals/eri/2011/608719/ ("Estimating Prevalence Using an Imperfect Test", Peter Diggle, 2011) – J-J-J Nov 12 '23 at 18:28

2 Answers

6

If you just want to estimate the prevalence of the phenomenon in the population, rather than identify the particular instances of it, then I would use e.g. logistic regression to build a probabilistic classifier that estimates the probability of class membership, and then apply the algorithm of Saerens et al., an EM-type algorithm that adjusts the output of a probabilistic classifier to compensate for unknown operational class frequencies that differ from those in the training data. As a by-product it gives you an estimate of the operational class frequencies.
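A minimal sketch of that EM adjustment, assuming the classifier outputs calibrated posterior probabilities (the function name and toy setup below are illustrative, not from the paper):

```python
import numpy as np

def saerens_em(probs, train_prior, n_iter=100, tol=1e-8):
    """EM re-estimation of the operational positive-class prior from the
    outputs of a probabilistic classifier trained with prior `train_prior`
    (Saerens, Latinne & Decaestecker, 2002)."""
    prior = train_prior
    adjusted = probs
    for _ in range(n_iter):
        # E-step: re-weight each posterior for the current prior estimate
        num = probs * prior / train_prior
        den = num + (1 - probs) * (1 - prior) / (1 - train_prior)
        adjusted = num / den
        # M-step: the new prior is the mean adjusted posterior
        new_prior = adjusted.mean()
        if abs(new_prior - prior) < tol:
            return new_prior, adjusted
        prior = new_prior
    return prior, adjusted

# Toy check: a classifier calibrated at a 50/50 training prior, applied to
# operational data where the true positive rate is 10%
rng = np.random.default_rng(0)
y = rng.random(100_000) < 0.10
x = rng.normal(loc=np.where(y, 1.0, -1.0))
probs = 1 / (1 + np.exp(-2 * x))  # calibrated posterior under a 0.5 prior
est_prior, _ = saerens_em(probs, train_prior=0.5)
```

The estimated operational prior is exactly the prevalence estimate the question asks about; the adjusted posteriors can then be used for any downstream decision-making.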

Dikran Marsupial
5

This topic has been covered to the extreme on this site. Briefly, when either class A or class B is extremely rare, classification is an inappropriate goal because, as you have rightly noted, you can just classify everyone as being in the majority category, and it would be hard for any classifier to beat the accuracy of doing that. The inappropriateness of classification is discussed in detail here. Instead the goal is to estimate tendencies, i.e., probabilities, using a direct probability model such as binary logistic regression. Probabilities also form the basis for optimum decision making and for withholding action if probabilities are in the gray zone.
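A short sketch of that approach, assuming scikit-learn and purely synthetic data (all numbers here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=(n, 1))
# Rare phenomenon whose probability rises with x but stays small overall
p_true = 1 / (1 + np.exp(-(1.5 * x[:, 0] - 5)))
y = rng.random(n) < p_true

# Direct probability model: binary logistic regression
model = LogisticRegression(max_iter=1000).fit(x, y)
probs = model.predict_proba(x)[:, 1]  # tendencies, not hard classifications

# Decisions (including withholding action in the gray zone) come afterwards,
# from these probabilities combined with the costs of each kind of error.
```

Note that nothing here forces a classification threshold; the model's output is a probability for each subject, which is the quantity that remains meaningful however rare the minority class is.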

This is not a Bayesian vs. frequentist issue.

Frank Harrell
  • The hbiostat.org link is dead. Is there an updated URL? – geotheory Nov 11 '23 at 23:07
  • I think I need to clarify that the context here is not statistical modelling and I'm not talking about performance of a classifier per se but rather the existing classifications in our data. This is about the confidence we can have in our data based on the way it has been collected. Let's say we are dealing with self-reported survey data. – geotheory Nov 11 '23 at 23:10
  • Sorry - link fixed now. – Frank Harrell Nov 12 '23 at 13:01