
I am doing an information retrieval task where the goal is to estimate the total number of positive documents, rather than to determine which individual documents are positive or negative.

My approach so far has been to build a Logistic Regression classifier, use it to label each document as "positive" or "negative", and then count the documents labeled positive. But it dawned on me that I don't actually need a great classifier, because I should be able to estimate the real number of positive documents in a set if I know the classifier's precision and recall:

num_real_positives = num_classified_as_positive * (classifier_precision/classifier_recall)
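The reasoning behind that formula can be sketched in a few lines of Python (the function name is mine, chosen for illustration): precision tells you what fraction of the predicted positives are true positives, and recall tells you what fraction of all real positives those true positives represent.

```python
def estimate_true_positives(num_predicted_positive, precision, recall):
    # Precision = TP / (TP + FP), so TP = precision * num_predicted_positive
    true_positives = num_predicted_positive * precision
    # Recall = TP / num_real_positives, so num_real_positives = TP / recall
    return true_positives / recall

# e.g. 100 documents flagged positive, precision 0.8, recall 0.5
# -> 80 true positives found, which is half of the real total: 160.0
print(estimate_true_positives(100, 0.8, 0.5))
```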

My questions:

1- What are the pitfalls of such an approach? What am I missing?

2- Has this method been used credibly, and what research can I look at that might be relevant?

2 Answers


This approach might involve some circular logic, because precision depends on prevalence.

(Let's assume that you calculate the precision and recall using a separate data set that you didn't use to train on. Let's call that data the test set.)

The precision that you calculate depends on prevalence in your test set, so I think this approach depends on the prevalence in your test set being similar to the prevalence in this new dataset. But the prevalence in the new dataset is what you're trying to estimate.

Classifiers do not have precision as a static property; their precision depends on the prevalence in the data they are applied to. Say you are predicting a very common condition: when your model predicts positive, the probability that it is a true positive is higher. As prevalence increases, the PPV (precision) also increases.

So if you calculate precision on your test set and then apply that number to a different dataset, the precision may no longer hold.
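A small sketch makes the drift concrete, using the standard identity for PPV in terms of sensitivity, specificity, and prevalence (the numbers are illustrative, not from your data):

```python
def ppv(sensitivity, specificity, prevalence):
    # Positive predictive value (precision) via Bayes' rule
    true_pos_mass = sensitivity * prevalence              # P(predicted+, truly+)
    false_pos_mass = (1 - specificity) * (1 - prevalence) # P(predicted+, truly-)
    return true_pos_mass / (true_pos_mass + false_pos_mass)

# Identical classifier (sensitivity 0.9, specificity 0.9),
# very different precision depending on prevalence:
print(ppv(0.9, 0.9, 0.50))  # -> 0.9
print(ppv(0.9, 0.9, 0.01))  # -> ~0.083
```

So a precision measured on a balanced test set can be wildly optimistic on a rare-positive dataset, even though the classifier itself has not changed.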

Another issue:

The precision and recall that you calculated are really estimates, because they are computed from a sample (the test set). To get a precise estimate of the number of real positives, which is your goal, your estimates of precision and recall must themselves be precise. Make sure your test set is big enough; as a rough rule, I think you need a couple hundred examples in each class.
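As a rough sanity check on that sample-size advice, the normal-approximation standard error of an estimated proportion (which is what precision and recall are) shrinks with the square root of the sample size:

```python
import math

def proportion_se(p_hat, n):
    # Normal-approximation standard error of a proportion
    # estimated from n observations
    return math.sqrt(p_hat * (1 - p_hat) / n)

# With ~200 relevant examples, a precision estimate of 0.8
# carries a standard error of about 0.028
print(proportion_se(0.8, 200))
```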


Precision is a function of sensitivity/recall, prevalence, and specificity.

$$\text{Precision} =\dfrac{ \text{sensitivity}\times\text{prevalence} }{ \text{sensitivity}\times\text{prevalence} + \left[ \left( 1 - \text{specificity} \right)\times\left( 1 - \text{prevalence} \right) \right] } $$

Therefore, the algebra does not support a unique prevalence, given a precision and a recall (sensitivity). You also need the specificity.
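Given all three quantities, the identity above can be inverted for prevalence. A minimal sketch (the function name is mine; it simply rearranges the formula for precision algebraically):

```python
def prevalence_from_precision(precision, sensitivity, specificity):
    # Solve precision = se*p / (se*p + (1-sp)*(1-p)) for p:
    # p = precision*(1-sp) / (se*(1-precision) + precision*(1-sp))
    numerator = precision * (1 - specificity)
    denominator = sensitivity * (1 - precision) + precision * (1 - specificity)
    return numerator / denominator

# Round trip: sensitivity 0.9, specificity 0.9 at prevalence 0.5
# gives precision 0.9, and inverting recovers 0.5
print(prevalence_from_precision(0.9, 0.9, 0.9))  # -> 0.5
```

Without the specificity term, the same precision/recall pair is consistent with many different prevalences, which is the point above.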

Dave