I am working on an information retrieval task where the goal is to estimate the total number of positive documents in a set, rather than to decide which individual documents are positive or negative.
My approach so far has been to build a logistic regression classifier, use it to label each document as "positive" or "negative", and then count the documents labeled positive. But it dawned on me that I don't actually need a great classifier, because I should be able to calculate the real number of positive documents in a set if I know the classifier's precision and recall:
num_real_positives = num_classified_as_positive * (classifier_precision/classifier_recall)
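The reasoning behind the formula: multiplying the number of documents classified as positive by precision gives the number of true positives, and dividing the true positives by recall gives the number of actual positives. A minimal sketch of that arithmetic follows. The function name and the counts are made up for illustration, and it assumes the precision and recall measured on labeled data also hold on the set being counted.

```python
def estimate_real_positives(num_classified_as_positive, precision, recall):
    """Estimate the number of actual positives from the predicted-positive count.

    Hypothetical helper: assumes precision and recall were measured elsewhere
    and still apply to the documents being counted.
    """
    if recall <= 0:
        raise ValueError("recall must be greater than zero")
    # predicted positives * precision = true positives
    true_positives = num_classified_as_positive * precision
    # true positives / recall = actual positives
    return true_positives / recall


# Made-up confusion counts, used only to check the identity.
tp, fp, fn = 80, 20, 40
precision = tp / (tp + fp)   # 80 / 100
recall = tp / (tp + fn)      # 80 / 120
estimate = estimate_real_positives(tp + fp, precision, recall)
print(estimate)  # approximately 120, i.e. tp + fn
```

With these counts the estimate recovers tp + fn exactly (up to floating-point rounding), because precision and recall were computed on the same documents being counted. On a new set the estimate is only as good as the assumption that both rates stay the same.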
My questions:
1- What are the pitfalls of such an approach? What am I missing?
2- Has this method been used credibly, and what research can I look at that might be relevant?