Ultimately this is not a precision/recall calculation question but rather a sample labelling/data quality one. To that extent, there are no silver bullets: if we are unwilling to invest the time, we cannot expect to get the appropriate answer. I would suggest starting with a literature review to construct a loose prior on the proportion $p$. Then we go ahead with random sampling and update the resulting posterior as labelled data accumulate.
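For concreteness, here is a minimal sketch of that update using a conjugate Beta-Binomial model; the prior parameters and the sample counts are invented placeholders, not values from any actual study:

```python
import numpy as np
from scipy import stats

# Hypothetical prior from a literature review suggesting p is roughly 2%,
# encoded as a fairly weak Beta(2, 98) prior (these numbers are made up).
alpha_prior, beta_prior = 2.0, 98.0

# Results of hand-labelling a simple random sample (also made-up numbers).
n_labelled = 500
n_positive = 12

# Conjugate update: the posterior is Beta(alpha + k, beta + n - k).
posterior = stats.beta(alpha_prior + n_positive,
                       beta_prior + (n_labelled - n_positive))

print(f"Posterior mean of p: {posterior.mean():.4f}")
lo, hi = posterior.interval(0.95)
print(f"95% credible interval: ({lo:.4f}, {hi:.4f})")
```

Each new labelled batch just adds its counts to the Beta parameters, so the posterior tightens as the sampling effort continues.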
As already mentioned, precision and recall might not be the best metrics to employ. In an imbalanced-learning setting, assigning relevant misclassification costs can be very helpful; given well-calibrated probabilistic estimates, those costs directly determine the threshold for the actual label assignment.
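As a small sketch of that (the costs below are invented placeholders): with zero cost for correct decisions, minimising expected misclassification cost yields a closed-form threshold.

```python
import numpy as np

# Made-up misclassification costs, purely for illustration.
cost_fp = 1.0   # cost of acting on a false positive
cost_fn = 20.0  # cost of missing a true positive

# With zero cost for correct decisions, expected cost is minimised by
# labelling positive when p_hat * cost_fn > (1 - p_hat) * cost_fp,
# which rearranges to the threshold below.
threshold = cost_fp / (cost_fp + cost_fn)  # = 1/21, roughly 0.048 here

# Placeholder calibrated probability estimates from some classifier.
p_hat = np.array([0.01, 0.03, 0.06, 0.40, 0.90])
labels = (p_hat >= threshold).astype(int)
print(threshold, labels)  # ~0.0476, [0 0 1 1 1]
```

Note how asymmetric costs push the threshold far below 0.5, which is usually what you want when positives are rare and expensive to miss.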
Regarding the "smart sampling methods" commentary:
While counter-intuitive, my suggestion would be to avoid them for now. Stick with simple random sampling (SRS) in order to have an unbiased baseline. It is tempting to start thinking about active learning or other domain-adaptation methods applied to sample-bias correction. These methods can be great if we are indeed trying to find positives and/or to distinguish positive from negative examples (in short, we would sample near our classifier's decision boundary). They would nevertheless give us a biased view of the true base rate, with no way of de-biasing it unless we already have access to some unbiased dataset; the simulation sketch below illustrates the problem. That is not "ML's fault": even the more classical statistical methods for de-biasing convenience samples/observational data (e.g. importance sampling or post-stratification) need a known population/reference distribution to work with. If that is unavailable, their applicability is moot. Therefore, start with SRS (or some relatively straightforward sampling strategy) to build that reference distribution, and only then do something fancier.
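Here is a small self-contained simulation of that bias (every number in it is invented): with a true base rate of 3%, the SRS estimate lands near 3%, while sampling the points whose calibrated posterior is closest to 0.5, as a boundary-focused strategy would, comes out near 50% positives.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated population with an assumed true base rate of 3%.
N = 100_000
p_true = 0.03
y = rng.binomial(1, p_true, size=N)
# Toy classifier score: Gaussian, higher on average for positives.
score = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)

# Exact posterior P(y=1 | score) for this toy generative model; it stands
# in for a well-calibrated classifier's probability output.
log_odds = 2.0 * score + np.log(p_true / (1.0 - p_true))
posterior = 1.0 / (1.0 + np.exp(-log_odds))

n = 500

# 1) Simple random sample: unbiased for the base rate.
srs = rng.choice(N, size=n, replace=False)
print("true p:", p_true, " SRS estimate:", y[srs].mean())

# 2) Sampling nearest the decision boundary (posterior ~ 0.5), as active
#    learning would: informative for the classifier, badly biased for p.
boundary = np.argsort(np.abs(posterior - 0.5))[:n]
print("decision-boundary estimate:", y[boundary].mean())
```

The boundary sample over-represents positives by more than an order of magnitude in this toy setup, and without knowing the sampling mechanism's relation to the population there is nothing to reweight it back with.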