I have a collection of N samples drawn from a population by the following process:
- Draw M samples from the population, where each sample is a non-negative real number
- Drop all but the N samples with the highest values, where N << M.
Note that N can be quite large (up to a few thousand), while M is even larger (on the order of millions). I don't have access to the samples drawn in step (1), just the samples in step (2), but I can estimate the value of N/M (or compute it exactly, but that requires more computational effort). Assuming the population follows a suitable distribution such as a truncated Normal distribution or an exponential distribution, I'd like to estimate the parameters of the distribution from this biased sample. What are some techniques to do this?
This question is not a self-study or homework question, but is part of a problem I need to address for work. I've omitted the details of the problem to avoid giving away proprietary information about our internal processes and products. The process that produces the N samples takes an input X, effectively identifies a set of M candidates and then computes non-negative real-valued scores for the subset of N items with the highest scores without computing the values for all M candidates (the structure of the problem and the scoring mechanism allow for us to progressively prune sets of items as not possibly belonging to the top N without having to score them fully). Both the set of M candidates and the score for each candidate depends upon the input X. Computing scores for all M items is possible, but (very) computationally expensive, and the downstream processes that consume the output of this process only need the highest-scoring N items. Besides allowing us to calibrate scores so they are comparable across different inputs in a principled manner, understanding the relationship between X and the parameters of the score distribution will be useful as well.
I've tried searching for an answer to this question on Google, but I haven't been able to find the right search terms, so I decided to ask here.
(1) I can estimate N/M or compute it exactly, although the latter is more computationally expensive than the former. (2) The scores for each item are non-negative real numbers (I thought I wrote that, but it got cut off). 'Largest values' means 'highest values.' (3) This question is for work. I left out details to prevent disclosing proprietary information.
– Tom Ault Jul 12 '19 at 15:24