Compute the parameters of a distribution from samples drawn from the tail of the distribution

Question

I have a collection of N samples drawn from a population by the following process:

Draw M samples from the population, where each sample is a non-negative real number
Drop all but the N samples with the highest values, where N << M.

Note that N can be quite large (up to a few thousand), while M is even larger (on the order of millions). I don't have access to the samples drawn in step (1), just the samples in step (2), but I can estimate the value of N/M (or compute it exactly, but that requires more computational effort). Assuming the population follows a suitable distribution such as a truncated Normal distribution or an exponential distribution, I'd like to estimate the parameters of the distribution from this biased sample. What are some techniques to do this?

This question is not a self-study or homework question, but is part of a problem I need to address for work. I've omitted the details of the problem to avoid giving away proprietary information about our internal processes and products. The process that produces the N samples takes an input X, effectively identifies a set of M candidates and then computes non-negative real-valued scores for the subset of N items with the highest scores without computing the values for all M candidates (the structure of the problem and the scoring mechanism allow for us to progressively prune sets of items as not possibly belonging to the top N without having to score them fully). Both the set of M candidates and the score for each candidate depends upon the input X. Computing scores for all M items is possible, but (very) computationally expensive, and the downstream processes that consume the output of this process only need the highest-scoring N items. Besides allowing us to calibrate scores so they are comparable across different inputs in a principled manner, understanding the relationship between X and the parameters of the score distribution will be useful as well.

I've tried searching for an answer to this question on Google, but I haven't been able to find the right search terms, so I decided to ask here.

How precisely do you know the ratio of N to M? It'll be tricky if you don't know if your data represents the top 1% of a normal distribution, or the top 0.1%, or the top 0.01%. — Nuclear Hoagie, Jul 11 '19 at 20:48
How is 'with the largest values' defined? Largest maxima? Based on some quantile? Something else? — Glen_b, Jul 12 '19 at 04:01
Is this for a school assignment? If so, please add the self-study tag and show what you have done so far. — Peter Flom, Jul 12 '19 at 11:47
I edited the question to answer the comments above, but I will summarize those edits here as well. I tried to summarize them in the box provided in the edit form, but I was not familiar with the interface and hit enter too early.
(1) I can estimate N/M or compute it exactly, although the latter is more computationally expensive than the former. (2) The scores for each item are non-negative real numbers (I thought I wrote that, but it got cut off). 'Largest values' means 'highest values.' (3) This question is for work. I left out details to prevent disclosing proprietary information. — Tom Ault, Jul 12 '19 at 15:24

Compute the parameters of a distribution from samples drawn from the tail of the distribution

0 Answers0