
I have a finite set $X$ of unknown size. From it I independently draw a multiset of samples $S\subset X$. For each sample $s\in S$ I can give the exact probability $p(s)$ with which it is drawn from $X$, where $\sum_{x\in X}p(x)=1$, as well as how often it has been drawn. I also have a lower bound $l\le p(s)$ on the probability of each sample.

Given this information, how can I estimate the size of $X$?

A mark-and-recapture approach is not going to work because $|S| \ll |X|$ by several orders of magnitude: I have drawn about a billion samples so far without encountering any duplicates.

fuz
    Statement not clear (to me anyhow). Perhaps look at mark-recapture – BruceET Oct 08 '19 at 16:13
  • @BruceET The population $X$ is much larger than the sample size I can reasonably achieve. I have so far drawn a few billion samples without any duplicates, so I don't think mark-recapture is feasible. What part of the statement is unclear? – fuz Oct 08 '19 at 16:20
  • "For each sample $S∈s$ I can give the probability $p(S)$ it has to be drawn from $X$ such that ..." Maybe give examples from your work. How do you get the probabilities? Exact or estimated? – BruceET Oct 08 '19 at 16:28
  • @BruceET I can compute these probabilities exactly. I am working on the same problem this question is about (sampling vertices of a graph that have a given distance from a fixed vertex). Once I found a vertex with the right distance, I can exactly compute the probability with which it was found. – fuz Oct 08 '19 at 16:33
  • @BruceET Also, I'm sorry for the wrong spelling and grammar. I have fixed the post. – fuz Oct 08 '19 at 16:35
  • @BruceET Oh, a mark and recapture approach is likely not going to work because the base population is very large. I have so far not been able to get any duplicate samples at all. – fuz Oct 08 '19 at 16:49

1 Answer


DISCLAIMER: Simulation has shown that this is not a good estimator, but it might be helpful for further ideas.

I propose the following estimator: with sample space $S$ (the set the question calls $X$) and samples $s_1, \ldots, s_n$, we could use $\widehat{\#S}= \frac{n}{\sum_{i=1}^n p(s_i)}$ as an estimator for $\#S$.
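A minimal sketch of this estimator in Python (the function name and the use of NumPy are my own illustration; `sampled_probs` is assumed to hold the exactly known probabilities $p(s_i)$ of the drawn samples):

```python
import numpy as np

def proposed_size_estimate(sampled_probs):
    """Estimate #S as n / sum(p(s_i)), i.e. the reciprocal of the
    sample mean of the drawn probabilities."""
    sampled_probs = np.asarray(sampled_probs, dtype=float)
    return len(sampled_probs) / sampled_probs.sum()
```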

The theoretical idea is the following:

We have a random variable $X$ whose possible values are the probabilities $p_s:=p(s)$ and whose distribution satisfies $P(X=p_s)=p(s)$, i.e. the value the variable takes for a sample coincides with that sample's probability.

We then estimate the mean of this random variable, $E[X]=\sum_{s\in S}p(s)^2$, by the sample mean $\frac{1}{n}\sum_{i=1}^n p(s_i)$. In addition, we use the relation $\#S \cdot E[X]=1 \iff \#S = \frac{1}{E[X]}$, which is exact when the distribution is uniform (then $p(s)=1/\#S$ and $E[X]=\#S\cdot(1/\#S)^2=1/\#S$).

So we use an unbiased estimator for $E[X]$ and plug it into the formula $\#S=\frac{1}{E[X]}$ (which means the resulting estimator for $\#S$ is not necessarily unbiased anymore).

Maybe we could also estimate the distribution of $X$ in order to obtain a correction factor for our estimate that accounts for how much $1/E[X]$ deviates from $E[1/X]=\#S$.
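A small simulation sketch in the spirit of the disclaimer above (the population, its draw probabilities, and all names and parameters are hypothetical), comparing the plug-in estimate $1/\widehat{E[X]}$ with the sample analogue of $E[1/X]$:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population with non-uniform draw probabilities.
true_size = 10_000
weights = rng.exponential(size=true_size)
p = weights / weights.sum()                  # exact draw probabilities, summing to 1

n = 5_000
draws = rng.choice(true_size, size=n, p=p)   # independent draws
sampled_p = p[draws]                         # probabilities of the drawn samples

plug_in = n / sampled_p.sum()                # proposed estimator, 1 / (sample mean of p)
reciprocal_mean = np.mean(1.0 / sampled_p)   # sample analogue of E[1/X]

print(f"true size: {true_size}, plug-in: {plug_in:.0f}, mean of 1/p: {reciprocal_mean:.0f}")
# For non-uniform p the plug-in estimate tends to fall below the true size,
# while the mean of the reciprocals targets the true size directly.
```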

Sebastian
  • I do draw individual elements. $S\subset X$ is the set of samples I draw of which the probability $p(s)$ for each $s\in S$ is known. Here, $p(s)$ is the probability of drawing $s$ from $X$. – fuz Oct 08 '19 at 20:02
  • Ok then the last paragraph is unnecessary but the rest might still help you. I will simulate this procedure tomorrow to check how well it works and update my post. – Sebastian Oct 08 '19 at 20:07
  • I actually thought about something similar. We do know about the distribution of probabilities because we sampled them, so I guess there should be a way to make that exact. – fuz Oct 08 '19 at 20:32
  • Re your formula: although this statistic is an estimator, what properties does it have? Why should it be any good at all? I don't follow your remarks about "uniformly distributed," because if that assumption is true, then all you have to do is compute a single probability $p_i$ and recover the population size as $1/p_i.$ – whuber Oct 08 '19 at 21:09
  • As in the first line: "This is not a known method (at least to me) but just an idea that makes sense to me and might be worth exploring." – Sebastian Oct 08 '19 at 21:12
  • You were very close. Turns out this can be modeled as a size biased sample for which it holds that the harmonic mean of the biased sample equals the arithmetic mean of an unbiased sample over the same space. The quantity we sample is the probability and the average probability is the reciprocal of the size of the sample space. Thus, we can compute the sample space size by taking the reciprocal of the harmonic mean of probabilities as sampled. – fuz Oct 24 '19 at 15:52
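A minimal sketch of the harmonic-mean estimator described in this last comment (the function name is illustrative; `sampled_probs` is again assumed to hold the exact probabilities of the drawn samples):

```python
import numpy as np

def size_from_harmonic_mean(sampled_probs):
    """Reciprocal of the harmonic mean of the sampled probabilities.

    Under size-biased sampling, E[1/p(s)] equals the population size,
    so the mean of the reciprocals (= reciprocal of the harmonic mean)
    is an unbiased estimator of |X|.
    """
    sampled_probs = np.asarray(sampled_probs, dtype=float)
    return float(np.mean(1.0 / sampled_probs))
```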