
I am given a bag containing marbles of two colors, with an unknown total number of marbles $N$. I randomly sample $n$ marbles ($n=n_1+n_2 < N$, where $n_1$ and $n_2$ are the numbers of marbles of the two colors) from the bag, without replacement. The likelihood of the true numbers of the two colors of marbles in the bag, $N_1$ and $N_2$, is $p(N_1,N_2|n_1,n_2)=\frac{\binom{N_1}{n_1} \binom{N_2}{n_2}}{\binom{N_1+N_2}{n_1+n_2}}$. If I make many repeated samplings from the same bag (i.e. return all marbles and draw randomly again, potentially with different $n$), is it possible to accurately estimate $N_1$ and $N_2$?

I have tried maximum likelihood by optimizing a continuous form of the log-likelihood (replacing the factorials with gamma functions), but I have found that the final estimate depends on the initial estimate, instead of always converging to the correct values. Are there any better ways to approach this problem or is there a fundamental limitation?
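
For concreteness, here is a minimal sketch of what I tried (rewritten in Python; the sample data, starting points, and optimizer choice are illustrative, not my exact code):

```python
# Sketch of the continuous log-likelihood approach: factorials replaced by
# gamma functions via scipy.special.gammaln, maximized from several starts.
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

def log_binom(N, k):
    """log C(N, k), extended to real N via the gamma function."""
    return gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)

def neg_log_lik(params, samples):
    """Negative log-likelihood of (N1, N2) over repeated draws (n1, n2)."""
    N1, N2 = params
    return -sum(log_binom(N1, a) + log_binom(N2, b)
                - log_binom(N1 + N2, a + b) for a, b in samples)

samples = [(3, 3), (2, 4), (4, 2), (3, 3)]   # invented draws
lb1 = max(a for a, _ in samples)             # N1 must be at least max n1
lb2 = max(b for _, b in samples)             # N2 must be at least max n2
for start in [(10.0, 10.0), (100.0, 100.0), (1000.0, 1000.0)]:
    res = minimize(neg_log_lik, start, args=(samples,),
                   bounds=[(lb1, None), (lb2, None)])
    print(start, "->", res.x)   # the estimate drifts with the starting point
```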

jpg0101
  • How are you defining accurately? Estimates are possible, certainly, but I don't know what your requirements to count as accurate might be, or even how you're measuring it. It may be hard to estimate $N=N_1+N_2$ well depending on the specific situation, but $p=N_1/N$ may be easier; even then, relative and absolute error behave very differently with small $p$. – Glen_b Jul 08 '23 at 05:06
  • There is no clear difficulty in handling the likelihood as a function of two integer variables. – Xi'an Jul 08 '23 at 14:29
  • I guess before considering accuracy I need to understand if there's a unique solution. My simulations converge to different values depending on the initial condition, which suggests that it's only possible to estimate $N_1/N$ (multinomial). But it may also be an issue with my code. I've only found literature discussing maximum likelihood for hypergeometric when N is known, but that might be because it results in a closed form solution. – jpg0101 Jul 08 '23 at 18:23
  • This has the flavour of https://stats.stackexchange.com/questions/123367/estimating-parameters-for-a-binomial/123748#123748, so read that to see if you can adapt? – kjetil b halvorsen Jul 10 '23 at 05:21

2 Answers


If you have only one sample, then the maximum likelihood estimator would be $N_1=n_1$ and $N_2 = n_2$ - this may not make much sense, but you do have more than one sample.

Unfortunately, for more than one sample, a maximum likelihood estimator does not necessarily exist; see the counterexample below. So yes, there is a fundamental limitation, and your problems with convergence may be related to the non-existence of an ML estimate.

As @Glen_b commented, other quantities like $N_1/N$ are easier to estimate. That is because the sampling distribution of your pairs $(n_1,n_2)$ depends mainly on the proportions $N_1/N$ and $N_2/N$ of marbles of colors 1 and 2, and only weakly on $N$, at least if $N$ is considerably larger than $n$.
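
To illustrate, here is a small simulation sketch (the bag composition, sample size, and number of draws below are invented):

```python
# Repeated draws without replacement from a fixed bag: the pooled fraction
# of color-1 marbles estimates N1/N well, even though N itself is not
# well identified.
import numpy as np

rng = np.random.default_rng(0)
N1, N2 = 600, 400                # assumed true counts, unknown to the analyst
n = 50                           # marbles drawn per sample
draws = [rng.hypergeometric(N1, N2, n) for _ in range(200)]

p_hat = sum(draws) / (n * len(draws))
print("pooled estimate of N1/N:", p_hat)   # should be close to 0.6
```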

Counterexample to the existence of an ML estimator

Sample twice with sample size $n$, and obtain $n_1=0$ in the first sample and $n_1=n$ in the second sample. The likelihood is then $$ L(N_1, N_2) = p(N_1, N_2\mid 0, n)\cdot p(N_1, N_2\mid n, 0) =\frac{\binom{N_1}{0}\binom{N_2}{n}\binom{N_1}{n}\binom{N_2}{0}} {\binom{N_1+N_2}{n}^2} $$ If $N$ is even, this likelihood is maximized for $N_1=N_2 = N/2$, otherwise for $\{N_1,N_2\}=\{\lfloor N/2\rfloor,\lfloor N/2\rfloor+1\}$. For simplicity, write $N_1=N_2=M$. Then $$ \lim_{M\to\infty}L(M, M) = \lim_{M\to\infty}\left(\binom{M}{n}\bigg/\binom{2M}{n}\right)^2 = \frac{1}{2^{2n}}, $$ where the limit is approached from below. Therefore, there exists no finite $M$ for which $L(M, M)$ is maximal. In the same way, you can show that there is no finite $M$ for which $L(M, M+1)$ is maximal. Therefore, no maximizer of $L(N_1, N_2)$ exists, neither with even sum $N_1+N_2$ nor with odd sum $N_1+N_2$.
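
You can verify this numerically; a short sketch (with an arbitrarily chosen $n$):

```python
# L(M, M) for the two-sample counterexample: increasing in M, with
# supremum 1/2^(2n) that is never attained by any finite M.
from scipy.special import comb

n = 5
for M in [10, 100, 1000, 10000]:
    L = (comb(M, n, exact=True) / comb(2 * M, n, exact=True)) ** 2
    print(M, L)
print("limit 1/2^(2n):", 1 / 2 ** (2 * n))
```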

Ute

Just to echo @Ute's answer: the maximum likelihood estimator can fail to exist even when there are multiple samples and none of the values of $n_1$ and $n_2$ are zero.

Suppose we take 5 samples of size 6 and end up with the following likelihood:

$$ L(N_1,N_2)=\frac{\binom{N_1}{3}\binom{N_2}{3} \times\binom{N_1}{3}\binom{N_2}{3} \times\binom{N_1}{2}\binom{N_2}{4} \times\binom{N_1}{1}\binom{N_2}{5} \times\binom{N_1}{5}\binom{N_2}{1}} {\binom{N_1+N_2}{6}^5} $$

The log of the likelihood simplifies a bit to

$$-5 \log ((N_1+N_2-5) (N_1+N_2-4) (N_1+N_2-3) (N_1+N_2-2) (N_1+N_2-1) (N_1+N_2))+\\\log (N_1-4)+\log (N_1-3)+3 \log (N_1-2)+4 \log (N_1-1)+5 \log (N_1)+\log (N_2-4)+\\2 \log (N_2-3)+4 \log (N_2-2)+4 \log (N_2-1)+5 \log (N_2)+\log (216000)$$

Maximum likelihood estimates (likely) don't exist here; as evidence, with Mathematica's FindMaximum function I get 283,525 and 324,028, respectively, for $N_1$ and $N_2$. Also, a contour plot of the log of the likelihood (ignoring the discreteness of the situation) shows the likelihood still increasing past those estimates:

[Contour plot of the log of the likelihood]
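
The same behavior can be checked outside Mathematica; here is a Python sketch (not the code used above) that evaluates this log-likelihood along the ray $N_1 : N_2 = 14 : 16$, the pooled proportion from the five samples:

```python
# Evaluate the five-sample log-likelihood along a ray of constant
# N1/(N1+N2); it appears to keep growing toward its binomial limit.
from scipy.special import gammaln

samples = [(3, 3), (3, 3), (2, 4), (1, 5), (5, 1)]   # the five draws above

def log_binom(N, k):
    return gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)

def log_lik(N1, N2):
    return sum(log_binom(N1, a) + log_binom(N2, b)
               - log_binom(N1 + N2, a + b) for a, b in samples)

for scale in [1, 10, 100, 1000, 10000]:
    print(30 * scale, log_lik(14 * scale, 16 * scale))
```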

If you are a Bayesian with reasonable priors for $N_1$ and $N_2$, I'm sure you'll do much better.
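
As one illustration of that (all modeling choices here are assumptions, not a recommendation): a brute-force posterior under a discrete uniform prior on each count, up to an arbitrary cap. Note that the cap matters a great deal, since the likelihood keeps increasing in $N$:

```python
# Brute-force Bayesian posterior over (N1, N2) on a finite grid with a
# discrete uniform prior; the prior cap Nmax regularizes the flat ridge.
import numpy as np
from scipy.special import gammaln

samples = [(3, 3), (3, 3), (2, 4), (1, 5), (5, 1)]

def log_binom(N, k):
    return gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)

Nmax = 200                                   # arbitrary prior cap per color
grid = np.arange(5, Nmax + 1)                # each count is at least max(n_i)
N1g, N2g = np.meshgrid(grid, grid, indexing="ij")
logpost = sum(log_binom(N1g, a) + log_binom(N2g, b)
              - log_binom(N1g + N2g, a + b) for a, b in samples)
post = np.exp(logpost - logpost.max())
post /= post.sum()
print("posterior mean of N1:", (post * N1g).sum())
print("posterior mean of N2:", (post * N2g).sum())
```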

JimB
  • Yes, a point for Bayesians and a nice graph. I also thought of suggesting a Bayes analysis, but what would happen if you use an uninformative prior? After all, Bayesian analysis is also likelihood-based. And yes, you can find hundreds of such examples; I was just after a very simple one :-) – Ute Jul 12 '23 at 19:58
  • @Ute Good. Yes, I assumed you were just going after a more simply understood example to make the point. My attempt was just to show that mle's not existing is not due to some extreme sample: it happens frequently. As suggested by Glen_b, it appears that, reparameterizing with $p=N_1/N$ and $N=N_1+N_2$, the mle for $N$ won't necessarily exist but the mle for $p$ probably exists much of the time. In fact, for equal sample sizes, it's probably the sum of the $n_1$ values divided by the product of the sample size and the number of repeated samples. – JimB Jul 12 '23 at 20:14
  • Yes, the mle for $p$ exists. But the OP wanted to know the total number - this is ill-posed, unfortunately. Another thing is a capture-recapture situation, where $N$ can be retrieved because $N_1$ is known. – Ute Jul 12 '23 at 20:17
  • @Ute Or maybe weighing the bag and marbles? (In the unlikely event that this actually about marbles). – JimB Jul 12 '23 at 21:59
  • Brilliant idea, JimB! Using extra information. This can be transferred to other, more real things like fish, birds, butterflies, money, or aliens ;-) – Ute Jul 12 '23 at 22:29
  • I also visualized the log-likelihood, and confirmed that its maximum appears to lie along a diagonal line with slope $p$. However, the continuous form of the log-likelihood is a sum of $\log \Gamma (x)$ functions, which are convex, so I'm surprised there is no unique solution. – jpg0101 Jul 17 '23 at 18:01