
I'm a lowly physicist, so I hope you will forgive me if I botch some terminology and notation here.

I perform an experiment where on each trial I start with some number of particles $N$ that is approximately Poisson distributed, say $N \sim {\rm Pois}(\overline{N})$. I then perform some action that bins these particles into two groups, $a$ and $b$, such that $N_a \sim B(N,p)$ and $N_b = N-N_a$. The value of $p$ is unknown to me. I know how to calculate ${\rm Var}[N_a]$, ${\rm Var}[N_b]$, and ${\rm Cov}[N_a,N_b]$ using the laws of total variance and covariance. I find $${\rm Var}[N_a] = \overline{N} \, p(1-p) + p^2 {\rm Var}[N],$$ $${\rm Var}[N_b] = \overline{N} \, p(1-p) + (1-p)^2 {\rm Var}[N],$$ $$ {\rm Cov}[N_a,N_b] = -\overline{N} \, p(1-p) + p(1-p) {\rm Var}[N]. $$
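For what it's worth, these expressions agree with a quick Monte Carlo check. Here is a sketch in Python; the values $\overline{N} = 1000$ and $p = 0.3$ are arbitrary choices for illustration, not numbers from my experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
Nbar, p, trials = 1000, 0.3, 200_000   # arbitrary illustration values
varN = Nbar                            # Var[N] = Nbar for a Poisson N

N = rng.poisson(Nbar, size=trials)     # particles created on each trial
Na = rng.binomial(N, p)                # binned into group a with probability p
Nb = N - Na                            # the rest fall into group b

print(Na.var(), Nbar * p * (1 - p) + p**2 * varN)                      # Var[N_a]
print(Nb.var(), Nbar * p * (1 - p) + (1 - p)**2 * varN)                # Var[N_b]
print(np.cov(Na, Nb)[0, 1], -Nbar * p * (1 - p) + p * (1 - p) * varN)  # Cov[N_a, N_b]
```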

Based on measurements of $N_a$ and $N_b$, I would like to estimate the quantity $Y=2p-1$. I would like to consider two scenarios: one in which I can measure both $N_a$ and $N_b$ in a single trial (i.e., for a single value of $N$), and another in which I cannot. My question pertains to the choice of estimator for $Y$ in the second scenario.

  1. If I can measure both $N_a$ and $N_b$ in a single trial, I can estimate $Y$ by $$ \hat{Y} = \frac{N_a-N_b}{N_a+N_b}, $$ and approximating the variance of $\hat{Y}$ by a series expansion, I find $$ {\rm Var}[\hat{Y}] \approx 4 \frac{N_b^2 {\rm Var}[N_a] + N_a^2 {\rm Var}[N_b] - 2 N_a N_b {\rm Cov}[N_a,N_b]}{(N_a+N_b)^4}, $$ which, evaluating at the expected values of $N_a$ and $N_b$, gives $$ {\rm Var}[\hat{Y}] \approx \frac{4p(1-p)}{\overline{N}}. $$ As I understand from this SE post, this is the expected variance for the MLE of $Y$. This estimator of $Y$ normalizes out fluctuations in $N$; nowhere does ${\rm Var}[N]$ appear in this expression. This is desirable, from my point of view.

  2. What if I can only measure one of $N_a$ or $N_b$ in a single trial of my experiment? I can imagine several ways of estimating $Y$. For example, I could simply use the same estimator as above, but if $N_a$ and $N_b$ do not derive from the same sample of $N$, they are uncorrelated and ${\rm Var}[N]$ no longer drops out of ${\rm Var}[\hat{Y}]$. One reasonable possibility is $$ \hat{Y}_0 = \frac{N_a - N_b}{\overline{N}}, $$ where I estimate $\overline{N} = \overline{N_a}+\overline{N_b}$ from sample means over several trials. I can also dream up something goofy like $$ \hat{Y}_1 = \frac{N_a}{N_a + \overline{N_b}} - \frac{N_b}{\overline{N_a}+N_b}, $$ which, based on some MC simulations I've performed (a sketch follows this list), seems to be biased, though it's not clear to me why.
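Here is roughly the kind of MC simulation I mean (a sketch in Python; $\overline{N} = 1000$, $p = 0.3$, and the trial count are placeholder values, not my actual experimental parameters). It compares the empirical variance in scenario 1 against the predicted $4p(1-p)/\overline{N}$, and the bias and variance of the scenario-2 estimators side by side:

```python
import numpy as np

rng = np.random.default_rng(1)
Nbar, p, trials = 1000, 0.3, 200_000   # placeholder values
Y = 2 * p - 1                          # the quantity to estimate

# Scenario 1: N_a and N_b measured on the same trial (same N)
N = rng.poisson(Nbar, size=trials)
Na = rng.binomial(N, p)
Nb = N - Na
Yhat = (Na - Nb) / (Na + Nb)
print("scenario 1: bias", Yhat.mean() - Y,
      "var", Yhat.var(), "predicted var", 4 * p * (1 - p) / Nbar)

# Scenario 2: N_a and N_b measured on separate, independent trials
Na = rng.binomial(rng.poisson(Nbar, size=trials), p)
Nb = rng.binomial(rng.poisson(Nbar, size=trials), 1 - p)
Nbar_hat = Na.mean() + Nb.mean()       # sample estimate of N-bar

Y0 = (Na - Nb) / Nbar_hat
Y1 = Na / (Na + Nb.mean()) - Nb / (Na.mean() + Nb)
for name, est in [("Y_0", Y0), ("Y_1", Y1)]:
    print("scenario 2:", name, "bias", est.mean() - Y, "var", est.var())
```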

My question is, in scenario 2, which estimator for $Y$ should I choose, and why? I would like to estimate $Y$ with small bias and with the minimum variance possible.

Edit: I do not know $\overline{N}$ ($=\lambda$); I can only measure $N_a$ and $N_b$. If my best estimator for $p$ is $N_a/\lambda$, how should I estimate $\lambda$?

Will C
  • How many trials will you perform? Just one? Under scenario 2, could you get the sum of N if multiple trials are performed? – user158565 Jun 08 '19 at 02:30
  • I can perform many trials, and in scenario 2 I typically measure N_a on one trial, and N_b on the next. I do some large number of trials, and I suppose my best estimate for lambda is N_a + N_b, averaged over trials. – Will C Jun 08 '19 at 03:23
  • Google 'splitting' and 'thinning' Poisson processes as here. – BruceET Jun 08 '19 at 11:52

2 Answers


Suppose $N \sim \mathsf{Pois}(\lambda).$ (I'm using $\bar N = \lambda$.) According to your description it seems you observe $N_a \sim \mathsf{Pois}(p\lambda).$ Then $E(N_a) = p\lambda,$ so $\hat p = N_a/\lambda$ is the natural estimator of $p.$

One reasonable 95% confidence interval for $p\lambda$ is $$N_a + 2 \pm 1.96\sqrt{N_a + 1}.$$ Dividing its endpoints by $\lambda$ gives a 95% CI for $p.$

The CI for $p\lambda$ is based on a normal approximation, which works reasonably well if $N_a \ge 50.$

For example if $\lambda = 500$ and $N_a = 288$ then the CI for $p\lambda$ is $$290 \pm 1.96\sqrt{289}.$$ Dividing by $500$ we get $(0.513, 0.647)$ as a 95% CI for $p$ and point estimate $\hat p \approx 0.58.$

If events are rare so that you have much smaller $N_a,$ then you need to look into a CI for $p\lambda$ that does not use a normal approximation.
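A minimal numerical sketch of this recipe (Python; the function name is mine, and it treats $\lambda$ as known, as in the example above):

```python
import numpy as np

def poisson_ci_for_p(n_a, lam, z=1.96):
    """Approximate CI for p from a single count N_a ~ Pois(p*lambda):
    use N_a + 2 +/- z*sqrt(N_a + 1) as a CI for p*lambda, then divide
    by lambda (taken as known here)."""
    center, half = n_a + 2, z * np.sqrt(n_a + 1)
    return (center - half) / lam, (center + half) / lam

print(poisson_ci_for_p(288, 500))   # approx. (0.513, 0.647); point estimate 288/500 = 0.576
```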

BruceET
  • I'm not sure I understand the question, but I'll try to answer: I create an unknown number of particles N on each trial. I then "turn them" into a-type with probability p. All that are left are b-type. The problem is that I can't measure both N_a and N_b on a given trial. – Will C Jun 08 '19 at 03:25
  • The typical N, although I can't measure it directly, is around 1000. The probability p actually varies as a function of other parameters anywhere between 0 and 1. – Will C Jun 08 '19 at 03:41
  • Thanks for additional info. I simulated several models, all with results consistent with my Answer. I'm taking $\lambda = \bar N$ as a constant, $N$ and $N_a$ as random variables. Will look at it again tomorrow. – BruceET Jun 08 '19 at 04:25

Scenario 2:

Suppose you perform $n = n_1 + n_2$ trials and observe $N_a$ in $n_1$ trials and $N_b$ in the other $n_2$ trials. No trial has both $N_a$ and $N_b$. Trials are independent.

Let $N_A = \sum N_a$ and $N_B=\sum N_b$ be the totals over those trials. Then $N_A \sim \text{Poisson}(n_1 p\bar N)$ and $N_B \sim \text{Poisson}(n_2(1-p)\bar N)$.

Then you can write the joint log-likelihood of $N_A$ and $N_B$, and get the MLE $\hat p$ of $p$ and its CI. It is straightforward to get $\hat Y$ and its CI, because $Y$ is a linear function of $p$.
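For concreteness, here is one way the maximization works out (my sketch, treating $\bar N$ as unknown and estimated jointly with $p$): reparametrize to $\mu_A = n_1 p \bar N$ and $\mu_B = n_2(1-p)\bar N$, whose Poisson MLEs are $\hat\mu_A = N_A$ and $\hat\mu_B = N_B$; mapping back gives $$\hat{\bar N} = \frac{N_A}{n_1} + \frac{N_B}{n_2}, \qquad \hat p = \frac{N_A/n_1}{N_A/n_1 + N_B/n_2}, \qquad \hat Y = 2\hat p - 1 = \frac{N_A/n_1 - N_B/n_2}{N_A/n_1 + N_B/n_2},$$ which for $n_1 = n_2$ reduces to $\hat Y = (N_A - N_B)/(N_A + N_B)$, i.e. the scenario-1 estimator applied to the pooled counts.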

user158565