
I have 30 one-hour sessions; in each, one randomly chosen person searches for a specific type of content on a social media website. I record the number of pieces found in each session, resulting in 30 data points. From these I can calculate the mean, but how do I calculate a confidence interval for that mean?

During a session, the individual can perform as many searches or look at as many pieces of content as they wish. We do not track the number of searches or the number of pieces of content checked; we only record the count of pieces found that meet our request.

My thought is that this is a Poisson process and we can assume a Poisson distribution. However, our mean doesn't always equal our variance. Here are my questions:

  • If a Poisson distribution is OK to use, is it correct that my 95% confidence interval would be λ ± 1.96*sqrt(λ/30) (30 being the number of sessions)?
  • What should I do if not all conditions are met for a Poisson distribution (mean ≠ variance)?
  • Can we apply a standard normal distribution to this kind of problem?
  • Is 30 sessions statistically significant?
Cassova
  • (a) About a CI for Poisson $\lambda$: If you invert the approximate normal test for $n\lambda$ based on the total $T\sim \mathsf{Pois}(n\lambda)$ of $n$ observations, you get $T+2 \pm 1.96\sqrt{T+1}$ (conflating 1.96 with 2), which seems to come somewhat closer to 95% coverage than the Wald-style $T \pm 1.96\sqrt{T}$. This is somewhat analogous to the Agresti CI for binomial $p$. These can be converted to CIs for $\lambda$ by dividing the endpoints by $n$. (b) More generally: Have you considered dropping the Poisson assumption and using a bootstrap CI? – BruceET Oct 28 '21 at 00:38
  • See this Q&A for coverage probabilities of the two styles of CI in comment (a). – BruceET Oct 28 '21 at 00:54
  • I guess you meant the mean and the variance are identical in a Poisson distribution (not the standard deviation). – Pitouille Oct 28 '21 at 11:25
  • yes. thank you. corrected. – Cassova Oct 29 '21 at 06:28

1 Answer


Here are examples, using R, of the three kinds of confidence intervals mentioned in my comment (a), based on 30 observations from $\mathsf{Pois}(\lambda=50).$ [The nonparametric bootstrap CI uses the Poisson data, but without 'assuming' they are Poisson.]

set.seed(2021)
x = rpois(30, 50)
summary(x); sd(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34.00   44.25   50.50   50.37   58.50   62.00 
[1] 8.405595  # sample SD; pop SD is 7.0711

a = mean(x);  a
[1] 50.36667
t = sum(x);  t
[1] 1511

stripchart(x, meth="stack", pch=20)
abline(v = a, col="red")

[Figure: stacked strip chart of the 30 observations, with the sample mean marked by a red vertical line.]

Wald and 'Agresti-style' 95% confidence intervals for $\lambda$. [Start with intervals for $n\lambda$, then divide by $n$.]

CI.1 = (t + qnorm(c(.025,.975))*sqrt(t))/30;  CI.1
[1] 47.82710 52.90623  # Wald CI

CI.2 = (t+2 + qnorm(c(.025,.975))*sqrt(t+1))/30;  CI.2
[1] 47.89293 52.97374  # CI from inverting test
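
For reference, here is a sketch of where the $T+2$ and $T+1$ in the second interval come from. It assumes only the usual normal approximation for the total, writing $\mu = n\lambda$ and $z$ for the normal quantile (symbols introduced here for illustration). The test accepts $\mu$ when $|T-\mu|/\sqrt{\mu} \le z$, and solving the resulting quadratic in $\mu$ gives the endpoints:

$$
\begin{aligned}
(T-\mu)^2 \le z^2\mu
&\iff \mu^2 - (2T+z^2)\,\mu + T^2 \le 0 \\
&\iff \mu \in T + \frac{z^2}{2} \pm z\sqrt{T + \frac{z^2}{4}} .
\end{aligned}
$$

Taking $z \approx 2$ inside the expression gives $z^2/2 \approx 2$ and $z^2/4 \approx 1$, hence $T + 2 \pm 1.96\sqrt{T+1}$ for $n\lambda$, and dividing the endpoints by $n$ gives the interval for $\lambda$.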

d.re = replicate(2000, mean(sample(x,30,rep=T))-mean(x))
UL = quantile(d.re, c(.975,.025))
mean(x) - UL
   97.5%     2.5% 
47.46667 53.33333     # nonparametric bootstrap CI

Addendum: By contrast, at each iteration, a parametric bootstrap re-samples from a Poisson distribution with the observed mean 50.36667 from the original data as its mean. Thus the assumption is made that data are from a Poisson distribution.

set.seed(2021)
x = rpois(30, 50)
a = mean(x);  a
[1] 50.36667
d.re = replicate(4000, mean(rpois(30, a))-mean(x))
UL = quantile(d.re, c(.975,.025)) 
mean(x) - UL
   97.5%     2.5% 
47.73333 53.03333     # parametric bootstrap CI

Note: When used with Poisson data, all four styles of CIs give about the same result. However, the nonparametric bootstrap CI is the only one of the four methods shown that would work for data that are not Poisson.
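
To illustrate that last point, here is a minimal sketch using overdispersed counts (variance larger than the mean), which is the situation described in the question. The data are hypothetical: a negative binomial sample with mean 50 and `size = 5`, both chosen arbitrarily for illustration, and the names `y`, `disp`, `ci.wald`, and `ci.boot` are not from the analysis above. It compares a Wald CI that assumes Poisson with a nonparametric bootstrap CI built the same way as before.

```r
# Hypothetical overdispersed counts (NOT the data from the question):
# negative binomial with mean 50; size = 5 is an arbitrary choice that
# gives population variance mu + mu^2/size = 550, far above the mean.
set.seed(2021)
n = 30
y = rnbinom(n, size = 5, mu = 50)

# Rough dispersion check: for Poisson data this ratio should be near 1.
disp = var(y) / mean(y)

# Wald CI for the mean, assuming Poisson (variance = mean).
ci.wald = mean(y) + qnorm(c(.025, .975)) * sqrt(mean(y) / n)

# Nonparametric bootstrap CI; makes no Poisson assumption.
d.re = replicate(2000, mean(sample(y, n, replace = TRUE)) - mean(y))
ci.boot = unname(mean(y) - quantile(d.re, c(.975, .025)))

disp
ci.wald
ci.boot
```

With data this overdispersed, one would expect the dispersion ratio to be well above 1 and the bootstrap interval to be noticeably wider than the Poisson-based Wald interval, which understates the uncertainty because it takes the variance to equal the mean.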

BruceET