5

In a binomial experiment, I have an estimate for the probability of 3 independent events A, B & C, each with a 95% confidence interval.

(Trivial example values)

P(A) = .12 (.05, .29)
P(B) = .16 (.08, .25)
P(C) = .06 (.02, .14)

I need to calculate P(no event) = P(no A) * P(no B) * P(no C),

which is (1 - P(A)) * (1 - P(B)) * (1 - P(C)), or (1 - .12) * (1 - .16) * (1 - .06).
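
For concreteness, the point estimate with these example values (a quick check in R):

prod(1 - c(0.12, 0.16, 0.06))
#> [1] 0.694848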

Now, my question arises when I do the same calculation using the lower and upper bounds of the confidence intervals to calculate a CI around P(no event). It seems logical to do, but I know that in some circumstances you can't just add or subtract the lower or upper bounds of C.I.'s without affecting the width, or rather the confidence level, of the newly calculated interval. (Adding two 95% C.I.'s would lead to something close to a 98% C.I., I've read somewhere recently.)

I'm just not sure if this is one of those circumstances, and if it is, how do I find / calculate the proper confidence level (85%? 90%?) to use in the first step in order to end up with a truly 95% C.I. at the end?

EDIT: This is an epidemiological study. Sample proportions for A, B, and C were obtained from the same sampled individuals; however, the three events are assumed to be independent (finding A does not affect the chance of finding B in the same individual).

  • 1
    Please provide more information about the nature of the data used to find these confidence intervals. In particular, is one sample used to estimate P(A) and a different sample used to estimate P(B), etc.? Or do you have occurrence data for A, B and C on each individual? – Gregg H May 21 '23 at 17:56
  • I changed (no success) to (no event), I think it is clearer now. – Dominic Comtois May 23 '23 at 00:04

3 Answers

2

Another option is to estimate the minimum-width credible interval of the posterior probability of observing none of the three events in a subsequent sample. This can be done quickly using sampling.

After observing $k_i$ occurrences of event $i$ out of $n$ samples, with event $i$ occurring independently with probability $1-\theta_i$, the posterior distribution of $\theta_i$ is:

$$p(\theta_i)\propto\text{Binomial}(n-k_i;n,\theta_i)\pi(\theta_i)$$

Where $\pi(\theta)$ is the prior distribution of $\theta_i$. Using, e.g., Jeffreys' prior on $\theta_i$, this becomes:

$$\theta_i\sim{\text{Beta}(n-k_i+1/2,k_i+1/2)}$$

The probability of observing none of the three events is

$$\theta_0=\prod_{i=1}^3\theta_i$$

By sampling $\theta_0$, we can get an estimate for the credible interval of $\theta_0$. It takes a fraction of a second to produce $10^6$ samples in R:

k <- c(12, 16, 6) # example number of occurrences
n <- 100 # example number of samples
m <- length(k)

# sample theta0
theta0 <- sort(Rfast::colprods(matrix(rbeta(m*1e6, n - k + 0.5, k + 0.5), m)))

# get the intervals that contain 95% of the samples
int <- rbind(theta0[1:50001], theta0[950000:1e6])

# find the minimum-width interval
int[,which.min(diff(int))] #> [1] 0.6011458 0.7701781

# compare to the equal-tailed interval
int[,25001] #> [1] 0.5983814 0.7676191
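
As a cross-check, the equal-tailed interval can also be reproduced with base R only, a sketch reusing k, n and m from above (exp(colSums(log(...))) stands in for Rfast::colprods):

set.seed(1) # for reproducibility of the sketch
draws <- matrix(rbeta(m*1e6, n - k + 0.5, k + 0.5), m)
quantile(exp(colSums(log(draws))), c(0.025, 0.975)) # equal-tailed 95% credible interval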

jblood94
  • 1,459
2
  • You have estimates $\hat{q}_a$, $\hat{q}_b$ and $\hat{q}_c$, which are (presumably) approximate, independent estimates of the probabilities for the independent events 'no A', 'no B' and 'no C'.

  • You have related standard errors for these estimates (which can be derived from confidence intervals).

  • You want to compute an estimate for the product $q = q_aq_bq_c$, the probability that none of A, B, or C occurs, assuming a model where they are independent.

You can estimate this by $$\hat{q} = \hat{q}_a\hat{q}_b\hat{q}_c$$

For the standard deviation, and the associated confidence interval, you can use a propagation-of-errors approximation: the formula for the variance of a product of independent variables.

$$\sigma_{XYZ}^2 = \mu_{X}^2 \mu_{Y}^2 \sigma_{Z}^2 + \mu_{X}^2 \sigma_{Y}^2 \mu_{Z}^2 + \sigma_{X}^2 \mu_{Y}^2 \mu_{Z}^2 + \mu_{X}^2 \sigma_{Y}^2 \sigma_{Z}^2 + \sigma_{X}^2 \mu_{Y}^2 \sigma_{Z}^2 + \sigma_{X}^2 \sigma_{Y}^2 \mu_{Z}^2 + \sigma_{X}^2 \sigma_{Y}^2 \sigma_{Z}^2$$
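
A minimal sketch of this approximation in R, assuming a hypothetical sample size of n = 100 to form the variances (in practice they would be derived from the reported confidence intervals):

p  <- c(0.12, 0.16, 0.06) # estimated event probabilities (example values from the question)
n  <- 100                 # hypothetical sample size
mu <- 1 - p               # estimates of q_a, q_b, q_c
s2 <- p * (1 - p) / n     # approximate variances of those estimates
q_hat <- prod(mu)                     # point estimate of q
var_q <- prod(s2 + mu^2) - prod(mu^2) # variance of a product of independent estimators
q_hat + c(-1, 1) * qnorm(0.975) * sqrt(var_q) # approximate 95% Wald-type interval for q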


Simulation

I did a simulation with $n=100$ and $p_a=p_b=p_c=0.5$, and interestingly, computing $\hat{q}$ indirectly via $\hat{q}_a\hat{q}_b\hat{q}_c$ leads to a smaller variance of the estimate than using the raw data directly (counting the cases with no a, no b and no c). This is because we are effectively using more data: 300 data points instead of 100.

[Figure: simulation comparing the two methods]

So the indirect estimate using the product $\hat{q} = \hat{q}_a\hat{q}_b\hat{q}_c$ has less variance than using counts of the events directly. But it might potentially be biased when the events a, b, c are not truly independent.
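
A minimal sketch of such a simulation (not the exact code behind the figure), with n = 100 and p_a = p_b = p_c = 0.5:

set.seed(42)
n <- 100; p <- 0.5; reps <- 1e4
direct <- indirect <- numeric(reps)
for (r in seq_len(reps)) {
  x <- matrix(rbinom(3 * n, 1, p), nrow = n) # columns are events a, b, c
  direct[r]   <- mean(rowSums(x) == 0)       # share of individuals with no event at all
  indirect[r] <- prod(1 - colMeans(x))       # product of the three marginal estimates
}
c(var_direct = var(direct), var_indirect = var(indirect)) # indirect is typically smaller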

  • Thanks for this. If I may ask, how did you get from the formula for variance (referenced on the other post) to this one? I don't see how the subtracted product of E()'s got cancelled out. – Dominic Comtois May 24 '23 at 22:13
  • 1
    @DominicComtois I used $$\prod_{i=1}^3(\sigma_i^2+\mu_i^2) - \prod_{i=1}^3 \mu_i^2$$ which gives 8+1 terms, from which two cancel each other, ending up with 7 terms. You get each combination of products of $\mu^2$ and $\sigma^2$ except the one where three $\mu^2$ are multiplied. – Sextus Empiricus May 25 '23 at 05:29
  • Alternatively, you apply the formula for independent variables twice: $$Var(UV) = Var(U)Var(V) + Var(U)E(V)^2 + E(U)^2Var(V)$$ – Sextus Empiricus May 25 '23 at 05:44
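
A quick numeric check (with made-up means and variances) that the expansion in the comments above and the two-step application of the product-variance formula agree:

mu <- c(0.88, 0.84, 0.94); s2 <- c(0.0011, 0.0013, 0.0006)  # made-up values
v1 <- prod(s2 + mu^2) - prod(mu^2)                          # expand prod(sigma^2 + mu^2) - prod(mu^2)
var_uv <- s2[1]*s2[2] + s2[1]*mu[2]^2 + mu[1]^2*s2[2]       # Var(UV) for the first two factors
v2 <- var_uv*s2[3] + var_uv*mu[3]^2 + (mu[1]*mu[2])^2*s2[3] # then Var((UV)W)
all.equal(v1, v2) #> [1] TRUE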
1

It sounds like you have data at an individual level, so you ought to be able to directly identify the count of people for which all three events are true. I would caution you against assuming independence, since it almost never holds in these types of studies. Instead, go back to your raw data and count the number of individuals for which all three events are true and then use a standard binomial confidence interval (e.g., Wilson score) to estimate the true joint probability of all three events.
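
For illustration, with a hypothetical count (say the joint outcome of interest was observed in 69 of 100 sampled individuals), a Wilson score interval can be obtained in R; prop.test() returns it when the continuity correction is turned off:

prop.test(x = 69, n = 100, correct = FALSE)$conf.int # 95% Wilson score interval for 69/100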

Ben
  • 124,856
  • I agree. The sample size is not large enough to capture cases of all three events being positive, that's why I'm looking for an alternative. – Dominic Comtois May 24 '23 at 21:59
  • If the sample size is not large enough to capture such cases, then you should proceed with this CI anyway, and let the resulting inference reflect the high level of uncertainty that comes from having such a sample (see here for CI where there are zero instances of a binary outcome in the sample). Do not use other methods that fool you into having greater confidence in an inference than you should. – Ben May 25 '23 at 03:43