
I am having a hard time understanding, conceptually, where this formula comes from. It does not make sense to me. Could someone shed some light on this:

The natural estimator of $p$ is $\hat p = \frac{X}{n}$, the sample fraction of successes. Since $\hat p$ is just $X$ multiplied by a constant, $\hat p$ has an approximately normal distribution. $E(\hat p) = p$ and $\sigma_{\hat p} = \sqrt{\frac{p(1-p)}{n}}$. Standardizing, this implies that:

$$P\left(-z_{\alpha/2} \lt \frac{\hat p- p}{\sqrt{p(1-p)/n}} \lt z_{\alpha/2}\right) \approx 1 - \alpha $$

Could someone derive this equation so that it makes sense to someone who's never seen this before?

  • Sure! Are you uncertain about the last formula only, or about the lines before it as well? I.e., are you familiar with the binomial distribution, the normal distribution and the central limit theorem? – MånsT Apr 13 '12 at 05:49

2 Answers


A binomial $B(n,p)$ variable is just the sum of $n$ independent Bernoulli variables with success probability $p$. Therefore the Central Limit Theorem applies, and if $n$ is "large enough" you can approximate the binomial distribution by a normal distribution with the same mean $np$ and the same variance $np(1-p)$. Dividing by $n$ scales the mean by $\frac{1}{n}$ and the variance by $\frac{1}{n^2}$, so $\frac{X}{n}$ can be approximated by a normal distribution with mean $p$ and variance $\frac{p(1-p)}{n}$. The corresponding standard deviation is then $\sqrt{\frac{p(1-p)}{n}}$.
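A quick simulation can make the mean and standard deviation of $\frac{X}{n}$ concrete. This is only a sketch using the Python standard library; the values `n = 100`, `p = 0.3`, the number of repetitions and the helper name `simulate_p_hat` are illustrative assumptions, not part of the question.

```python
import math
import random
import statistics

def simulate_p_hat(n, p, reps, seed=0):
    """Draw `reps` sample proportions X/n, where X counts the successes
    in n independent Bernoulli(p) trials."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

n, p = 100, 0.3  # illustrative values (assumption)
p_hats = simulate_p_hat(n, p, reps=5000)

# The simulated mean and sd should be close to p and sqrt(p(1-p)/n).
print("simulated mean of p_hat:", statistics.mean(p_hats))
print("theoretical mean p:     ", p)
print("simulated sd of p_hat:  ", statistics.pstdev(p_hats))
print("theoretical sd:         ", math.sqrt(p * (1 - p) / n))
```

With these settings the simulated and theoretical values should agree to roughly two decimal places; a histogram of `p_hats` would also look approximately bell-shaped.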

The question then becomes when $n$ is "large enough", and there is always a lot of discussion about that. One shortcoming of the normal approximation is that it is always symmetric, whereas the binomial distribution is not symmetric for $p \neq 0.5$. This in turn means that the resulting confidence interval may include values larger than 1 or smaller than 0, which obviously does not make sense. More seriously, the interval not only includes values that do not make sense, it also does not have the stated coverage of $1 - \alpha$.
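Both problems can be checked directly for a small sample. The sketch below assumes the usual interval $\hat p \pm z_{\alpha/2}\sqrt{\hat p(1-\hat p)/n}$ (with $\hat p$ plugged in for $p$ in the standard error) and computes its exact coverage by summing binomial probabilities; `wald_interval`, `exact_coverage` and `n = 20` are hypothetical names and values chosen for illustration.

```python
import math
from statistics import NormalDist

def wald_interval(x, n, alpha=0.05):
    """Normal-approximation interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def exact_coverage(n, p, alpha=0.05):
    """Probability, under Binomial(n, p), that the interval contains the true p."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = wald_interval(x, n, alpha)
        if lo <= p <= hi:
            total += math.comb(n, x) * p**x * (1 - p) ** (n - x)
    return total

n = 20  # illustrative sample size (assumption)
print(wald_interval(1, n))      # the lower limit falls below 0
print(exact_coverage(n, 0.05))  # well below the nominal 0.95
print(exact_coverage(n, 0.5))   # close to the nominal 0.95
```

The contrast between $p = 0.05$ and $p = 0.5$ shows why "large enough" depends on $p$ as well as on $n$.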

There is also a lot of discussion about when the normal approximation makes sense, which depends not only on $n$ but also on $p$. This Wikipedia article, http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval, is a very good starting point and also discusses better alternatives. The normal approximation is still widely used because it usually does a decent job and requires little calculation, which was important in the days before computers.

Erik
  • I know you are right. However, how do you get $\text{VAR}(\widehat{p})=p(1-p)$? Shouldn't $\text{VAR}(\widehat{p})=\text{VAR}(\frac{X}{n})=\frac{1}{n^{2}}\cdot\text{VAR}(X)=\frac{1}{n^{2}}\cdot np(1-p)=\frac{p(1-p)}{n}$? This leaves us with an unwanted $n$ in the denominator. I know I am thinking about this the wrong way - can you point me in the right direction? – sparc_spread Apr 03 '14 at 05:29
  • @sparc_spread No, in fact you are correct and it is a typo that I left it out. You can see it reappear in the formula for the standard deviation. Will edit. – Erik Apr 03 '14 at 07:12
  • Thanks for fixing! Altogether, your answer has really helped me get better intuition around the sampling distribution and standard error of proportions. – sparc_spread Apr 03 '14 at 14:17

This is an immediate consequence of the normal approximation to the sampling distribution of the mean (proportion). Note that if $Z$ were a standard normal RV (with mean 0 and sd 1), then we would have:

$$ \mbox{P}\left( -z_{\alpha/2} < Z < z_{\alpha/2} \right) = 1-\alpha. $$

Now substitute the centered and scaled sample proportion for $Z$, i.e. let

$$ Z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} $$

Because this $Z$ is only approximately standard normal, the equality becomes an approximation, and that gives you the probability statement (and hence the confidence interval) they presented.
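For completeness, here is a sketch of how the probability statement rearranges into an interval for $p$. The steps only multiply through by the standard deviation and isolate $p$:

$$ \begin{aligned}
-z_{\alpha/2} < \frac{\hat p - p}{\sqrt{p(1-p)/n}} < z_{\alpha/2}
&\iff -z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} < \hat p - p < z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \\
&\iff \hat p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} < p < \hat p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}
\end{aligned} $$

The limits still depend on the unknown $p$. The usual further approximation, which is an assumption here since the quoted text stops before this step, replaces $p$ inside the square root with $\hat p$, giving $\hat p \pm z_{\alpha/2}\sqrt{\hat p(1-\hat p)/n}$.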

AdamO