
I had to translate several given statistics equations into code, and I came across this formula:

Variance of a simple random sample $= \frac{p(1-p)}{n-1}$

The sample in question consists of test letters sent to measure the efficiency of a postal operator. Each letter has a value of 1 if it's delivered on time and a value of 0 otherwise.

The parameter $n$ is the total number of sent letters.

The parameter $p$ is the "number of letters delivered on time" and an estimator for the true efficiency. It's defined as the sum of all letter values from 1 to $n$.

I'm not an expert in statistics, but as far as I know the variance is the sum of squared errors from the sample mean divided by $(n-1)$. I couldn't find any explanation for this online. Can anyone explain this formula?

amoeba

2 Answers

21

I think it has been pretty much covered by whuber, but I just wish to expand on the use of $n-1$; where it comes from and whether it applies here.

In an ordinary sample variance, many people use an $n-1$ denominator to make the usual sum-of-squares-based variance estimate unbiased (not everyone prefers unbiasedness to other properties though). This is called Bessel's correction but appears to have been derived by Gauss. A simple derivation is here

Presumably whoever wrote that formula has concluded that the same should be done with the usual variance estimate for a binomial proportion, which is generally estimated as $p(1-p)/n$ (where $p$ is the sample proportion).

Can we see whether the expectation of the usual estimator of variance is the population value?

Take $\pi$ to be the corresponding population proportion. That is, does $\text{E}[p(1-p)/n]=\pi(1-\pi)/n$?

Equivalently, does $\text{E}[p(1-p)]=\pi(1-\pi)$?

Note that if $X$ is the observed count, $p = X/n$, where under the usual sampling assumptions, $X\sim \text{binomial}(n,\pi)$.

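Using $\text{E}[X]=n\pi$ and $\text{E}[X^2]=\text{Var}(X)+(\text{E}[X])^2=n\pi(1-\pi)+n^2\pi^2$:

\begin{eqnarray}
\text{E}[p(1-p)] &=& \frac{1}{n^2}\,\text{E}[X(n-X)]\\
&=& \frac{1}{n^2}\left(n\,\text{E}[X] - \text{E}[X^2]\right)\\
&=& \frac{1}{n^2}\left(n^2\pi - n\pi(1-\pi) - n^2\pi^2\right)\\
&=& \frac{1}{n^2}\left(n^2\pi - n\pi + n\pi^2 - n^2\pi^2\right)\\
&=& \frac{1}{n^2}\cdot n\pi\,(n - 1 + \pi - n\pi)\\
&=& \frac{1}{n^2}\cdot n\pi\,(n - 1)(1-\pi)\\
&=& \frac{n-1}{n}\,\pi(1-\pi)
\end{eqnarray}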

Hence $\text{E}[p(1-p)/(n-1)]=\pi (1-\pi)/n$.

It looks like (assuming I made no errors) this is the case here too: the usual estimator of the variance of the proportion is biased, and can be made unbiased by multiplying it by $\frac{n}{n-1}$.
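As a sanity check on the algebra, here is a minimal Python sketch (not part of the original derivation) that computes $\text{E}[p(1-p)]$ exactly by summing over the binomial distribution with rational arithmetic. The function name `expected_plugin` and the values $n=10$, $\pi=3/5$ are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def expected_plugin(n, pi):
    """Exact E[p(1-p)] for p = X/n with X ~ Binomial(n, pi).

    `pi` should be a Fraction so that the whole sum stays exact.
    """
    total = Fraction(0)
    for x in range(n + 1):
        # Binomial probability of observing exactly x on-time letters.
        prob = comb(n, x) * pi**x * (1 - pi) ** (n - x)
        p = Fraction(x, n)  # sample proportion for this outcome
        total += prob * p * (1 - p)
    return total


# Assumed example values, chosen only for illustration.
n, pi = 10, Fraction(3, 5)
e = expected_plugin(n, pi)

# The plug-in estimate p(1-p) is too small on average by a factor (n-1)/n ...
assert e == Fraction(n - 1, n) * pi * (1 - pi)
# ... so dividing by (n-1) instead of n recovers pi(1-pi)/n exactly.
assert e / (n - 1) == pi * (1 - pi) / n
```

Because the check uses `Fraction` rather than floats, the two assertions test exact equality rather than agreement up to rounding.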

(Edit: In retrospect this is obvious; one need only apply the ordinary bias calculation for a sample variance to a sample of 0's and 1's.)

Which means it appears that the formula you have has been chosen to give an unbiased estimate.
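To tie this back to the question's definition of variance, the following sketch (an illustration only; the helper names and the data of 4 late and 6 on-time letters are assumptions) computes the sum-of-squares sample variance of 0/1 letter values and compares it with the formula. For 0/1 data the sample variance with an $n-1$ divisor works out to $\frac{n}{n-1}p(1-p)$, so dividing it by $n$, as one does for the variance of a mean, gives $\frac{p(1-p)}{n-1}$.

```python
from fractions import Fraction


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def variance_of_proportion(values):
    """The formula from the question, p(1 - p) / (n - 1), for 0/1 values."""
    n = len(values)
    p = Fraction(sum(values), n)  # sample proportion of letters on time
    return p * (1 - p) / (n - 1)


# Assumed data: 4 late letters (0) and 6 on-time letters (1).
letters = [0] * 4 + [1] * 6
n = len(letters)
p = Fraction(sum(letters), n)

s2 = sample_variance(letters)

# The sample variance of the 0/1 values is n/(n-1) * p(1-p), not p(1-p) ...
assert s2 == Fraction(n, n - 1) * p * (1 - p)
# ... and the estimated variance of the proportion is s2 / n.
assert s2 / n == variance_of_proportion(letters)

print(float(s2), float(variance_of_proportion(letters)))
```

The two functions answer different questions: `sample_variance` estimates the spread of individual letter values, while `variance_of_proportion` estimates the sampling variance of the on-time proportion itself.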

(I wonder why people seem happy to use a biased variance estimate for binomials when there's such an insistence on using an unbiased one in other situations. I have no good answer for that; I'll continue using biased estimators whenever it makes sense to me, which seems to be rather more often than most people do.)

Glen_b
  • Great point on the "double standard" in bias acceptance for binomial variance compared to "regular" variance. None of the (many) introductory stats textbooks I use even mentions the p*(1-p)/(n-1) form. One possible explanation is the requirement of a high sample size for proportions in the first place. – Markus Loecher Jan 23 '17 at 08:35
  • I don't think this is correct. The variance of sample is p(1-p) and does not involve n. The variance of the estimated proportion p is p(1-p)/n in the same way that the variance of the estimated mean is s^2 / n – Hernan Nov 03 '20 at 10:50
  • The sample variance is not p(1-p) .... > var(rep(c(0,1),c(4,6))) .... [1] 0.2666667 – Glen_b Sep 24 '21 at 06:56
0

Regarding the subquestion on why divide by (n-1) instead of n?

I've lost my reference, but the legend that comes down to me tells that the inventor of the (n-1) divisor thought it absurd to calculate the variance for a population having a single member. Dividing by (n-1) where n=1 produces an imponderable undefined condition, preventing the absurdity.

Clearly, it hardly matters at all in practice. The irreverent might argue that it's more trouble than it's worth.

(Now, who was it that invented the (n-1) divisor? Or should I say "whom"?)