
Until now, I believed that the sample variance

$$S^2 = \frac{1}{n-1}\sum_{i=1}^n{\left(X_i-\bar{X}\right)^2}$$

is an unbiased estimator of $\sigma^2$.

However, I found the following statement:

Considering the sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i -\bar{y}\right)^2$$

it can be shown (see Appendix A, Derivations) that

$$E(s^2) = \frac{N}{N-1}\sigma^{2}$$

This is an example based on simple random sampling without replacement. It says that $S^2$ is a biased estimator of $\sigma^2$.

So I am wondering: does "$S^2$ is an unbiased estimator of $\sigma^2$" hold only in some specific cases? How should this result for simple random sampling be understood?

Alexis
  • Others should be aware that $n$ is the sample size, $N$ is the population size, and the sample is drawn from the finite population without replacement. – Matthew Gunn Sep 17 '16 at 23:22
  • Some related questions: https://stats.stackexchange.com/questions/576101/unbiased-estimator-of-population-variance-for-sampling-without-replacement https://stats.stackexchange.com/questions/70124/unbiased-estimator-of-variance-for-samples-without-replacement https://stats.stackexchange.com/questions/70086/could-bessels-correction-make-sample-variance-estimation-even-more-biased https://stats.stackexchange.com/questions/588753/why-does-the-finite-population-variance-requires-the-n-1-factor-in-the-literat – Henry Mar 10 '23 at 01:36

3 Answers


When sampling from a finite population without replacement, the observations are negatively correlated with each other, and the sample variance $s^2 = \frac{1}{n-1} \sum_i \left( x_i - \bar{x} \right)^2$ is a slightly biased estimate of the population variance $\sigma^2$ (defined here with divisor $N$, the population size).

The derivation in this link from Robert Serfling provides a clear explanation of what's going on. The author first proves that if the observations in a sample have constant covariance (i.e. $\mathrm{Cov}\left(x_i, x_j \right) = \gamma$ for all $i\neq j$), then: $$ E[s^2] = \sigma^2 - \gamma$$
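For readers without the linked derivation at hand, the identity $E[s^2] = \sigma^2 - \gamma$ can be sketched as follows. This is a reconstruction, not necessarily the route Serfling takes; it assumes every $x_i$ has the same mean, written $\mu$ here (a symbol introduced only for this sketch), and the same variance $\sigma^2$:

$$\begin{aligned}
E\Big[\sum_{i=1}^n x_i^2\Big] &= n\left(\sigma^2 + \mu^2\right), \\
E\left[\bar{x}^2\right] &= \mathrm{Var}\left(\bar{x}\right) + \mu^2 = \frac{\sigma^2}{n} + \frac{(n-1)\gamma}{n} + \mu^2, \\
(n-1)\,E\left[s^2\right] &= E\Big[\sum_{i=1}^n x_i^2\Big] - n\,E\left[\bar{x}^2\right] = (n-1)\left(\sigma^2 - \gamma\right).
\end{aligned}$$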

For independent draws (hence $\gamma = 0$), you have $E[s^2] = \sigma^2$ and the sample variance is an unbiased estimate of the population variance. But the issue you have with sampling without replacement from a finite population is that your draws are negatively correlated with each other!
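The independent case can be checked exactly on a toy example. The sketch below assumes a small made-up population (the values and the names `sample_variance` and `population_variance` are illustrative, not from the original), enumerates every equally likely ordered sample drawn with replacement, and averages $s^2$ using exact fractions:

```python
from fractions import Fraction
from itertools import product


def sample_variance(values):
    """Sample variance with the n - 1 divisor, computed exactly."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def population_variance(values):
    """Population variance with the N divisor, computed exactly."""
    N = len(values)
    mean = Fraction(sum(values), N)
    return sum((v - mean) ** 2 for v in values) / N


# Hypothetical toy population, chosen only for illustration.
population = [1, 2, 4, 7, 11]
n = 3  # sample size

# With replacement, every ordered tuple of n draws is equally likely,
# and the draws are independent (gamma = 0).
samples = list(product(population, repeat=n))
expected_s2_with_replacement = sum(sample_variance(s) for s in samples) / len(samples)

print("E[s^2] with replacement:", expected_s2_with_replacement)
print("population variance:    ", population_variance(population))
```

Because the arithmetic is exact, the two printed values can be compared with `==` rather than with a floating-point tolerance.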

In the case of sampling without replacement from a population of size $N$: $$ \text{For $i\neq j$ }\quad \mathrm{Cov}\left(x_i, x_j \right) = \frac{-\sigma^2}{N-1}$$ Hence: $$ E\left[s^2\right] = \frac{N}{N-1}\sigma^2 $$
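One way to see where this covariance comes from (a sketch under the assumption that $\sigma^2$ uses divisor $N$, and not necessarily the argument in the linked derivation): the covariance of any two distinct draws does not depend on the sample size, so take $n = N$. The sample is then the whole population, its sum is the fixed population total, and therefore

$$ 0 = \mathrm{Var}\Big(\sum_{i=1}^{N} x_i\Big) = N\sigma^2 + N(N-1)\gamma \quad\Longrightarrow\quad \gamma = -\frac{\sigma^2}{N-1}. $$

Substituting into $E[s^2] = \sigma^2 - \gamma$:

$$ E\left[s^2\right] = \sigma^2 + \frac{\sigma^2}{N-1} = \frac{N}{N-1}\sigma^2. $$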

Matthew Gunn

The sample variance is indeed biased for a finite population under simple random sampling without replacement. To get an unbiased result, multiply the sample variance by $\frac{N-1}{N}$, where $N$ is the population size.

I’m an engineer, not a mathematician, so my "proof" was to build a complete sampling distribution in Excel from a finite population, assuming sampling without replacement. I found that the mean of the sampling distribution of the sample variances ($s^2$) did not equal the population variance, so $s^2$ is biased in this case. I don’t know why the literature so often ignores this fact. But if I multiply the mean $s^2$ by $\frac{N-1}{N}$, where $N$ is the population size, then lo and behold, the product is exactly equal to the population variance.
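The same kind of complete enumeration can be reproduced outside a spreadsheet. The sketch below is not the Excel workbook described above; it assumes a small made-up population, and the helper names `variance` and `mean_s2_without_replacement` are hypothetical. It lists every equally likely sample without replacement and averages $s^2$ exactly:

```python
from fractions import Fraction
from itertools import combinations


def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor` (exact)."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / divisor


def mean_s2_without_replacement(population, n):
    """Average of s^2 (divisor n - 1) over all size-n samples without replacement."""
    samples = list(combinations(population, n))
    return sum(variance(s, n - 1) for s in samples) / len(samples)


# Hypothetical toy population, chosen only for illustration.
population = [1, 2, 4, 7, 11]
N = len(population)
sigma2 = variance(population, N)  # population variance with divisor N

for n in range(2, N + 1):
    mean_s2 = mean_s2_without_replacement(population, n)
    corrected = mean_s2 * Fraction(N - 1, N)
    print(f"n={n}: mean s^2 = {mean_s2}, corrected = {corrected}, sigma^2 = {sigma2}")
```

Order does not matter for $s^2$, so enumerating unordered combinations gives the same average as enumerating ordered draws.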

Intuitively, as my sample size $n$ increases and eventually equals the population size $N$ ($n=N$), I should expect the sample variance to approach the population variance if the sample variance is unbiased. That does not happen, since the sample sum of squares is divided by $n-1$ and the population sum of squares by $N$. Multiplying the sample variance by $\frac{N-1}{N}$ solves this dilemma.
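The limiting case can be written out directly. Assuming the population variance $\sigma^2$ uses divisor $N$, and writing $\mu$ for the population mean (a symbol introduced only for this illustration), a sample of size $n = N$ is the whole population, so $\bar{x} = \mu$ and

$$ s^2 = \frac{1}{N-1}\sum_{k=1}^{N}\left(x_k - \mu\right)^2 = \frac{N}{N-1}\cdot\frac{1}{N}\sum_{k=1}^{N}\left(x_k - \mu\right)^2 = \frac{N}{N-1}\sigma^2, $$

which gives $\frac{N-1}{N}s^2 = \sigma^2$ exactly.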

Alexis

I don't know where your statements come from, but in the way you present them they are false. Taking the variance of the sample directly (that is, dividing by $n$) gives a biased estimator, but using the sample variance (dividing by $n-1$) gives an unbiased estimator.

I think your statement comes from different conflicting sources or your source uses different notations in different parts. Maybe "$s^2$" means variance ($n$) in one page and sample variance ($n-1$) in the other. The fact that one formula uses "$n$" with the same meaning the other uses "$N$" makes me suspect that they aren't consistent.

Pere
  • Sorry, I forgot to mention, as Gunns said: "that n is the sample size, N is the population size, and the sample is drawn from the finite population without replacement." – Bratt Swan Sep 18 '16 at 04:25