2

I am aware that variants of this question are asked often, e.g., here. I am fine with the sample variance being divided by $n-1$, since that is what makes it an unbiased estimator of the population variance. So far, so good.

However, when looking at stratified sampling, many resources (this one, for instance) indicate that the within-stratum variance should also be obtained by dividing by $N_h - 1$. It is not clear to me why this should be. Isn't the whole idea that each stratum is homogeneous within itself and that the strata are heterogeneous between them? If so, shouldn't each stratum be treated as a full population in its own right, leading to division by $N_h$?

For reference, the formula provided for the population variance at that resource is:

$S_h^2=\frac{1}{N_h-1} \sum_{i=1}^{N_h}\left(Y_{h i}-\bar{Y}_h\right)^2$

and

$\bar{Y}_h = \frac{1}{N_h}\sum_{i=1}^{N_h}Y_{hi}$
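To make the two candidate divisors concrete, here is a minimal Python sketch that evaluates the formulas above on one fully enumerated stratum. The data values and the function name `stratum_variances` are made up for illustration and do not come from the linked resource.

```python
def stratum_variances(values):
    """Return (mean, variance with N_h - 1 divisor, variance with N_h divisor)."""
    n_h = len(values)                      # N_h: every unit in the stratum
    mean_h = sum(values) / n_h             # \bar{Y}_h
    ss = sum((y - mean_h) ** 2 for y in values)
    return mean_h, ss / (n_h - 1), ss / n_h


# Hypothetical stratum, listed in full (no sampling involved here).
stratum = [3.0, 5.0, 7.0, 9.0]
mean_h, s2_h, sigma2_h = stratum_variances(stratum)

print("mean:", mean_h)
print("divide by N_h - 1:", s2_h)
print("divide by N_h:    ", sigma2_h)
```

The two results differ only by the factor $N_h/(N_h-1)$, so the question is really about which convention is being used and why.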

Tryer
  • 275

1 Answer

0

The principle is the same. The "true" mean within each stratum ($\mu_h$) is unknown and must be estimated from the data (the sample mean, $\bar{Y}_h$). In the process, we lose a degree of freedom.

Another way to think about it: the variance is computed from the deviations $(Y_{hi}-\bar{Y}_h)$, for $i=1,\ldots,N_h$. While there are $N_h$ of these, their sum is always zero, so only $N_h-1$ of them are free to vary, which leaves $N_h-1$ degrees of freedom.
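The zero-sum constraint on the deviations can be checked directly. The following is a small Python sketch using made-up values; the helper name `deviations` is hypothetical.

```python
def deviations(values):
    """Deviations of each value from the mean of the same values."""
    mean = sum(values) / len(values)
    return [y - mean for y in values]


# Hypothetical observations within one stratum.
stratum = [3.0, 5.0, 7.0, 9.0]
devs = deviations(stratum)

total = sum(devs)                  # the deviations sum to zero
last_from_rest = -sum(devs[:-1])   # so the last one is fixed by the others

print(devs, total, last_from_rest)
```

Once the first $N_h-1$ deviations are known, the last one carries no new information, which is the sense in which a degree of freedom is lost.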

Doctor Milt
  • 3,056
  • I edited my post to actually add the formula. That formula does not seem to depend on any sample at all. The summation goes from 1 to the entire strata population size, $N_h$, in both the calculation of the mean and the variance. Hence the confusion. – Tryer Feb 07 '24 at 16:46
  • 1
    Ah, I see what you mean. The observations $Y_{h1}, Y_{h2}, \ldots, Y_{hN_h}$ are the entire subgroup, rather than a sample drawn from this subgroup. Is this correct? – Doctor Milt Feb 07 '24 at 16:54
  • Exactly! that is precisely my source of confusion. There does not seem to be any sampling going on at all in this equation. – Tryer Feb 07 '24 at 16:57
  • 2
    @Tryer, it's common to use $Y_i$ to denote a random variable equal to $y_i W_i$, where $W_i$ is a binary indicator equal to 1 if $y_i$ is in the sample and 0 otherwise. So yes, sampling is happening here. If there weren't any sampling, you'd be correct that it would not be appropriate to divide by $(n_h - 1)$. – num_39 Feb 15 '24 at 06:30
  • I see. So, the entire stratum population is not being evaluated to obtain the stratum's variance. Rather, a sample of size $N_h$ is being drawn from the $h$th stratum and its variance is being used to estimate the stratum's population variance. That makes more sense. Thank you! – Tryer Feb 15 '24 at 15:43
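One way to make the sampling discussion in the comments concrete is the sketch below. It is an illustration under stated assumptions, not what the linked resource does. It enumerates every simple random sample without replacement of a hypothetical size `n_h` from a small made-up stratum, and averages the usual sample variance (divisor $n_h - 1$). By a standard finite-population result, that average equals the stratum variance defined with the $N_h - 1$ divisor, not the one with the $N_h$ divisor. The data, `n_h`, and the helper `variance` are all assumptions introduced here.

```python
from itertools import combinations


def variance(values, ddof):
    """Sum of squared deviations from the mean, divided by len(values) - ddof."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - ddof)


# Hypothetical stratum of N_h = 5 units, listed in full.
stratum = [2.0, 4.0, 4.0, 7.0, 9.0]
n_h = 3  # assumed sample size drawn from the stratum

# Every possible sample of size n_h drawn without replacement.
samples = list(combinations(stratum, n_h))

# Average of the sample variances (divisor n_h - 1) over all samples.
avg_s2 = sum(variance(s, 1) for s in samples) / len(samples)

S2 = variance(stratum, 1)      # stratum variance with N_h - 1 divisor
sigma2 = variance(stratum, 0)  # stratum variance with N_h divisor

print("average sample variance:", avg_s2)
print("N_h - 1 divisor:", S2)
print("N_h divisor:    ", sigma2)
```

Under this reading, defining $S_h^2$ with $N_h - 1$ is a convention that makes the sample variance an unbiased estimator of it under sampling without replacement.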