The formula you show estimates how large a sample you need for a probability of $(1-\alpha)$ that the sample mean falls within a specified "error" of the true mean. That depends, as you say, on knowing the actual standard deviation $\sigma$ in the population (or the variance $\sigma^2$).
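If your formula is the usual normal-approximation version, $n = (z_{\alpha/2}\,\sigma/E)^2$ with $E$ the allowed error, it takes one line of R to evaluate; the values of sigma and error below are just placeholders, not recommendations.

alpha <- 0.05                  # desired coverage is 1 - alpha
sigma <- 1                     # assumed population SD (placeholder)
error <- 0.25                  # maximum allowed distance from the true mean (placeholder)
z <- qnorm(1 - alpha/2)        # normal quantile for two-sided coverage
ceiling((z * sigma / error)^2) # required sample size, rounded up
# [1] 62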
You can use the sample standard deviation as an estimate of the population standard deviation similarly to how you use the sample mean to estimate the population mean. There are two potential problems, however.
First, the sample standard deviation is a biased (low) estimate of the population standard deviation, even though it is the square root of an unbiased estimate of the population variance.
Second, the sample standard deviation itself has a wide distribution. For a normal distribution, the variance of the sample variance is:
$$ \frac{2\sigma^4}{n-1},$$
where $\sigma^2$ is the population variance and $n$ is the sample size. There isn't any break at a particular value of $n$; the variance of the sample variance just decreases inversely with $(n-1)$.
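Both points are easy to check by simulation, drawing directly from a standard normal so that $\sigma = \sigma^2 = 1$; this is a quick sketch, not part of the example below.

set.seed(42)
n <- 10
vars <- replicate(10000, var(rnorm(n)))  # 10000 sample variances at n = 10
var(vars)        # should be near 2 * sigma^4 / (n - 1) = 2/9, about 0.22
mean(sqrt(vars)) # average sample SD; below 1 at this n (about 0.97 in theory)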
To see the relative advantages of the two proposed methods for estimating $\sigma$, you can try a simulation. Here's an example in R. I made a population of 1000 normally distributed values with a mean of 2 and a standard deviation of 1.
set.seed(1234)
normData <- rnorm(1000,mean=2,sd=1)
sd(normData)
# [1] 0.9973377
I then took 1000 samples of size $n=10$ from that population and collected the distributions of the SD estimates from the two methods: the sample SD, and range/4.
sdDist10 <- NULL
rangeDist10 <- NULL
for (i in 1:1000) {
    sam <- sample(normData, 10, replace = TRUE)               # resample from the population
    sdDist10 <- c(sdDist10, sd(sam))                          # SD-based estimate
    rangeDist10 <- c(rangeDist10, (max(sam) - min(sam)) / 4)  # range/4 estimate
}
I plotted the distributions of the SD estimates obtained in those two ways:
plot(density(rangeDist10), col = "red", xlab = "SD estimate", main = "For n = 10", bty = "n")
lines(density(sdDist10))
legend("topleft",legend="red, from range\nblack, from SD",bty="n")

As expected with a small sample, the estimates based on the sample SD had a wide distribution but were fairly symmetric about the true value. The estimates based on the range (in red) tended to be low, with a similarly wide distribution. I don't see that the second option, based on the range, is an improvement at all.
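A quick numeric summary of those two sets of estimates makes the same point; the exact values depend on the seed.

mean(sdDist10)     # centered fairly close to the true SD of about 1
mean(rangeDist10)  # centered noticeably below 1
sd(sdDist10)       # spread of the SD-based estimates
sd(rangeDist10)    # spread of the range-based estimates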
Play with this approach for different values of the sample size $n$. If you repeat the above for 1000 samples of size $n = 100$, the sample-SD values have a much narrower distribution around a peak near SD = 1, while the range-based estimates tend to over-estimate the true SD.
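For concreteness, here is the same loop and plot with samples of size 100; only the sample size changes.

sdDist100 <- NULL
rangeDist100 <- NULL
for (i in 1:1000) {
    sam <- sample(normData, 100, replace = TRUE)
    sdDist100 <- c(sdDist100, sd(sam))
    rangeDist100 <- c(rangeDist100, (max(sam) - min(sam)) / 4)
}
plot(density(rangeDist100), col = "red", xlab = "SD estimate", main = "For n = 100", bty = "n")
lines(density(sdDist100))
legend("topleft", legend = "red, from range\nblack, from SD", bty = "n")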
A take-home point is that you might need a surprisingly large sample just to get a reliable estimate of the variance, an estimate you need in order to perform power calculations for study design. You can also use this approach to see how frequently your sample mean values at different sample sizes are within a specified "error" of the actual mean.
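For instance, to see how often the mean of a sample of size 10 from this population falls within 0.25 of the population mean (the sample size and the "error" of 0.25 are arbitrary choices for illustration):

popMean <- mean(normData)
error <- 0.25
hits <- replicate(1000, abs(mean(sample(normData, 10, replace = TRUE)) - popMean) <= error)
mean(hits)  # proportion of samples whose mean lands within the specified error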
It's good to learn how to do quick simulations like this. In complex designs, simulations are often the best way to estimate the sample size needed for a study.