
I am confused about what I've seen in my textbook to calculate the sample size for a specified confidence interval width (for a mean).

The book says that whether you know $\sigma$ or have an estimate $S$ from a previous study, you can solve for $n$ in

$$z_{\alpha /2}\frac{\sigma}{\sqrt{n}}=x$$

where $x$ is half the width of the interval.

What doesn't make sense to me is that if you use that formula with a value $S$ instead of $\sigma$, shouldn't you be using a $t$ value instead of $z$? That requires knowing $n$ itself, but maybe you could use an iterative method: choose an initial $n$ and keep updating the value of $t$ until convergence.
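The iterative scheme just described could be sketched in R like this. It's only a sketch: the starting value (the $z$-based $n$) and the stopping rule are my own choices, not anything from the textbook.

```r
# Iteratively choose n for a t-based interval of half-width `margin`,
# given a pilot estimate S of the standard deviation.
sample_size_t <- function(S, margin, alpha = 0.05) {
  # start from the z-based sample size
  n <- ceiling((qnorm(1 - alpha / 2) * S / margin)^2)
  for (i in 1:100) {  # cap the iterations in case n oscillates
    n_new <- ceiling((qt(1 - alpha / 2, df = n - 1) * S / margin)^2)
    if (n_new == n) break
    n <- n_new
  }
  n
}

sample_size_t(S = 1, margin = 0.5)  # 18, vs 16 from the z formula
```

The $t$-based $n$ is always at least as large as the $z$-based one, since $t_{\alpha/2,n-1} > z_{\alpha/2}$ for every finite $n$.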

The reason I think $t$ should be used is that when you actually go and collect your data, you will be constructing the interval as $\bar{Y} \pm t_{\alpha/2,n-1}\frac{S}{\sqrt{n}}$, so if you used $z$ to calculate the sample size, you probably won't end up with the correct width, will you?

2 Answers


Update: @Paje points out that the theory behind these computations assumes the population is normally distributed; when normality is violated, the theory doesn't hold exactly. It's an important point, but keep in mind that "not exact" doesn't necessarily mean "wrong"; it means "approximate". The relevant question is: Is the normality assumption satisfied well enough to justify the analysis? So for fun, I've updated the simulation to sample from (a) the normal distribution; (b) the Laplace distribution, which is symmetric with heavier tails than the normal; (c) the log-normal distribution, which is skewed to the right.


Terminology: The half-width of a confidence interval is known as its margin of error. I'll use "margin" and "half-width" interchangeably.

It seems that by "end up with the correct width" you mean that the half-width of the confidence interval is exactly $\operatorname{margin}$ if we calculate the sample size $n$ to achieve margin of error $\operatorname{margin}$. Not quite.

If we have an accurate estimate $\hat{\sigma}$ of the true standard deviation $\sigma$, then we don't need to estimate $\sigma$ from the experimental data. We can plug in $\hat{\sigma}$ in the formula for the confidence interval and the margin of error $z_{\alpha/2}\hat{\sigma}/\sqrt{n}$ is fixed.

If we decide to estimate $\sigma$ with the sample standard deviation $s$, then the margin of error is $t_{\alpha/2,n-1}s/\sqrt{n}$ as you point out. And since $s$ is a random variable, it can be either smaller or bigger than $\sigma$. In other words, if we repeat the experiment with the same sample size $n$, the margin of error $t_{\alpha/2,n-1}s/\sqrt{n}$ will vary from replication to replication because $s$ varies. It's never going to be exactly equal to $\operatorname{margin}$.

What matters is that the coverage of the confidence interval is $100(1-\alpha)$%, i.e., if we repeat the experiment many times, $100(1-\alpha)$% of the confidence intervals thus constructed will contain the true mean. As long as the "known" $\hat{\sigma}$ is an accurate estimate of the true standard deviation $\sigma$ and the distribution is not too asymmetric, the sample size calculation $n \approx (z_{\alpha/2}\hat{\sigma}/\operatorname{margin})^2$ results in (approximately) correct coverage for the $z$ and $t$ confidence intervals. The approximation gets better with larger sample size, i.e., smaller margin of error.
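As a quick check of the formula, with $\alpha = 0.05$ and a "known" $\hat{\sigma}$ that is 2.5% above the true standard deviation of 1 (the settings used in the simulation below), the sample sizes for the normal rows of the table follow directly:

```r
alpha <- 0.05
z <- qnorm(1 - alpha / 2)       # 1.959964
sigma_hat <- 1 * 1.025          # "known" sigma: 2.5% above the true sd

ceiling((z * sigma_hat / 0.50)^2)  # 17
ceiling((z * sigma_hat / 0.10)^2)  # 404
ceiling((z * sigma_hat / 0.05)^2)  # 1615
```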

| distribution | mean | std.dev | margin | n    | z_coverage | z_lower | z_upper | t_coverage | t_lower | t_upper |
|--------------|------|---------|--------|------|------------|---------|---------|------------|---------|---------|
| normal       | 0    | 1       | 0.50   | 17   | 0.9498     | 0.0264  | 0.0238  | 0.9456     | 0.0268  | 0.0276  |
| laplace      | 0    | 1.41    | 0.50   | 33   | 0.9564     | 0.0202  | 0.0234  | 0.9544     | 0.0218  | 0.0238  |
| lognormal    | 1.65 | 2.16    | 0.50   | 76   | 0.9560     | 0.0060  | 0.0380  | 0.9146     | 0.0818  | 0.0036  |
| normal       | 0    | 1       | 0.10   | 404  | 0.9598     | 0.0212  | 0.0190  | 0.9568     | 0.0224  | 0.0208  |
| laplace      | 0    | 1.41    | 0.10   | 808  | 0.9588     | 0.0212  | 0.0200  | 0.9538     | 0.0246  | 0.0216  |
| lognormal    | 1.65 | 2.16    | 0.10   | 1886 | 0.9562     | 0.0160  | 0.0278  | 0.9476     | 0.0358  | 0.0166  |
| normal       | 0    | 1       | 0.05   | 1615 | 0.9530     | 0.0254  | 0.0216  | 0.9488     | 0.0274  | 0.0238  |
| laplace      | 0    | 1.41    | 0.05   | 3229 | 0.9562     | 0.0216  | 0.0222  | 0.9496     | 0.0252  | 0.0252  |
| lognormal    | 1.65 | 2.16    | 0.05   | 7541 | 0.9520     | 0.0216  | 0.0264  | 0.9404     | 0.0366  | 0.0230  |

R code to calculate the sample size using the normal approximation $n \approx (z_{\alpha/2}\hat{\sigma}/\operatorname{margin})^2$ and then estimate the coverage of the $z$ and $t$ confidence intervals for the mean.

# the true mean and standard deviation
mu_true <- 0
sigma_true <- 1

# the coverage of the confidence intervals should be 100(1 - alpha) = 95%
alpha <- 0.05

calculate_sample_size <- function(margin, known_sigma) {
  # Use the normal approximation to choose the sample size
  z_alpha <- qnorm(1 - alpha / 2)
  ceiling((z_alpha * known_sigma / margin)^2)
}

estimate_std_dev <- function(sigma) {
  # How accurate is the "known" standard deviation?
  # Let's assume it is 2.5% higher than the true std. deviation.
  sigma * 1.025
}

get_moments <- function(distribution = c("normal", "laplace", "lognormal")) {
  if (distribution == "lognormal") {
    # The log-normal distribution is not symmetric;
    # it's skewed to the right.
    mu_pop <- exp(mu_true + sigma_true^2 / 2)
    sd_pop <- sqrt((exp(sigma_true^2) - 1) * exp(2 * mu_true + sigma_true^2))
  } else if (distribution == "laplace") {
    # The Laplace distribution is symmetric, with mean = location
    # and variance = 2 * scale^2.
    mu_pop <- mu_true
    sd_pop <- sqrt(2) * sigma_true
  } else {
    mu_pop <- mu_true
    sd_pop <- sigma_true
  }
  c(mu_pop, sd_pop)
}

coverage <- function(n, distribution = c("normal", "laplace", "lognormal")) {
  distribution <- match.arg(distribution)

  if (distribution == "lognormal") {
    x <- rlnorm(n, meanlog = mu_true, sdlog = sigma_true)
  } else if (distribution == "laplace") {
    x <- VGAM::rlaplace(n, location = mu_true, scale = sigma_true)
  } else {
    x <- rnorm(n, mean = mu_true, sd = sigma_true)
  }

  xbar <- mean(x)

  mean_stddev <- get_moments(distribution)
  mu_pop <- mean_stddev[1]
  sd_pop <- mean_stddev[2]

  sd_known <- estimate_std_dev(sd_pop)

  z_alpha <- qnorm(1 - alpha / 2)
  t_alpha <- qt(1 - alpha / 2, df = n - 1)

  c(
    abs(xbar - mu_pop) < z_alpha * sd_known / sqrt(n),
    mu_pop > xbar + z_alpha * sd_known / sqrt(n),
    mu_pop < xbar - z_alpha * sd_known / sqrt(n),

    abs(xbar - mu_pop) < t_alpha * sd(x) / sqrt(n),
    mu_pop > xbar + t_alpha * sd(x) / sqrt(n),
    mu_pop < xbar - t_alpha * sd(x) / sqrt(n)
  )
}

calculate_coverage <- function(margin_of_error, distribution) {

  mean_stddev <- get_moments(distribution)
  true_mean <- mean_stddev[1]
  true_stddev <- mean_stddev[2]
  known_stddev <- estimate_std_dev(true_stddev)

  sample_size <- calculate_sample_size(margin_of_error, known_stddev)

  nreps <- 5000
  stats <- rowMeans(replicate(nreps, coverage(sample_size, distribution)))

  data.frame(
    "distribution" = distribution,
    "mean" = true_mean,
    "std dev" = true_stddev,
    "margin of error" = margin_of_error,
    "sample size" = sample_size,
    "z_coverage" = stats[1],
    "z_lower" = stats[2],
    "z_upper" = stats[3],
    "t_coverage" = stats[4],
    "t_lower" = stats[5],
    "t_upper" = stats[6]
  )
}

set.seed(12345)

rows <- data.frame()

for (margin in c(0.5, 0.1, 0.05)) {
  for (distribution in c("normal", "laplace", "lognormal")) {
    rows <- rbind(rows, calculate_coverage(margin, distribution))
  }
}

knitr::kable(rows, format = "pipe")

dipetkov
  • 9,805

On top of @dipetkov's great and detailed answer, a (not so) small warning: these confidence intervals are exact only for normally distributed (independent) random variables, and only approximately correct otherwise when $n$ is sufficiently large (but how approximate, and how large?).

If $X_i \sim \mathcal{N}(\mu, \sigma^2)$ for all $i$, independent, then $$\overline{X_n} = \frac{1}{n} \sum_{i=1}^n X_i \sim \mathcal{N}\Big(\mu, \frac{\sigma^2}{n}\Big)$$ So indeed, if we know the standard deviation $\sigma$ in advance, a $1-\alpha$ CI for the mean estimate $\hat{\mu}$ is $\overline{x_n} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$. There are no limits involved here: if you run simulations with a fixed $n$, the fraction of experiments that end with $\mu \in CI(x_1, \dots, x_n)$ will tend to exactly $1-\alpha$ as the number of experiments tends to $\infty$. Exact math, no approximations.

When you don't know the standard deviation $\sigma$ of your normal distribution in advance, you use the Student $t$ distribution, which is, by definition, a random variable $\mathcal{T}_m$ ($m$ degrees of freedom) with the same law as $Z / \sqrt{U/m}$, where $Z \sim \mathcal{N}(0,1)$ and $U \sim \chi^2_m$ are independent. Math tells us that $$\sqrt{n}\Big(\frac{\overline{X_n} - \mu}{S_n}\Big) \sim \mathcal{T}_{n-1}$$ Hence the $1-\alpha$ CI for the estimator of $\mu$: $\overline{x_n} \pm t_{\alpha/2,n-1} \frac{s_n}{\sqrt{n}}$. Again, this is exact for all $n$, not just approximately true when $n$ is large.

Now the warning: when $X$ does not have a normal distribution, these CIs for the mean are wrong: the probability that the true $\mathbb{E}[X]$ (if it exists! see the Cauchy distribution for an example where it doesn't) lies inside the CI computed from a particular experiment of $n$ samples is not $1-\alpha$. Yes, the CLT says that as $n$ tends to $\infty$, the r.v. $\sqrt{n}\frac{\overline{X_n} - \mu}{\sigma}$ converges in distribution to $\mathcal{N}(0, 1)$, but we don't actually know from what $n$ on this approximation is "almost true" (or at least close enough that almost a fraction $1-\alpha$ of the experiments end with $\mu \in CI$). One usually says $n > 30$ as a rule of thumb, but be careful: try experiments with $n = 100$ and $X \sim \mathcal{B}(p)$, a Bernoulli distribution with $p$ varying between 0 and 1. See this article for detailed work, in particular the fun figure 3. Do reproduce it at home.
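A quick sketch of that experiment (my own code, not the article's): estimate the coverage of the $t$ interval for Bernoulli data with $n = 100$ and a small $p$, where the normal approximation is poor.

```r
set.seed(1)
n <- 100
p <- 0.05     # small p: the binomial is strongly skewed here
alpha <- 0.05

# For each replicate, check whether the t interval covers the true mean p
covered <- replicate(10000, {
  x <- rbinom(n, size = 1, prob = p)
  half <- qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)
  abs(mean(x) - p) < half
})
mean(covered)  # noticeably below the nominal 0.95
```

Rerunning with $p = 0.5$ (a symmetric binomial) brings the coverage back close to 0.95, which is exactly the point: the rule of thumb $n > 30$ depends heavily on the shape of the distribution.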

Paje
  • 31