What is the formula for calculating the minimum sample size needed to achieve a specified relative precision for a given sample statistic's standard error, at a specified level of significance?
Mostly I find formulas specific to the sample mean, but I'm looking for the textbook description of the general case. Ideally this should already cover the case where the distribution of the sample statistic is unknown and the standard error has to be found via bootstrapping.
My approach
What I tried by reverse engineering is this relation between the standard error of a sample statistic $SE(\theta)$, the relative precision $\delta$, and the significance level $\alpha$ $(\delta, \alpha \in [0,1])$:
$$ \begin{align} SE(\theta) \overset{!}{\leq} \frac{\delta \cdot \theta}{z(\alpha)} \, , \label{eq:one}\tag{1} \end{align} $$ with the critical value $z$ of a one-/two-tailed hypothesis test for a given probability distribution. Since in general $\text{SE}(\theta) \sim \frac{1}{\sqrt{n}}$, one can solve (1) for $n$ to compute the required sample size.
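To make that last step explicit, suppose $SE(\theta)$ can be written as $c_\theta/\sqrt{n}$ (or with $n-1$ in place of $n$) for a statistic-specific constant $c_\theta$; the symbol $c_\theta$ is just my shorthand here. Then (1) solved for $n$ reads, for the $\sqrt{n}$ case, $$ \begin{align*} \frac{c_\theta}{\sqrt{n}} \leq \frac{\delta \cdot \theta}{z(\alpha)} \quad\Longleftrightarrow\quad n \geq \left(\frac{c_\theta}{\theta} \cdot \frac{z(\alpha)}{\delta}\right)^2 \, . \end{align*} $$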
Example
This article summarizes the formulas for $SE(\theta)$ of the sample mean $\bar{X}$, the sample variance $S^2$ and the sample standard deviation $S$ in the case of a normally distributed data set: $$ \begin{align*} \sigma_{\bar{X}} &= \frac{\sigma}{\sqrt{n}} \label{eq:two} \tag{2} \\ \sigma_{S^2} &= \sigma^2\sqrt{\frac{2}{n-1}} \label{eq:three} \tag{3} \\ \sigma_{S} &\approx \frac{\sigma}{\sqrt{2(n-1)}} \quad \text{(for } n > 10\text{)} \, . \label{eq:four} \tag{4} \end{align*} $$
This leads to the minimum number of required samples $n$: $$ \begin{align} n_{\bar{X}} &= \left(\frac{\sigma}{\bar{X}} \cdot \frac{z_{1-\alpha/2}}{\delta}\right)^2 \label{eq:five} \tag{5} \\ n_{S^2} &= 2\left(\frac{z_{1-\alpha/2}}{\delta}\right)^2 +1 \label{eq:six} \tag{6} \\ n_{S} &= \frac{1}{2}\left(\frac{z_{1-\alpha/2}}{\delta}\right)^2 +1 \label{eq:seven} \tag{7} \end{align} $$
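As a cross-check, plugging (3) into (1) (with $z = z_{1-\alpha/2}$ for the two-tailed test) reproduces (6); note that $\sigma^2$ cancels, so the required $n$ for the variance does not depend on the scale of the data: $$ \begin{align*} \sigma^2\sqrt{\frac{2}{n-1}} \leq \frac{\delta \cdot \sigma^2}{z_{1-\alpha/2}} \quad\Longleftrightarrow\quad n \geq 2\left(\frac{z_{1-\alpha/2}}{\delta}\right)^2 + 1 \, . \end{align*} $$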
The calculation implemented in Python:
import numpy as np
from scipy import stats


def get_n(
    stat: str,
    alpha: float,
    delta: float,
    n_sides: int = 2,
    loc: float = 0.0,
    scale: float = 1.0,
) -> int:
    """
    Computes the required number of samples to draw from a normal distribution
    in order to estimate the mean, variance or standard deviation with
    specified precision and significance.

    Parameters
    ----------
    stat : str
        Specifies the statistic for which to compute the number of samples
    alpha : float
        Significance level used to compute the critical value of the
        tailed hypothesis test
    delta : float
        Required relative precision of the parameter estimation
    n_sides : int, optional
        Specifies whether a one- or two-tailed hypothesis test is used to
        compute the critical value. Default is 2.
    loc : float, optional
        Position (mean) of the normal distribution; must be non-zero for
        stat="mean". Default is 0.0
    scale : float, optional
        Width (standard deviation) of the normal distribution. Default is 1.0

    Returns
    -------
    n : int
        Number of samples
    """
    # Critical value of the one-/two-tailed hypothesis test
    z: float = stats.norm.ppf(1 - alpha / n_sides)
    # Number of samples according to equations (5)-(7)
    n: float
    if stat == "mean":
        n = ((scale * z) / (loc * delta)) ** 2
    elif stat == "var":
        n = 2 * (z / delta) ** 2 + 1
    elif stat == "std":
        n = 1.0 / 2 * (z / delta) ** 2 + 1
    else:
        raise ValueError(
            "Expected string from ['mean', 'var', 'std'] "
            f"for parameter 'stat', got '{stat}' ({type(stat).__name__})"
        )
    # Round up to the next integer
    return int(np.ceil(n))
For example, running this for the sample mean with $\delta = 0.01$, $\alpha = 0.05$ (two-tailed test) and a normal distribution centered at 42 with standard deviation 2.3:
loc = 42      # known population mean
scale = 2.3   # known population standard deviation
alpha = 0.05  # significance level
delta = 0.01  # required relative precision
stat = "mean"

n = get_n(
    stat=stat,
    alpha=alpha,
    delta=delta,
    loc=loc,
    scale=scale,
)
--> n = 116
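This matches evaluating (5) by hand with $z_{0.975} \approx 1.96$: $$ \begin{align*} n_{\bar{X}} = \left(\frac{2.3}{42} \cdot \frac{1.96}{0.01}\right)^2 \approx 115.2 \quad\Rightarrow\quad n = \lceil 115.2 \rceil = 116 \, . \end{align*} $$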
A simple simulation confirms it: draw $n$ samples with np.random.normal(loc=loc, scale=scale, size=n), compute the sample mean with np.mean(), and compare it to the population mean loc.
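A minimal sketch of such a check (the repetition count of 1000 and the fixed seed are arbitrary choices on my part): with $n = 116$, the relative deviation of the sample mean should stay below $\delta = 1\%$ in roughly $1 - \alpha = 95\%$ of the runs.

import numpy as np

loc, scale = 42.0, 2.3   # known population mean and standard deviation
delta, n = 0.01, 116     # required relative precision and computed sample size
n_rep = 1000             # number of repeated experiments (arbitrary choice)

np.random.seed(0)
samples = np.random.normal(loc=loc, scale=scale, size=(n_rep, n))
means = np.mean(samples, axis=1)

# Fraction of runs in which the sample mean lies within delta of the true mean
coverage = np.mean(np.abs(means - loc) / loc <= delta)
print(f"relative deviation <= {delta:.0%} in {coverage:.1%} of runs")  # roughly 95%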
