I've been watching this question hoping someone would step in and drop some statistical theory for us but until then I can point to a possible applied solution to your problem.
I mentioned in my comment that the $\sqrt{(N-n)/(N-1)}$ term is also known as the finite population correction (fpc) factor. The name is pretty self explanatory: it's used to correct the estimate when the population is finite. I found this answer gave a ton of intuition for the fpc: it arises because sampling without replacement from a finite population follows a hypergeometric distribution rather than a binomial.
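To get a feel for how the fpc behaves, here's a small sketch in plain Julia (no packages) showing the factor shrinking toward zero as the sample size approaches the population size; the particular values of $N$ and $n$ are arbitrary choices for illustration:

```julia
# Finite population correction factor: sqrt((N - n) / (N - 1))
fpc(N, n) = sqrt((N - n) / (N - 1))

N = 100                     # total (finite) population, arbitrary for illustration
for n in (10, 50, 90, 100)  # sample sizes up to the full population
    println("n = $n  =>  fpc = $(round(fpc(N, n), digits=3))")
end
# The factor is near 1 for small samples (little correction needed)
# and reaches 0 when n == N: a census has no sampling error.
```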
If we assume the population is infinite, the classical sample-size estimate no longer depends on the fpc and is:
$$
n = \frac{p(1-p)}{\text{SE}^2}
$$
For this case, let's say we want to estimate the true proportion $p$ to a standard error of no worse than $0.05$ (5 percentage points), and to be conservative we choose $p = 0.5$, which maximizes $p(1-p)$. Then we would have the following (code in Julia):
se = 0.05
p = 0.5
n = p * (1 - p) / se^2
n
## 100
Now we can see the effect that a finite population has on this estimated sample size as well. Given the formula from your question we could solve this for $n$ and we'd have the following:
$$
n = \frac{N}{\frac{\text{SE}^2(N-1)}{p(1-p)} + 1}
$$
If we reuse our chosen $\text{SE}$ and $p$ and assume we know the total population and it is $N = 100$, then the estimated sample size would be:
N = 100
N / ((N - 1) * (se)^2 / (p * (1-p)) + 1)
## 50.25
This definitely jibes with our intuition that we shouldn't need as many samples when the population is small.
Getting back to your question, there are a couple ways I could think of propagating the uncertainty of the estimated $\hat N$:
- If we are dealing with a linear function then we could use linear (first-order) uncertainty propagation. If you're familiar with the Julia programming language, the Measurements.jl package makes this very easy.
- If we are dealing with a non-linear function then we could use Monte Carlo sampling instead. In Julia, the MonteCarloMeasurements.jl package has this implemented.
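For intuition on what the first option does under the hood, here's a hand-rolled sketch of first-order (delta-method) propagation without Measurements.jl: approximate the function by its tangent line at the mean of $\hat N$ and scale the input standard deviation by the slope. The finite-difference step size is an arbitrary choice:

```julia
se, p = 0.05, 0.5
f(N) = N / ((N - 1) * se^2 / (p * (1 - p)) + 1)  # sample size as a function of N

# First-order (delta method) propagation: sigma_n ~ |f'(N)| * sigma_N
N_mean, N_sd = 100.0, 25.0
h = 1e-6                                           # finite-difference step (arbitrary)
fprime = (f(N_mean + h) - f(N_mean - h)) / (2h)    # central-difference derivative
n_mean = f(N_mean)
n_sd = abs(fprime) * N_sd
println("n ≈ $(round(n_mean, digits=2)) ± $(round(n_sd, digits=2))")
```

Because $f$ is non-linear here, this linear approximation (about $\pm 6.25$) slightly understates the spread, which is exactly why the simulation route is preferable in this case.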
We can verify if the function we're working with is linear or non-linear with respect to $N$ by plotting the function:
using CairoMakie
xs = 0:0.01:600
ys = @. xs / ((xs - 1) * (se)^2 / (p * (1-p)) + 1)
lines(xs, ys; axis=(; xlabel=L"N", ylabel=L"n"))

The above plot shows us this function is non-linear with respect to $N$ and is approaching $100$ as $N \to \infty$. Since it's non-linear we should make use of simulations to better understand $n$ given $\hat N$:
using MonteCarloMeasurements
N_est = 100 ± 25 # Construct Gaussian uncertain parameter
n_est = N_est / ((N_est - 1) * (se)^2 / (p * (1-p)) + 1)
n_est
## Particles{Float64, 2000}
## 49.4265 ± 6.69
The result above can be a bit misleading: it reports a $\pm$, but that does not mean the distribution is symmetric, so it's good to plot the raw samples to get a better idea of what's going on:
samp_quant = pquantile(n_est, [0.025, 0.5, 0.975])
hist(Array(n_est), bins = 50, axis = (;xlabel=L"n", title = "Samples of estimated sample size after population uncertainty propagation"))
vlines!(samp_quant, color = :black, linestyle = :dash)
current_figure()

So via simulation we have a distribution for $n$, and we can now see this distribution is a bit skewed.
If we hadn't taken the uncertainty of $\hat N$ into account, we would have chosen $n \approx 50$ (note that's our median as well), but given that we can do uncertainty propagation we may want a more conservative choice of $n \approx 60$.
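As a sanity check on the package output, the same propagation can be hand-rolled with plain Julia and the standard library; the sample count and seed are arbitrary choices:

```julia
using Random, Statistics

Random.seed!(42)  # arbitrary seed, for reproducibility
se, p = 0.05, 0.5
f(N) = N / ((N - 1) * se^2 / (p * (1 - p)) + 1)

# Draw N ~ Normal(100, 25) and push each draw through f
Ns = 100 .+ 25 .* randn(100_000)
ns = f.(Ns)

quantile(ns, [0.025, 0.5, 0.975])  # roughly [34, 50, 60], matching the skewed histogram
```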
Another thing this answer doesn't address is power, but since your question didn't mention it I'll just leave a reference to Chapter 16 (Design and Sample Size Decisions) in Regression and Other Stories by Andrew Gelman, Jennifer Hill, and Aki Vehtari. The PDF is free for personal use and can be found at the following link.