
I want to simulate/show the idea of confidence intervals but am not entirely sure if my approach is correct.

If I'm not mistaken, a confidence interval can be understood as follows: if we were to draw many samples, a certain proportion of them (e.g. 95%) would produce confidence intervals that contain the true population parameter.

I want to demonstrate this idea as a function of two parameters: the number of samples I draw and the sample size.

My approach would be:

  1. Create a vector of number of samples and sample sizes
  2. Create some random population data
  3. Calculate the true population mean
  4. Take one of the sample sizes and loop over all numbers of samples. For each drawn sample, calculate its mean and the standard error of that mean, then check whether the true population mean lies within the sample mean +/- 1.96 times that SE. Finally, calculate the proportion of samples for which this check is TRUE.
  5. Repeat step 4 for all other sample sizes.

And my code for that would be:

library(tidyverse)

m0     <- 2.5
std.d0 <- 1
pop    <- rnorm(1e6, m0, std.d0)         # random population data
m      <- mean(pop)                      # true population mean
std.d  <- sqrt(sum((pop - m)^2) / 1e6)   # population standard deviation

n     <- as.integer(c(5, 10, 15, 20, 30, 50, 100, 200, 500))   # sample sizes
times <- as.integer(c(3, 5, 10, 30, 100, 200))                 # numbers of samples drawn

# For each entry in `times`: draw that many samples of size `n`,
# compute each sample's mean and standard error, and return the
# proportion of 95% CIs that contain the true population mean.
sampling <- function(pop, n, pop_mean, times) {
  map(.x = times,
      .f = ~ replicate(.x, sample(pop, size = n, replace = FALSE)) %>%
        as.data.frame() %>%                  # one column per drawn sample
        pivot_longer(cols = everything()) %>%
        group_by(name) %>%                   # one group per sample
        summarize(draw_mean = mean(value),
                  draw_se   = sd(value) / sqrt(n),
                  hit       = between(pop_mean,
                                      draw_mean - 1.96 * draw_se,
                                      draw_mean + 1.96 * draw_se),
                  .groups   = "drop") %>%
        summarize(perc_hit = mean(hit), n = n, rep = .x)) %>%
    bind_rows()
}

results <- map(.x = n,
               .f = ~ sampling(pop = pop, n = .x, pop_mean = m, times = times)) %>%
  bind_rows() %>%
  print(n = Inf)

results %>%
  mutate(rep = as.factor(rep), n = as.factor(n)) %>%
  ggplot() +
  geom_point(aes(x = rep, y = perc_hit, color = n),
             position = position_jitter(width = 0.1))

deschen
  • We have plenty of confidence interval studies, with code, posted as answers. I found one of mine at https://stats.stackexchange.com/a/248437/919 for instance. And here's another one, done long ago, for an Excel user(!): https://stats.stackexchange.com/a/61478/919. The key idea is to plot your results rather than relying on the statistical summaries alone. – whuber Sep 07 '23 at 22:27
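To make whuber's plotting suggestion concrete, here is a minimal sketch (my own illustration, not code from the linked answers): draw 100 samples of size 30 from the same N(2.5, 1) population and plot each interval, coloured by whether it covers the true mean. The sample count and size here are arbitrary choices for the illustration.

library(tidyverse)

set.seed(1)
cis <- map_dfr(1:100, function(i) {
  s  <- rnorm(30, mean = 2.5, sd = 1)      # one sample of size 30
  se <- sd(s) / sqrt(30)
  tibble(id  = i,
         lo  = mean(s) - 1.96 * se,
         hi  = mean(s) + 1.96 * se,
         hit = lo <= 2.5 & 2.5 <= hi)      # does the CI cover the true mean?
})

ggplot(cis, aes(x = id, ymin = lo, ymax = hi, color = hit)) +
  geom_linerange() +
  geom_hline(yintercept = 2.5, linetype = "dashed")   # true population mean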
  • "the number of samples I draw " ... this only impacts the sampling error in your calculation of the coverage; you can calculate the sampling error in that proportion directly from the binomial and choose a large one that is sufficiently accurate for your needs. 2. I strongly agree with whuber's suggestion of plotting the results; suitably chosen displays are very important for this sort of exercise in conveying intuition.
  • – Glen_b Sep 08 '23 at 00:16
  • Re my 1. ... unless the purpose was to illustrate how sampling error in the binomial proportion works, but presumably you establish that before tackling CIs (or indeed significance levels or p-values in hypothesis tests), given how central the binomial is to such calculations. – Glen_b Sep 08 '23 at 00:25
  • Thanks for your input. Actually, plotting the results was part of my code, but I omitted that from my question to focus on the relevant simulation part. However, I updated my question with the plotting part. – deschen Sep 08 '23 at 05:45
  • @Glen_b: not 100% sure if I understand you correctly, but looking at the results I think your message is that the sample size parameter is irrelevant for my simulation since it only changes the standard error? And indeed, with a small number of repetitions it doesn't matter if I have a small or large sample size - in both cases I'm not getting close to 95% of the CIs encompassing the true mean. That's interesting and counterintuitive to what I thought. – deschen Sep 08 '23 at 05:48
  • I'm also grateful for the hint to other, already asked questions and I've read a few, but the reason for my question is that I'm still not entirely sure if my approach is correct. To be honest, I would have thought that my results would be closer to exactly 95% with a sufficiently large number of repetitions (e.g. 200). But even then, I'm often getting 97% or so. – deschen Sep 08 '23 at 05:56
  • You're estimating a proportion (in this case, coverage). The number of samples you use doesn't change the population proportion you're estimating, just the accuracy with which you estimate it. As long as you use a large number of samples, that simulation error is small. You can compute the standard error of the estimated proportion as accurately as you require by choosing how large that number of simulations is. – Glen_b Sep 08 '23 at 06:54
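A quick sketch of the arithmetic behind this comment (my addition, assuming a true coverage of 0.95 and the 200 replications mentioned above):

p     <- 0.95                        # true coverage being estimated
times <- 200                         # number of simulated samples
se    <- sqrt(p * (1 - p) / times)   # binomial SE of the estimated coverage
se                                   # ~0.0154, i.e. about 1.5 percentage points

So with 200 replications, an observed coverage of 0.97 is well within two standard errors of 0.95, which matches what deschen reports.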
  • That can all be done without simulation, just using straightforward margin of error type calculations. https://en.wikipedia.org/wiki/Margin_of_error ... then once you choose a number of samples that meets your needs, that's how many simulations to do. – Glen_b Sep 08 '23 at 07:07
  • Yes, if I were only interested in the specific number of samples, I could look at the MOE, e.g. with a confidence interval of 95% and a desired error around that of 1%, I'd need to draw 1825 samples (if I got my math right). The idea of my simulation was more about showing that we are getting closer to 95% the more samples we draw. And it's also part of a coding exercise. – deschen Sep 08 '23 at 08:36
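For reference, a one-line check of the margin-of-error calculation deschen describes (my sketch, assuming p = 0.95, a desired margin of error of 1 percentage point, and z = 1.96):

p <- 0.95
E <- 0.01
z <- 1.96
ceiling(z^2 * p * (1 - p) / E^2)   # 1825 samples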