
I conducted a simulation that had baffling results and would love your help understanding.

Context: I want to estimate the required sample size to detect a difference in proportions for a sequential, online experiment. Users land on my web page one by one, and 50% of them convert on some action. I believe that on my new web page 70% will convert. I will run an A/B test where some traffic gets the status quo page (control) and the rest gets the new variant. What sample size do I need in each group? power.prop.test(p1=0.7, p2 = 0.5, power = .80, sig.level = 0.05) tells me I need ~93 users per group to have 80% power to detect the difference.

Methods: I generated 10,000 sequences of 150 visitors each. Visitors in the null sequences had a 50% probability of converting; in the variant sequences they had a 70% probability. For each sequence, I calculated the cumulative proportion (trace) of successes, and I then computed the 2.5% and 97.5% quantiles of these traces at each index.

Questions:

  1. The 2.5% quantile of the 70% band and the 97.5% quantile of the 50% band cross at exactly 93 (if you run 50,000 simulations). This is, presumably, the sample size I'd feel comfortable with because the distributions would be sufficiently different at this point. Why would these lines cross at the power=80% sample size?
  2. I tried calculating power from the proportion of 50% sequences that land above the 2.5% quantile of the 70% variant. At N = 93 I expected a miss rate of about 20% (i.e. 80% power), but instead I get a power of 98.61%, which is the power of a one-sided binomial test at this sample size. What's going on here? I have no intuition for why this is the case, or really why the binomial test differs from prop.test (a short comparison of the two follows this list).
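To put numbers on that gap, here is a rough sketch comparing the exact power of a one-sided, one-sample binomial test (the 0.5 control rate treated as known) with the two-sample power from power.prop.test; qbinom/pbinom are used only to get the exact rejection cutoff:

# One-sample view: test H0: p = 0.5 against a true rate of 0.7 with n = 93,
# treating the control rate as known rather than estimated
n <- 93
crit <- qbinom(0.95, n, 0.5) + 1              # smallest success count rejected at one-sided alpha = 0.05
pbinom(crit - 1, n, 0.5, lower.tail = FALSE)  # achieved alpha, just under 0.05
pbinom(crit - 1, n, 0.7, lower.tail = FALSE)  # power, roughly 0.98-0.99

# Two-sample view: the 0.5 rate must itself be estimated from 93 control users
power.prop.test(n = 93, p1 = 0.7, p2 = 0.5, sig.level = 0.05)$power  # about 0.80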

[Plot: "95% percentiles for averages of sequences", quantile bands for the 0.5 and 0.7 variants with a vertical line at N = 93]

Simulation code:

library(ggplot2)
library(purrr)
library(dplyr)

generate_sequence <- function(n_trials, prob){
  s <- rbinom(n_trials, 1, prob = prob)
  cum_mean <- cumsum(s) / 1:n_trials
  return(cum_mean)
}

generate_sims <- function(n_trials, prob, n_sims){
  sims <- map_df(1:n_sims, function(i) {
    tibble(sim = i, prob = prob, idx = 1:n_trials,
           cum_mean = generate_sequence(n_trials, prob = prob))
  })
  return(sims)
}

generate_quantiles <- function(sims){
  q <- sims %>%
    group_by(idx, prob) %>%
    summarize(
      p025 = quantile(cum_mean, probs = 0.025),
      # p500 = quantile(p, probs = 0.500),
      p500 = mean(cum_mean),
      p975 = quantile(cum_mean, probs = 0.975)
    ) %>%
    ungroup()
  return(q)
}
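The main() wrapper called in the next block isn't defined above; a minimal version, consistent with how its $sims and $q elements are used later, would be:

main <- function(n_trials, prob, n_sims){
  sims <- generate_sims(n_trials = n_trials, prob = prob, n_sims = n_sims)
  q <- generate_quantiles(sims)
  return(list(sims = sims, q = q))
}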

Requires sample size of 93

power.prop.test(p1=0.7, p2 = 0.5, power = .80, sig.level = 0.05)

Conduct Simulation

set.seed(123)
n_trials <- 150
n_sims <- 10000
Ho <- main(n_trials = n_trials, prob = 0.5, n_sims = n_sims)
Ha <- main(n_trials = n_trials, prob = 0.7, n_sims = n_sims)

Plot simulation

Quantiles

q <- Ho$q %>% bind_rows(Ha$q)
ggplot(q, aes(x = idx, ymin = p025, y = p500, ymax = p975,
              color = as.factor(prob), fill = as.factor(prob))) +
  geom_ribbon(alpha = 0.1) +
  geom_line() +
  theme_minimal() +
  geom_vline(xintercept = 93) +
  annotate(geom = 'text', x = 100, y = .9, label = "N = 93 for 80% power, alpha=5%") +
  labs(
    title = "95% percentiles for averages of sequences",
    y = "Mean of sequence",
    x = "Index in sequence",
    fill = "Distribution",
    color = "Distribution")

Identify the first index where the 2.5th percentile of the 0.7 distribution is greater than the 97.5th percentile of the 0.5 distribution. This gives a sample size of 92 to 94, which aligns with the 93 samples required for 80% power and 5% alpha under power.prop.test.

which(Ha$q$p025 > Ho$q$p975)[1:20]

Binomial test: my simulation reproduces the power of the "gr" (greater, one-sided) binomial test. I guess "gr" makes sense because I'm taking 2.5% from the Null and 2.5% from the Alternative. See https://stats.stackexchange.com/a/550865/158148.

set.seed(123)
pv <- replicate(10^5, binom.test(x = rbinom(1, 92, .70), n = 92, p = 0.5, alt = "gr")$p.val)
mean(pv <= .05)  # power = 0.9861

Power calculated from the simulation: I can calculate power as one minus the fraction of Null simulations greater than the 2.5% cutoff of the alternative distribution.

cutoffs <- tibble(idx = 1:n_trials, cutoff = Ha$q$p025)
x <- Ho$sims %>% left_join(cutoffs, by = 'idx')
x %>%
  group_by(idx) %>%
  summarize(
    # mistaken acceptance of the null hypothesis as the result of a test procedure, false negative
    beta = mean(cum_mean > cutoff),
    power = 1 - mean(cum_mean > cutoff)
  ) %>%
  filter(idx >= 88)

  idx   beta power
<int>  <dbl> <dbl>
   88 0.0221 0.978
   89 0.017  0.983
   90 0.0222 0.978
   91 0.0177 0.982
   92 0.0136 0.986
   93 0.0184 0.982   <--- Same number as the power of the binomial test
   94 0.0147 0.985
   95 0.0117 0.988
   96 0.0158 0.984
   97 0.0127 0.987

Edit 1

In the following simulation, I calculate the power of several statistical tests (binomial, chisq.test, and prop.test). I'm able to back out the 80% power for prop.test, so that now makes sense.

But my confusion is this: in my "quantiles" test I obtain n_sims proportions from the Ha distribution at a sample size of N = 93, calculate the lower 2.5% quantile of that distribution, and count the fraction of null draws above it, which simulates the false-negative rate. From this I get a power of 0.9809, which is very similar to the power of the binomial test.

  1. Why does using the quantiles in this way reflect the power of the binomial test?
  2. My original question: My quantiles simulations above (see plot) show that at N=93 the 95% CIs for the Ho and Ha distributions no longer intersect. If comparing non-overlapping quantiles reflects the power of an exact test, why do they intersect perfectly where I'd achieve 80% power for an approximate test?
library(dplyr)
library(purrr)
set.seed(1)

power.prop.test(p1 = 0.7, p2 = 0.5, sig.level = 0.05, power = .80)

n <- 93
n_sims <- 10000

Quantiles test

g1 <- replicate(n_sims, mean(rbinom(n, 1, 0.7)))
g0 <- replicate(n_sims, mean(rbinom(n, 1, 0.5)))
g1_025 <- quantile(g1, 0.025)
(power <- 1 - mean(g0 > g1_025))  # 0.9809
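As a quick check on what that cutoff actually is: with this many replications, the empirical 2.5% quantile of g1 sits essentially at the exact binomial quantile for a fixed proportion of 0.7 (reusing n and g1 from the block above):

quantile(g1, 0.025)        # empirical 2.5% cutoff from the simulated proportions
qbinom(0.025, n, 0.7) / n  # exact binomial quantile for p = 0.7, nearly identical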

get_p <- function(){
  s1 <- rbinom(1, n, 0.7)
  s0 <- rbinom(1, n, 0.5)
  mat <- matrix(data = c(s0, n - s0, s1, n - s1), ncol = 2)

  # compare the different tests
  p_b   <- binom.test(x = s1, n = n, p = 0.5)$p.val
  p_bgr <- binom.test(x = s1, n = n, p = 0.5, alt = 'gr')$p.val
  p_c   <- chisq.test(mat, correct = T)$p.val
  p_cc  <- chisq.test(mat, correct = F)$p.val
  p_p   <- prop.test(x = c(s0, s1), n = c(n, n), correct = T)$p.val
  p_pc  <- prop.test(x = c(s0, s1), n = c(n, n), correct = F)$p.val
  tibble(p_b, p_bgr, p_c, p_cc, p_p, p_pc)
}

p_values <- map_df(1:n_sims, ~ get_p())
fail_to_reject <- p_values > 0.05
beta <- apply(fail_to_reject, 2, mean)
power <- 1 - beta

   p_b  p_bgr    p_c   p_cc    p_p   p_pc
0.9719 0.9898 0.7495 0.7944 0.7495 0.7944

  • When you take thousands of replications for g1, your empirical quantiles will essentially match the theoretical quantiles for the fixed proportion of 0.7. So it's as if you're comparing g0 to a fixed proportion, which is a binomial test. – num_39 Mar 07 '22 at 07:20
  • " If comparing non-overlapping quantiles reflects the power of an exact test, why do they intersect perfectly where I'd achieve 80% power for an approximate test?" Non-overlapping quantiles are not an exact test of power. For the relation between confidence intervals and tests of significance, see https://towardsdatascience.com/why-overlapping-confidence-intervals-mean-nothing-about-statistical-significance-48360559900a
  • – num_39 Mar 07 '22 at 07:37
  • 1) Good point, thanks! 2) Thanks - that last point reflected my true misunderstanding. This caused me to re-run the above simulation with different alpha levels. When I do that, the intervals don't "stop overlapping" at 80% power, so my theory was just a coincidence. Instead, I should probably simulate the confidence interval of the difference (sketched below). – RAFisherman Mar 10 '22 at 05:44
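A rough sketch of that last idea, simulating the 95% confidence interval of the difference in proportions at n = 93 and counting how often it excludes zero; this should land near the approximate two-sample (prop.test) power above rather than the ~0.98 from the quantile comparison:

set.seed(1)
n <- 93
n_sims <- 10000
ci_excludes_zero <- replicate(n_sims, {
  x1 <- rbinom(1, n, 0.7)
  x0 <- rbinom(1, n, 0.5)
  ci <- prop.test(x = c(x1, x0), n = c(n, n))$conf.int  # CI for the difference in proportions
  ci[1] > 0 | ci[2] < 0
})
mean(ci_excludes_zero)  # close to the prop.test power above, far from 0.98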