
I conducted a simulation that had baffling results and would love your help understanding.

Context: I want to estimate the required sample size to detect a difference in proportions for a sequential, online experiment. Users land on my web page one by one, and 50% of them convert on some action. I believe that on my new web page 70% will convert. I will run an A/B test where some traffic gets the status quo page (control) and the rest gets the new variant. What sample size do I need in each group? power.prop.test(p1=0.7, p2 = 0.5, power = .80, sig.level = 0.05) tells me I need ~93 users per group to have 80% power to detect the difference.

Methods: I generated 10,000 sequences of 150 visitors each. Visitors in the null sequences had a 50% probability of converting; in the variant sequences they had a 70% probability. For each sequence, I calculated the cumulative proportion (trace) of successes, and I then computed the 2.5% and 97.5% quantiles of these traces at each index.

Questions:

  1. The 2.5% quantile of the 70% band and the 97.5% quantile of the 50% band cross at exactly 93 (if you run 50,000 simulations). This is, presumably, the sample size I'd feel comfortable with because the distributions would be sufficiently different at this point. Why would these lines cross at the power=80% sample size?
  2. I tried calculating power from the proportion of 50% sequences that land above the 2.5% quantile of the 70% variant. At N = 93 I expected a miss rate of about 20% (i.e. 80% power), but instead I get a power of 98.61%, which is the power of a one-sided binomial test at this sample size. What's going on here? I have no intuition for why this is the case, or really why the binomial test differs from prop.test (a short comparison of the two follows this list).
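To put numbers on that gap, here is a rough sketch comparing the exact power of a one-sided, one-sample binomial test (the 0.5 control rate treated as known) with the two-sample power from power.prop.test; qbinom/pbinom are used only to get the exact rejection cutoff:

# One-sample view: test H0: p = 0.5 against a true rate of 0.7 with n = 93,
# treating the control rate as known rather than estimated
n <- 93
crit <- qbinom(0.95, n, 0.5) + 1              # smallest success count rejected at one-sided alpha = 0.05
pbinom(crit - 1, n, 0.5, lower.tail = FALSE)  # achieved alpha, just under 0.05
pbinom(crit - 1, n, 0.7, lower.tail = FALSE)  # power, roughly 0.98-0.99

# Two-sample view: the 0.5 rate must itself be estimated from 93 control users
power.prop.test(n = 93, p1 = 0.7, p2 = 0.5, sig.level = 0.05)$power  # about 0.80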

[Plot: "95% percentiles for averages of sequences", quantile bands for the 0.5 and 0.7 variants with a vertical line at N = 93]

Simulation code:

library(ggplot2)
library(purrr)
library(dplyr)

generate_sequence <- function(n_trials, prob){
  s <- rbinom(n_trials, 1, prob = prob)
  cum_mean <- cumsum(s) / 1:n_trials
  return(cum_mean)
}

generate_sims <- function(n_trials, prob, n_sims){
  sims <- map_df(1:n_sims, function(i) {
    tibble(sim = i, prob = prob, idx = 1:n_trials,
           cum_mean = generate_sequence(n_trials, prob = prob))
  })
  return(sims)
}

generate_quantiles <- function(sims){
  q <- sims %>%
    group_by(idx, prob) %>%
    summarize(
      p025 = quantile(cum_mean, probs = 0.025),
      # p500 = quantile(p, probs = 0.500),
      p500 = mean(cum_mean),
      p975 = quantile(cum_mean, probs = 0.975)
    ) %>%
    ungroup()
  return(q)
}
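The main() wrapper called in the next block isn't defined above; a minimal version, consistent with how its $sims and $q elements are used later, would be:

main <- function(n_trials, prob, n_sims){
  sims <- generate_sims(n_trials = n_trials, prob = prob, n_sims = n_sims)
  q <- generate_quantiles(sims)
  return(list(sims = sims, q = q))
}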

Requires sample size of 93

power.prop.test(p1=0.7, p2 = 0.5, power = .80, sig.level = 0.05)

Conduct Simulation

set.seed(123)
n_trials <- 150
n_sims <- 10000
Ho <- main(n_trials = n_trials, prob = 0.5, n_sims = n_sims)
Ha <- main(n_trials = n_trials, prob = 0.7, n_sims = n_sims)

Plot simulation

Quantiles

q <- Ho$q %>% bind_rows(Ha$q)
ggplot(q, aes(x = idx, ymin = p025, y = p500, ymax = p975,
              color = as.factor(prob), fill = as.factor(prob))) +
  geom_ribbon(alpha = 0.1) +
  geom_line() +
  theme_minimal() +
  geom_vline(xintercept = 93) +
  annotate(geom = 'text', x = 100, y = .9, label = "N = 93 for 80% power, alpha=5%") +
  labs(
    title = "95% percentiles for averages of sequences",
    y = "Mean of sequence",
    x = "Index in sequence",
    fill = "Distribution",
    color = "Distribution")

Identify the first index where the 2.5th percentile of the 0.7 distribution is greater than the 97.5th percentile of the 0.5 distribution. This gives a sample size of 92 to 94, which aligns with the 93 samples required for 80% power and 5% alpha under power.prop.test.

which(Ha$q$p025 > Ho$q$p975)[1:20]

Binomial test: my simulation reproduces the power of the "gr" (greater, one-sided) binomial test. I guess "gr" makes sense because I'm taking 2.5% from the Null and 2.5% from the Alternative. See https://stats.stackexchange.com/a/550865/158148.

set.seed(123)
pv <- replicate(10^5, binom.test(x = rbinom(1, 92, .70), n = 92, p = 0.5, alt = "gr")$p.val)
mean(pv <= .05)  # power = 0.9861

Power calculated from the simulation: I can calculate power as one minus the fraction of Null simulations greater than the 2.5% cutoff of the alternative distribution.

cutoffs <- tibble(idx = 1:n_trials, cutoff = Ha$q$p025)
x <- Ho$sims %>% left_join(cutoffs, by = 'idx')
x %>%
  group_by(idx) %>%
  summarize(
    # mistaken acceptance of the null hypothesis as the result of a test procedure, false negative
    beta = mean(cum_mean > cutoff),
    power = 1 - mean(cum_mean > cutoff)
  ) %>%
  filter(idx >= 88)

  idx   beta power
<int>  <dbl> <dbl>
   88 0.0221 0.978
   89 0.017  0.983
   90 0.0222 0.978
   91 0.0177 0.982
   92 0.0136 0.986
   93 0.0184 0.982   <--- Same number as the power of the binomial test
   94 0.0147 0.985
   95 0.0117 0.988
   96 0.0158 0.984
   97 0.0127 0.987

Edit 1

In the following simulation, I calculate the power of several statistical tests (binomial, chisq.test, and prop.test). I'm able to back out the 80% power for prop.test, so that now makes sense.

But my confusion is this: in my "quantiles" test I obtain n_sims proportions from the Ha distribution at a sample size of N = 93, calculate the lower 2.5% quantile of that distribution, and count the fraction of null draws above it, which simulates the false-negative rate. From this I get a power of 0.9809, which is very similar to the power of the binomial test.

  1. Why does using the quantiles in this way reflect the power of the binomial test?
  2. My original question: My quantiles simulations above (see plot) show that at N=93 the 95% CIs for the Ho and Ha distributions no longer intersect. If comparing non-overlapping quantiles reflects the power of an exact test, why do they intersect perfectly where I'd achieve 80% power for an approximate test?
library(dplyr)
library(purrr)
set.seed(1)

power.prop.test(p1 = 0.7, p2 = 0.5, sig.level = 0.05, power = .80)

n <- 93
n_sims <- 10000

Quantiles test

g1 <- replicate(n_sims, mean(rbinom(n, 1, 0.7)))
g0 <- replicate(n_sims, mean(rbinom(n, 1, 0.5)))
g1_025 <- quantile(g1, 0.025)
(power <- 1 - mean(g0 > g1_025))  # 0.9809
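As a quick check on what that cutoff actually is: with this many replications, the empirical 2.5% quantile of g1 sits essentially at the exact binomial quantile for a fixed proportion of 0.7 (reusing n and g1 from the block above):

quantile(g1, 0.025)        # empirical 2.5% cutoff from the simulated proportions
qbinom(0.025, n, 0.7) / n  # exact binomial quantile for p = 0.7, nearly identical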

get_p <- function(){
  s1 <- rbinom(1, n, 0.7)
  s0 <- rbinom(1, n, 0.5)
  mat <- matrix(data = c(s0, n - s0, s1, n - s1), ncol = 2)

  # compare the different tests
  p_b   <- binom.test(x = s1, n = n, p = 0.5)$p.val
  p_bgr <- binom.test(x = s1, n = n, p = 0.5, alt = 'gr')$p.val
  p_c   <- chisq.test(mat, correct = T)$p.val
  p_cc  <- chisq.test(mat, correct = F)$p.val
  p_p   <- prop.test(x = c(s0, s1), n = c(n, n), correct = T)$p.val
  p_pc  <- prop.test(x = c(s0, s1), n = c(n, n), correct = F)$p.val
  tibble(p_b, p_bgr, p_c, p_cc, p_p, p_pc)
}

p_values <- map_df(1:n_sims, ~ get_p())
fail_to_reject <- p_values > 0.05
beta <- apply(fail_to_reject, 2, mean)
power <- 1 - beta

   p_b  p_bgr    p_c   p_cc    p_p   p_pc
0.9719 0.9898 0.7495 0.7944 0.7495 0.7944

  • When you take thousands of replications for g1, your empirical quantiles will essentially match the theoretical quantiles for the fixed proportion of 0.7. So it's as if you're comparing g0 to a fixed proportion, which is a binomial test. – num_39 Mar 07 '22 at 07:20
  • " If comparing non-overlapping quantiles reflects the power of an exact test, why do they intersect perfectly where I'd achieve 80% power for an approximate test?" Non-overlapping quantiles are not an exact test of power. For the relation between confidence intervals and tests of significance, see https://towardsdatascience.com/why-overlapping-confidence-intervals-mean-nothing-about-statistical-significance-48360559900a
  • – num_39 Mar 07 '22 at 07:37
  • 1) Good point, thanks! 2) Thanks - that last point reflected my true misunderstanding. This caused me to re-run the above simulation with different alpha levels. When I do that, the intervals don't "stop overlapping" at 80% power, so my theory was just a coincidence. Instead, I should probably simulate the confidence interval of the difference (sketched below). – RAFisherman Mar 10 '22 at 05:44
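A rough sketch of that last idea, simulating the 95% confidence interval of the difference in proportions at n = 93 and counting how often it excludes zero; this should land near the approximate two-sample (prop.test) power above rather than the ~0.98 from the quantile comparison:

set.seed(1)
n <- 93
n_sims <- 10000
ci_excludes_zero <- replicate(n_sims, {
  x1 <- rbinom(1, n, 0.7)
  x0 <- rbinom(1, n, 0.5)
  ci <- prop.test(x = c(x1, x0), n = c(n, n))$conf.int  # CI for the difference in proportions
  ci[1] > 0 | ci[2] < 0
})
mean(ci_excludes_zero)  # close to the prop.test power above, far from 0.98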