
I frequently use the bootstrap percentile method for hypothesis testing (a two-tailed test for independent samples, where H0 is that the marketing KPI of group A minus the KPI of group B equals 0, and H1 is that it does not) and build confidence intervals to assess statistical significance. Before starting a test, my stakeholders always ask me for some hint on whether the test is likely to come out statistically significant. To do so, I was thinking of running a power analysis to get the minimum sample size required for statistical significance.
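For context, the testing procedure I describe above can be sketched roughly as follows (the function name and the toy data are mine, just for illustration; the real inputs would be the per-user KPI values of each group):

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_diff_ci(a, b, n_boot=1000, alpha=0.1):
    """Percentile-bootstrap CI for mean(a) - mean(b), independent samples."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # resample each group with replacement, same size as the original
        diffs[i] = (rng.choice(a, size=len(a), replace=True).mean()
                    - rng.choice(b, size=len(b), replace=True).mean())
    return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# toy data standing in for the two groups' KPI values
a = rng.normal(0.33, 0.10, size=500)
b = rng.normal(0.30, 0.10, size=500)
lo, hi = bootstrap_diff_ci(a, b)
# H0 (difference = 0) is rejected at level alpha if the interval excludes 0
```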

My doubts are:

  1. I'm doing the power analysis in Python with this function (https://www.statsmodels.org/dev/generated/statsmodels.stats.power.TTestIndPower.solve_power.html#statsmodels.stats.power.TTestIndPower.solve_power), which is based on the t-test. Is that correct (given the "conceptual misalignment" with the method I'm using for hypothesis testing), or should I do the power analysis through bootstrapping instead? Here is the Python code I use:
import statsmodels.stats.api as sms
from math import ceil

old_metric = 0.3
new_metrics = [0.33]

sample_size_values = []
for new_metric in new_metrics:
    required_n = sms.TTestIndPower().solve_power(
        effect_size=sms.proportion_effectsize(old_metric, new_metric),
        power=0.8,
        alpha=0.1,
        ratio=1,
    )
    sample_size_values.append(ceil(required_n))

  2. If bootstrapping is needed instead, I would have to build a function from scratch (as no pre-built one is available in Python) to estimate the power, and from there work backwards to the sample size needed for that power. I'll take inspiration from this study: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5340517/#:~:text=Power%20calculation%20by%20bootstrap%20is,to%20analytic%20calculation%20or%20simulation. If you look at "A Simple Example: Laboratory Experiment", you'll see a small algorithm I can adapt. The thing is: once I can estimate the power, what should I use as the input data points in this function to get the minimum sample size for significance — the number of bootstraps or the original sample? Consider that the original sample always has at least 15k data points per group (test and control) and that I usually use 1,000 bootstrap replicates. For the function mentioned in point 1 it's easier, because I just pass the original sample size (as "nobs1"), but I really have no idea for bootstrapping.
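One way to read the cited paper's algorithm: the resampling input is always the original (pilot) sample, drawn at the candidate sample size; the bootstrap replicates only serve to run each simulated test. A minimal sketch under that reading (function names, simulation counts, and toy pilot data are my own assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_power(pilot_a, pilot_b, n_per_group,
                    n_sim=100, n_boot=200, alpha=0.1):
    """Estimated power at a candidate per-group size: the fraction of
    simulated experiments whose percentile-bootstrap CI excludes 0.
    Resampling is always from the ORIGINAL pilot samples."""
    rejections = 0
    for _ in range(n_sim):
        # simulate one future experiment of n_per_group points per group
        a = rng.choice(pilot_a, size=n_per_group, replace=True)
        b = rng.choice(pilot_b, size=n_per_group, replace=True)
        # run the usual percentile-bootstrap test on the simulated data
        diffs = np.array([
            rng.choice(a, size=n_per_group, replace=True).mean()
            - rng.choice(b, size=n_per_group, replace=True).mean()
            for _ in range(n_boot)
        ])
        lo, hi = np.percentile(diffs, [100 * alpha / 2,
                                       100 * (1 - alpha / 2)])
        if lo > 0 or hi < 0:
            rejections += 1
    return rejections / n_sim

# toy pilot data; in practice these would be the 15k-point group samples
pilot_a = rng.normal(0.33, 0.10, size=1000)
pilot_b = rng.normal(0.30, 0.10, size=1000)
power_at_200 = bootstrap_power(pilot_a, pilot_b, n_per_group=200)
```

To get the minimum sample size, one would then increase `n_per_group` (e.g. over a grid, or by bisection) until the estimated power first reaches the 0.8 target.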

1 Answer


"Statistical significance" means very little and is quite arbitrary. I would instead concentrate on compatibility intervals a.k.a. confidence intervals. Use whichever confidence interval method you think is most likely to be accurate, which often requires a lot of care if using the bootstrap because there are so many variants of the bootstrap. Compute the confidence interval for a data sample. Then use the universal statistical rule that the confidence interval width decreases in proportion to the square root of the sample size. Solve for the larger sample size $n$ such that the expected width of the confidence interval equals the desired precision. You can think of half the width as a margin of error in estimating something. For example, if you observed a margin of error of $\pm 3$ with 0.95 "confidence" at a sample size of $n=100$, bumping the sample size to $n=400$ will lead to an expected margin of error of $1.5$.

Frank Harrell