
We are working on an ML model that predicts a numeric result (call it $\hat{x}$). Eventually, we will perform an A/B test, where the metric is a function that takes $\hat{x}$ as an input (call it $f(\hat{x})$). Our control group will get the input from the old source (call it $x$, so its metric is calculated as $f(x)$), and the variant group will get the input as an output from the model (thus its metric is calculated as $f(\hat{x})$). We will then test whether the variant group's metric mean exceeds the control group's by at least a specified effect size.

We want to perform the experiment's power analysis right away, even before the model is finished, so that we understand the experiment and know whether the required sample size is feasible and thus whether the improvement promised by the model is testable in this way. We have a reasonable effect size in mind (call it $\bar{x}$), as well as the standard $\alpha$ (0.05) and power (80%).

We can calculate $f(x)$ on past data, and it is from that data that we hope to shape the power analysis. One thing I am realizing is that the choice of past population from which to take the standard deviation $s$ used in the power analysis input (Cohen's $d = \bar{x}/s$) can have an enormous impact on the required sample size for the future experiment. In our case, the sample size output is extremely sensitive to that $s$. So my question is: what is a good population from which to get $s$? Some of us feel we should use the model's initial training data. Others feel that constitutes some sort of leakage and that we should use the initial test data. It may be that the answer varies depending on other factors. If the train/test choice is not clear cut, what are some guidelines to keep in mind?
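To make that sensitivity concrete, here is a minimal sketch using `statsmodels` to solve for the per-group sample size of a one-sided two-sample t-test at $\alpha = 0.05$ and 80% power. The effect size and the candidate values of $s$ are hypothetical stand-ins, not our actual data:

```python
# Minimal sketch: how the required sample size moves with the choice of s.
# x_bar and candidate_s are made-up values for illustration only.
from statsmodels.stats.power import TTestIndPower

x_bar = 0.5                    # hoped-for raw effect on the f(.) scale (hypothetical)
candidate_s = [1.0, 2.0, 4.0]  # std devs from different candidate past populations

analysis = TTestIndPower()
for s in candidate_s:
    d = x_bar / s              # Cohen's d computed from that population's s
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8,
                             alternative='larger')
    print(f"s = {s}: d = {d:.2f}, n per group ~ {n:.0f}")
```

Since $n$ scales roughly as $1/d^2 = (s/\bar{x})^2$, doubling $s$ roughly quadruples the required sample size, which is why the choice of population matters so much.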

  • Please can you clarify: 1) whether the ML model in your treatment and the old source in your control take the same or disjoint input datasets to produce $\hat{x}$ and $x$ respectively? 2) whether the train/test data refers to the input to the ML model, and why there is a significantly different sample standard deviation between these two sets of data? In general, though, for experiments with multiple layers of input/output, each with its own level of variability, it often helps to perform power calculations using simulations rather than relying on formulas (see the sketch after this thread). – B.Liu Nov 22 '21 at 22:19
  • 1) Disparate input sets; 2) train/test refers to the ML input. The significant std dev differences are in the past populations we've been analyzing for the power analysis. That std dev is itself highly variable: one month's worth of data can have a very different std dev from another's. – sparc_spread Nov 25 '21 at 05:04
  • $s$ here must be the standard deviation of your metric, which is a function of the completed model, yes? $s$ will depend on your model, so the only hope of getting a sample size before the model is finished is to decide to bound the out-of-sample sd of your metric. – Jonny Lomond Nov 26 '21 at 20:18
  • Yes, you are correct about $s$. But it can be calculated on any past data; the model is just predicting the input used to calculate it. The question is which past data to use: training, test, or something else. – sparc_spread Nov 27 '21 at 01:17
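Following B.Liu's suggestion in the thread above, a simulation-based power calculation sidesteps the Cohen's $d$ formula entirely: resample historical $f(x)$ values for the control arm, add the hoped-for effect for the variant arm, and count how often a one-sided t-test rejects. Everything below (the synthetic historical data, the effect size, the candidate sample sizes) is an assumed stand-in for the actual pipeline:

```python
# Monte Carlo power sketch (all inputs are hypothetical stand-ins):
# estimate power at several candidate sample sizes by resampling past f(x).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
historical_fx = rng.normal(10.0, 2.0, size=5000)  # stand-in for past f(x) values
effect = 0.5                                      # hoped-for lift in the metric mean

def simulated_power(n, n_sims=2000, alpha=0.05):
    """Fraction of simulated experiments where a one-sided t-test rejects."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.choice(historical_fx, size=n, replace=True)
        variant = rng.choice(historical_fx, size=n, replace=True) + effect
        _, p = ttest_ind(variant, control, alternative='greater')
        rejections += p < alpha
    return rejections / n_sims

for n in (100, 200, 400, 800):
    print(f"n = {n}: power ~ {simulated_power(n):.2f}")
```

One advantage of this approach is that resampling real historical $f(x)$ bakes in whatever variability the data actually has, rather than committing to a single $s$ up front; rerunning it on different past windows (e.g. different months) also shows directly how unstable the sample-size answer is.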