
We are working on an ML model that predicts a numeric result (call it $\hat{x}$). Eventually, we will perform an A/B test, where the metric is a function that takes $\hat{x}$ as an input (call it $f(\hat{x})$). Our control group will get the input from the old source (call it $x$, so its metric is calculated as $f(x)$), and the variant group will get the input as an output from the model (thus its metric is calculated as $f(\hat{x})$). We will then test whether the variant group's metric mean exceeds the control group's by at least a specified effect size.

We want to perform the experiment's power analysis right away, even before the model is finished, so that we understand the experiment and know whether the required sample size is feasible and thus whether the improvement promised by the model is testable in this way. We have a reasonable effect size in mind (call it $\bar{x}$), as well as the standard $\alpha$ (0.05) and power (80%).

We can calculate $f(x)$ on past data, and it is from that data that we hope to shape the power analysis. One thing I am realizing is that the choice of past population from which to take the standard deviation $s$ used in the power analysis input (Cohen's $d = \bar{x}/s$) can have an enormous impact on the required sample size for the future experiment. In our case, the sample size output is extremely sensitive to that $s$. So my question is: what is a good population from which to get $s$? Some of us feel we should use the model's initial training data. Others feel that constitutes some sort of leakage and that we should use the initial test data. It may be that the answer varies depending on other factors. If the train/test choice is not clear cut, what are some guidelines to keep in mind?
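To make that sensitivity concrete, here is a minimal sketch using `statsmodels` to solve for the per-group sample size of a one-sided two-sample t-test at $\alpha = 0.05$ and 80% power. The effect size and the candidate values of $s$ are hypothetical stand-ins, not our actual data:

```python
# Minimal sketch: how the required sample size moves with the choice of s.
# x_bar and candidate_s are made-up values for illustration only.
from statsmodels.stats.power import TTestIndPower

x_bar = 0.5                    # hoped-for raw effect on the f(.) scale (hypothetical)
candidate_s = [1.0, 2.0, 4.0]  # std devs from different candidate past populations

analysis = TTestIndPower()
for s in candidate_s:
    d = x_bar / s              # Cohen's d computed from that population's s
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8,
                             alternative='larger')
    print(f"s = {s}: d = {d:.2f}, n per group ~ {n:.0f}")
```

Since $n$ scales roughly as $1/d^2 = (s/\bar{x})^2$, doubling $s$ roughly quadruples the required sample size, which is why the choice of population matters so much.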

  • Please can you clarify: 1) whether the ML model in your treatment and the old source in your control take the same or disjoint input datasets to produce $\hat{x}$ and $x$ respectively? 2) whether the train/test data refers to the input to the ML model, and why there is a significantly different sample standard deviation between these two sets of data? In general, though, for experiments with multiple layers of input/output, each with its own level of variability, it often helps to perform power calculations using simulations rather than relying on formulas (see the sketch after this thread). – B.Liu Nov 22 '21 at 22:19
  • 1) Disparate input sets; 2) train/test refers to the ML input. The significant std dev differences are in the past populations we've been analyzing for the power analysis. That std dev is itself highly variable: one month's worth of data can have a very different std dev from another's. – sparc_spread Nov 25 '21 at 05:04
  • $s$ here must be the standard deviation of your metric, which is a function of the completed model, yes? $s$ will depend on your model, so the only hope of getting a sample size before the model is finished is to decide to bound the out-of-sample sd of your metric. – Jonny Lomond Nov 26 '21 at 20:18
  • Yes, you are correct about $s$. But it can be calculated on any past data; the model is just predicting the input used to calculate it. The question is which past data to use: training, test, or something else. – sparc_spread Nov 27 '21 at 01:17
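Following B.Liu's suggestion in the thread above, a simulation-based power calculation sidesteps the Cohen's $d$ formula entirely: resample historical $f(x)$ values for the control arm, add the hoped-for effect for the variant arm, and count how often a one-sided t-test rejects. Everything below (the synthetic historical data, the effect size, the candidate sample sizes) is an assumed stand-in for the actual pipeline:

```python
# Monte Carlo power sketch (all inputs are hypothetical stand-ins):
# estimate power at several candidate sample sizes by resampling past f(x).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
historical_fx = rng.normal(10.0, 2.0, size=5000)  # stand-in for past f(x) values
effect = 0.5                                      # hoped-for lift in the metric mean

def simulated_power(n, n_sims=2000, alpha=0.05):
    """Fraction of simulated experiments where a one-sided t-test rejects."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.choice(historical_fx, size=n, replace=True)
        variant = rng.choice(historical_fx, size=n, replace=True) + effect
        _, p = ttest_ind(variant, control, alternative='greater')
        rejections += p < alpha
    return rejections / n_sims

for n in (100, 200, 400, 800):
    print(f"n = {n}: power ~ {simulated_power(n):.2f}")
```

One advantage of this approach is that resampling real historical $f(x)$ bakes in whatever variability the data actually has, rather than committing to a single $s$ up front; rerunning it on different past windows (e.g. different months) also shows directly how unstable the sample-size answer is.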