We have generated an elastic net model on a small dataset, where we use gene expression data to calculate a biomarker score to discriminate patients with condition X vs controls.
The dataset is too small to carve out a meaningful hold-out validation set, so we use nested LOO-CV for performance estimation, and the model seems to work reasonably well.
Now, we want to recruit more patients to the study to validate our model further. The question my colleagues are asking is: can we somehow calculate how many patients we should be recruiting? (so the ML equivalent of a statistical power calculation, I guess)
My gut feeling is that the answer is no (the more samples, the better!). What we could do instead is decide what statistical tests we want to run on the biomarker scores and do a power calculation for those, e.g. if we will run a t-test comparing biomarker score distributions between controls and patients, we could calculate the sample size needed to have adequate power in that test.
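To make the idea concrete, here is a sketch of the power calculation I have in mind, using `statsmodels`. The effect size (Cohen's d = 0.8) is a placeholder; in practice we would estimate it from the biomarker score distributions in our pilot data.

```python
# Sample size for a two-sample t-test on biomarker scores (patients vs controls).
# effect_size=0.8 is a hypothetical value, not an estimate from our data.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8,
                                   alternative="two-sided")
print(f"Samples needed per group: {n_per_group:.1f}")
```

Note this only powers the downstream test; it says nothing about whether the model's coefficients would stabilize with more training data.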
Am I thinking about this right? Does anyone have any pointers, or materials I could read?
Thanks!
EDIT: Further details on LOO-CV:
For each fold, we took one sample out and split the remaining set for hyperparameter tuning (also by LOO-CV), then used the best model to predict the one sample we had taken out. We compared both the hyperparameters and the fitted parameters across folds and saw that they were all quite similar. To get a final model, we re-trained on all our samples to create a final "equation" we could use to give scores to new samples (which is what we would like to validate).
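For clarity, the procedure above looks roughly like this in scikit-learn. The synthetic data, hyperparameter grid, and scoring choice are all placeholders standing in for our actual gene expression setup:

```python
# Nested LOO-CV: outer loop estimates performance, inner loop tunes hyperparameters.
# make_classification data and the small grid are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, LeaveOneOut

X, y = make_classification(n_samples=30, n_features=50, random_state=0)

param_grid = {"alpha": [0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}
preds = np.empty(len(y))
for train_idx, test_idx in LeaveOneOut().split(X):
    # Inner LOO-CV picks hyperparameters using only the n-1 training samples
    inner = GridSearchCV(ElasticNet(max_iter=10000), param_grid,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
    inner.fit(X[train_idx], y[train_idx])
    # Score the single held-out sample with the tuned model
    preds[test_idx] = inner.predict(X[test_idx])

# Final "equation": re-run the full tuning procedure on all samples
final = GridSearchCV(ElasticNet(max_iter=10000), param_grid,
                     cv=LeaveOneOut(), scoring="neg_mean_squared_error")
final.fit(X, y)
```

The `preds` array gives one unbiased-ish score per sample for performance estimation, while `final.best_estimator_` is the model we would apply to newly recruited patients.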