I have been reading about flaws in model selection techniques such as significance-based elimination and backwards selection via AIC (or similar) in the context of regression: they lead to inflated coefficient estimates, confidence intervals that are too narrow, and p-values that are lower than they should be.
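To make this concrete, here is a minimal simulation of significance-based backwards elimination on pure noise, so that any "significant" result is spurious by construction (the sample sizes and the 0.05 threshold are arbitrary choices of mine):

```python
# Simulated demonstration: y is unrelated to all 30 predictors, so the
# true model is empty and any surviving "significant" term is an artefact
# of the selection procedure itself.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 100, 30
X = rng.normal(size=(n, p))   # 30 noise predictors
y = rng.normal(size=n)        # outcome unrelated to any of them

cols = list(range(p))
while True:
    model = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
    pvals = model.pvalues[1:]          # skip the intercept
    worst = int(pvals.argmax())
    if pvals[worst] < 0.05 or len(cols) == 1:
        break                          # everything left looks "significant"
    cols.pop(worst)                    # drop the least significant term

print(model.summary())
```

The final summary typically reports several predictors with p < 0.05 and tight confidence intervals, even though the truth is pure noise, because the standard errors take no account of the search that produced the model.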
In the health domain, the area in which I work, it is not uncommon to see such techniques used in journal articles. Whilst techniques such as train/test splitting and cross-validation can reduce this issue, they are also often not used, since the primary goal is statistical inference rather than prediction per se.
This led me to think about the following questions:
Given that much research in the health domain uses such model selection techniques without correcting for them, does this mean it is likely that many models in this domain are overfitted, and therefore that the results of these articles are inflated?
Even though the primary goal is inference and not prediction, it still seems that these regression models can be overfitted. Since these articles often do not use train/test splits or cross-validation, is this a further source of overfitting that could be very widespread in this domain?
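For illustration, this is the kind of check that is skipped: comparing in-sample fit against cross-validated performance. The data here are simulated, with one real signal among many noise variables, and all sizes are arbitrary:

```python
# Compare apparent (in-sample) R^2 with 5-fold cross-validated R^2.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 25))
y = 0.5 * X[:, 0] + rng.normal(size=120)   # one real predictor, 24 noise terms

model = LinearRegression()
in_sample_r2 = model.fit(X, y).score(X, y)
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

print(f"in-sample R^2: {in_sample_r2:.2f}, 5-fold CV R^2: {cv_r2:.2f}")
```

The in-sample R^2 will be noticeably higher than the cross-validated one; only the latter reflects performance on unseen data, which is exactly the gap I understand overfitting to be.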
Although machine learning is mostly concerned with prediction, could it not also be a valid option for describing relationships, whilst reducing the overfitting that I hypothesise occurs when researchers use such statistical methods? After all, if an algorithm can predict with decent performance on an unseen dataset, then it clearly "understands" the relationship between the variables and the output. Whilst some algorithms are black boxes, many have feature importance tools (e.g. random forests, XGBoost), and explainable-AI (xAI) techniques can be used to elucidate these relationships.
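As a hedged sketch of what I mean, here is a random forest fitted to simulated data in which only two of ten features truly matter; permutation importance on held-out data then recovers them (the model settings are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=300)  # only 0 and 3 matter

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Importance measured on held-out data, so it reflects generalisable
# structure rather than quirks of the training sample.
imp = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print("features ranked by importance:", np.argsort(imp.importances_mean)[::-1])
```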
Since machine learning techniques are often better equipped to handle many features, could an appropriate solution to model selection be to derive the most important features from the machine learning algorithms that perform best, and then use those features in the final model? This seems like a better choice than model selection based on significance or backwards elimination.
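A rough sketch of that two-stage idea, again on simulated data: rank features with a random forest on one half of the data, then fit an ordinary regression on the other half using only the top-ranked features, so the final fit does not reuse the observations that drove the selection (the number of features kept is an arbitrary choice of mine):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 20))
y = 1.5 * X[:, 2] + 0.8 * X[:, 7] + rng.normal(size=400)  # 2 and 7 matter

# Stage 1: rank features on one half of the data.
X_sel, X_inf, y_sel, y_inf = train_test_split(X, y, test_size=0.5, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_sel, y_sel)
top = np.argsort(rf.feature_importances_)[::-1][:3]   # keep the 3 strongest

# Stage 2: ordinary regression on the other half, using only those features.
ols = sm.OLS(y_inf, sm.add_constant(X_inf[:, top])).fit()
print("selected features:", top)
print(ols.summary())
```

Whether this actually yields valid inference is part of what I am asking.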
I am an expert in none of the fields related to this question, so apologies for my ignorance.