I have been reading about flaws in model selection techniques such as significance-based elimination and backwards selection via AIC (or similar) in the context of regression: they lead to inflated coefficient estimates, confidence intervals that are too narrow, and p-values that are lower than they should be.
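To make this concrete, here is a minimal simulation of significance-based backwards elimination on pure noise, so that any "significant" result is spurious by construction (the sample sizes and the 0.05 threshold are arbitrary choices of mine):

```python
# Simulated demonstration: y is unrelated to all 30 predictors, so the
# true model is empty and any surviving "significant" term is an artefact
# of the selection procedure itself.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 100, 30
X = rng.normal(size=(n, p))   # 30 noise predictors
y = rng.normal(size=n)        # outcome unrelated to any of them

cols = list(range(p))
while True:
    model = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
    pvals = model.pvalues[1:]          # skip the intercept
    worst = int(pvals.argmax())
    if pvals[worst] < 0.05 or len(cols) == 1:
        break                          # everything left looks "significant"
    cols.pop(worst)                    # drop the least significant term

print(model.summary())
```

The final summary typically reports several predictors with p < 0.05 and tight confidence intervals, even though the truth is pure noise, because the standard errors take no account of the search that produced the model.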
In the health domain, the area in which I work, it is not uncommon to see such techniques used in journal articles. Whilst techniques such as train/test splitting and cross-validation can reduce this issue, they are also often not used, since the primary goal is statistical inference rather than prediction per se.
This led me to think about the following questions:
Given that much research in the health domain uses such model selection techniques without correcting for them, does this mean it is likely that many models in this domain are overfitted, and therefore that the results of these articles are inflated?
Even though the primary goal is inference and not prediction, it still seems that these regression models can be overfitted. Since these articles often do not use train/test splits or cross-validation, is this a further source of overfitting that could be very widespread in this domain?
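For illustration, this is the kind of check that is skipped: comparing in-sample fit against cross-validated performance. The data here are simulated, with one real signal among many noise variables, and all sizes are arbitrary:

```python
# Compare apparent (in-sample) R^2 with 5-fold cross-validated R^2.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 25))
y = 0.5 * X[:, 0] + rng.normal(size=120)   # one real predictor, 24 noise terms

model = LinearRegression()
in_sample_r2 = model.fit(X, y).score(X, y)
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

print(f"in-sample R^2: {in_sample_r2:.2f}, 5-fold CV R^2: {cv_r2:.2f}")
```

The in-sample R^2 will be noticeably higher than the cross-validated one; only the latter reflects performance on unseen data, which is exactly the gap I understand overfitting to be.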
Although machine learning is mostly concerned with prediction, could it not also be a valid option for describing relationships, whilst reducing the overfitting that I hypothesise occurs when researchers use such statistical methods? After all, if an algorithm can predict with decent performance on an unseen dataset, then it clearly "understands" the relationship between the variables and the output. Whilst some algorithms are black boxes, many have feature importance tools (e.g. random forests, XGBoost), and explainable-AI (xAI) techniques can be used to elucidate these relationships.
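As a hedged sketch of what I mean, here is a random forest fitted to simulated data in which only two of ten features truly matter; permutation importance on held-out data then recovers them (the model settings are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=300)  # only 0 and 3 matter

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Importance measured on held-out data, so it reflects generalisable
# structure rather than quirks of the training sample.
imp = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print("features ranked by importance:", np.argsort(imp.importances_mean)[::-1])
```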
Since machine learning techniques are often better equipped to handle many features, could an appropriate solution to model selection be to derive the most important features from the machine learning algorithms that perform best, and then use those features in the final model? This seems like a better choice than model selection based on significance or backwards elimination.
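A rough sketch of that two-stage idea, again on simulated data: rank features with a random forest on one half of the data, then fit an ordinary regression on the other half using only the top-ranked features, so the final fit does not reuse the observations that drove the selection (the number of features kept is an arbitrary choice of mine):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 20))
y = 1.5 * X[:, 2] + 0.8 * X[:, 7] + rng.normal(size=400)  # 2 and 7 matter

# Stage 1: rank features on one half of the data.
X_sel, X_inf, y_sel, y_inf = train_test_split(X, y, test_size=0.5, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_sel, y_sel)
top = np.argsort(rf.feature_importances_)[::-1][:3]   # keep the 3 strongest

# Stage 2: ordinary regression on the other half, using only those features.
ols = sm.OLS(y_inf, sm.add_constant(X_inf[:, top])).fit()
print("selected features:", top)
print(ols.summary())
```

Whether this actually yields valid inference is part of what I am asking.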
I am an expert in none of the fields related to this question, so apologies for my ignorance.