
Something I have been noticing lately is that Machine Learning practitioners appear to talk about Cross Validation a lot more than those working with traditional statistical models (e.g. Regression).

For example, in almost any Machine Learning paper, the results section usually contains an extensive Cross Validation component in which the Machine Learning models (e.g. Random Forest, Gradient Boosting, Neural Networks) are evaluated against testing and validation sets. As a matter of fact, key emerging concepts in Machine Learning such as "Double Descent" (e.g. https://openai.com/blog/deep-double-descent/) are rooted in the context of Cross Validation.

On the other hand, I don't see Cross Validation being used as extensively in Statistics papers (e.g. within Epidemiology). In many papers, the researchers fit a Regression model to their data and then report on the quality of the "model fit" and the statistical significance of the regression coefficients, but they do not seem to engage in Cross Validation as such (i.e. building their model on a random subset of the data and evaluating it on the held-out remainder).

I have heard researchers argue that modern Cross Validation has its roots purely in Statistics. The theoretical framework of Cross Validation largely borrows from concepts such as Bootstrap Sampling, in which we assert that a statistic calculated on multiple random samples from a population converges to the true value of that statistic.
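
(As a rough sketch of the resampling idea I have in mind: a minimal example assuming NumPy, with a purely hypothetical simulated sample.)

    # Illustrative sketch: resample a dataset with replacement many times and
    # look at how the resampled means concentrate around the full-sample mean.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=5.0, scale=2.0, size=500)  # hypothetical sample

    n_boot = 2000
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        resample = rng.choice(data, size=data.size, replace=True)  # sample with replacement
        boot_means[b] = resample.mean()

    print(f"full-sample mean:         {data.mean():.3f}")
    print(f"mean of bootstrap means:  {boot_means.mean():.3f}")
    print(f"bootstrap standard error: {boot_means.std(ddof=1):.3f}")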

However, I am still not sure why Cross Validation appears to be more popular in Machine Learning compared to Statistics. How exactly would Cross Validation be used to test whether the statistical significance and impact of covariate predictors on the response variable are similar on the training data vs. the test data?

Can someone please comment on this?

asked by stats_noob (edited by Dave)
    What is statistics, what is machine learning, and what is the border between the two? I do not see how to answer this question without first answering those. – Dave Nov 24 '22 at 17:17
  • @ Dave: thank you for your comment! I guess the distinction I was making was between specifically "Regression Models" (Statistics) vs models such as Random Forest, Gradient Boosting, Neural Networks. I agree - the border between statistics and machine learning is often described as blurry (e.g. we could argue that a regression model is trying to "learn" optimal parameters based on a dataset - and that in machine learning we want a model that we "statistically believe" will generalize well to unseen data). – stats_noob Nov 24 '22 at 17:24
  • The answer is No. Cross-validation is extensively used in Statistics and the statistical literature. For a review, see A survey of cross-validation procedures for model selection. – patagonicus Dec 04 '22 at 03:39
  • Your last edit raised an interesting question but really belongs as a separate question. I have rolled back the edit. The history is there, however, if you want to copy and paste your verbatim phrasing into a new question. – Dave Dec 04 '22 at 07:31
  • @ Dave: thank you! how do I view the history? – stats_noob Dec 04 '22 at 07:32
  • Click the “edited” above the username of the last member to edit the post. For other administrative details like this, you might want to take the site tour or ask on Meta. – Dave Dec 04 '22 at 07:35
  • Thank you for this suggestion! – stats_noob Dec 04 '22 at 07:58
  • @ Dave: I just posted it here: https://stats.stackexchange.com/questions/597895/cross-validating-inference-models – stats_noob Dec 04 '22 at 08:04

1 Answer


Cross-validation is a tool for assessing prediction performance. It applies equally well to any type of model, but it is perhaps even more obviously needed for models with hyperparameters that need tuning (e.g. RF, GBDT, NNs, etc.), while fitting a statistical model via maximum likelihood may seem not to involve tunable hyperparameters (although you can of course apply penalization etc.).
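
As a minimal sketch of that distinction (assuming scikit-learn and a toy synthetic dataset; the names and numbers are purely illustrative, not anything from the original post): the same k-fold routine scores a plain linear regression, which has nothing to tune, and a ridge regression, whose penalty is tuned inside each training fold.

    # Illustrative sketch only: k-fold CV for a model with no hyperparameters
    # (OLS) and for one whose hyperparameter (the ridge penalty) is tuned
    # within each training fold via an inner grid search.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

    # Toy data; in practice this would be your own design matrix and response.
    X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)

    # Plain regression: CV only estimates out-of-sample performance.
    ols_scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

    # Penalized regression: CV both tunes alpha and estimates performance
    # (nested cross-validation).
    ridge = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    ridge_scores = cross_val_score(ridge, X, y, cv=cv, scoring="r2")

    print(f"OLS   mean CV R^2: {ols_scores.mean():.3f}")
    print(f"Ridge mean CV R^2: {ridge_scores.mean():.3f}")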

If you do inference, such as answering whether a treatment improves blood pressure compared with placebo in a randomized trial using a pre-specified regression analysis model, there is usually no point in doing cross-validation. There are no hyperparameters to choose, and the predictive capability of the model is not really of interest per se.

There are, of course, areas that lie between these two cases, where the answer is not so clear and where mixtures of approaches may be appropriate.

Note: As an aside, by most definitions of "machine learning" any regression model is a form of machine learning.

Björn