
I am trying to predict a binary outcome. My sample size is very small (n = 160) and has a high class imbalance (80:20), so the minority class has only 32 samples. The dataset is also high-dimensional (96 variables), and all the variables are highly correlated.

  1. Can I rely on repeated or nested cross-validation alone for the final evaluation, instead of holding out a test set (20% of the data)?
  2. Or should I use cross-validation only for hyper-parameter optimization and then do the final testing on the held-out test set?
  3. Which feature selection methods are appropriate for highly correlated, high-dimensional data?
Dushi Fdz
  • Does this answer your question? "Validation: Data splitting into training vs. test datasets". Also see this post. On feature selection, see this page, among many others on this site. – EdM Mar 02 '23 at 17:03
  • Thank you for all the useful links. The first link provides information on bootstrap resampling, but it does not answer whether held-out testing is still necessary if I perform repeated/nested cross-validation. – Dushi Fdz Mar 02 '23 at 18:54
  • The first two links also make the point that held-out testing is inadvisable unless you have tens of thousands of observations. You could consider your repeated cross-validation as an alternative to bootstrapping; see Section 5.3.4 of Harrell's Regression Modeling Strategies. He suggests 50 repeats of 10-fold cross-validation. – EdM Mar 02 '23 at 20:28
  • Thank you @EdM. I will check out that section. I am still a little confused: do we do the hyper-parameter optimization with repeated cross-validation and then use the same data and approach for the final evaluation as well? – Dushi Fdz Mar 02 '23 at 21:17
  • In that case, a good approach is to use cross-validation for hyper-parameter optimization, then repeat the entire process (including hyper-parameter optimization) on multiple bootstrapped samples of the data. The "optimism bootstrap" allows you to estimate and correct for overfitting in your modeling process. See this page and this page and this page for the approach. – EdM Mar 02 '23 at 22:13
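
For readers following the comments, the approach described last (cross-validated hyper-parameter optimization, with the whole process repeated on bootstrap samples to estimate optimism) might look roughly like the sketch below. It is only an illustration under several assumptions that are not from the thread: scikit-learn as the library, a default (ridge-penalized) logistic regression as the model, AUC as the performance measure, and synthetic data standing in for the real 160 x 96 dataset. The names `fit_tuned_model` and `optimism_corrected_auc` are hypothetical helpers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_tuned_model(X, y, seed=0):
    """The entire modeling process: scaling plus logistic regression,
    with the regularization strength C chosen by stratified 5-fold CV."""
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
    grid = {"logisticregression__C": [0.01, 0.1, 1.0]}  # assumed grid
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(pipe, grid, scoring="roc_auc", cv=cv)
    return search.fit(X, y)  # refit=True, so the best model is refit on all of X


def optimism_corrected_auc(X, y, n_boot=10, seed=0):
    """Return (apparent AUC, optimism-corrected AUC)."""
    rng = np.random.default_rng(seed)
    n = len(y)

    # Apparent performance: fit on all the data, evaluate on the same data.
    full_model = fit_tuned_model(X, y, seed=seed)
    apparent = roc_auc_score(y, full_model.predict_proba(X)[:, 1])

    optimism = []
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # bootstrap sample, with replacement
        Xb, yb = X[idx], y[idx]
        if np.bincount(yb, minlength=2).min() < 5:
            continue  # too few cases of one class for 5-fold stratified CV
        # Repeat the whole process, tuning included, on the bootstrap sample.
        model = fit_tuned_model(Xb, yb, seed=b)
        auc_boot = roc_auc_score(yb, model.predict_proba(Xb)[:, 1])
        auc_orig = roc_auc_score(y, model.predict_proba(X)[:, 1])
        optimism.append(auc_boot - auc_orig)

    return apparent, apparent - float(np.mean(optimism))


# Synthetic stand-in for the data in the question (n = 160, 96 correlated
# features, roughly 80:20 class balance); replace with the real X and y.
X, y = make_classification(
    n_samples=160, n_features=96, n_informative=10, n_redundant=40,
    weights=[0.8, 0.2], random_state=0,
)
apparent, corrected = optimism_corrected_auc(X, y, n_boot=10)
print(f"apparent AUC: {apparent:.3f}, optimism-corrected AUC: {corrected:.3f}")
```

The point of putting the grid search inside `fit_tuned_model` is that every bootstrap replicate repeats the tuning, so the correction covers overfitting from hyper-parameter selection as well as from model fitting. In practice far more than 10 bootstrap replicates would be used; the small number here only keeps the sketch quick to run.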
