
After a discussion with some colleagues, I've realized that we have different views on which is the go-to strategy for model training.

Strategy A: Train-Validation-Test Split and Final Model Selection

  1. Divide the data into train, validation, and test sets.
  2. Use the train and validation sets to determine optimal hyperparameters.
  3. Test the model on the test set for unbiased performance evaluation.
  4. Select the best model from the train-validation phase and consider it the final model (see the sketch after this list).
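
For concreteness, here is a minimal sketch of Strategy A. It assumes scikit-learn, a random forest, and a tiny hyperparameter grid; the data, model class, and grid are placeholders I've introduced, not part of the question itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Placeholder data; substitute your own X, y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# 1. Split into train (60%), validation (20%), and test (20%).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# 2. Pick hyperparameters by validation performance.
best_score, best_params, best_model = -np.inf, None, None
for n_estimators in (50, 100, 200):  # hypothetical grid
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_score, best_params, best_model = score, {"n_estimators": n_estimators}, model

# 3. One-time, unbiased evaluation on the untouched test set.
print("test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))

# 4. The model tuned on train/validation *is* the final model.
final_model = best_model
```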

Strategy B: Re-Training with All Data

  1. Divide the data into train, validation, and test sets.
  2. Use the train and validation sets to determine optimal hyperparameters.
  3. Test the model on the test set for unbiased performance evaluation.
  4. Train a new model on all three sets combined (i.e., all data currently available) using the best hyperparameters, and consider this new model the final model (see the sketch after this list).
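
Strategy B differs only in the last step. A minimal sketch under the same placeholder assumptions as above (scikit-learn, random forest, hypothetical grid):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Placeholder data; substitute your own X, y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Steps 1-3 as in Strategy A: split, tune on train/validation, score once on test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_score, best_params = -np.inf, None
for n_estimators in (50, 100, 200):  # hypothetical grid
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_score, best_params = score, {"n_estimators": n_estimators}

eval_model = RandomForestClassifier(**best_params, random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, eval_model.predict(X_test)))

# 4. Refit with the chosen hyperparameters on *all* available data.
#    Note: this refit model is never itself evaluated on held-out data.
final_model = RandomForestClassifier(**best_params, random_state=0).fit(X, y)
```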

Maybe the most important thing I'd like to be sure about is:

  • are both strategies valid and used in practice?

If they are,

  • which would you say is the more widely adopted?

If anyone has experience or insights into the trade-offs of these strategies I would greatly appreciate your input. Are there any best practices or guidelines that could help me decide which strategy to pursue? Or perhaps a hybrid approach that combines the strengths of both strategies?

Cheers!

rusiano
  • The trouble with the first approach is that it discards valuable training data that could have been used to tighten up model estimates. This is remedied by the second approach, but the issue is that the model being put into production has yet to demonstrate its ability to predict out-of-sample. Enter bootstrap validation. – Dave Aug 31 '23 at 13:33
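
For readers unfamiliar with the bootstrap validation Dave mentions, here is a minimal sketch of the idea, assuming fixed hyperparameters (the data, model, and settings are placeholders). Each bootstrap model is trained on a resample of the full data and scored on the rows that resample missed; the average of those out-of-bag scores estimates how a model fit on all data will perform out of sample.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Placeholder data and hyperparameters; substitute your own.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
n, n_boot = len(y), 200

oob_scores = []
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)       # resample the full data with replacement
    oob = np.setdiff1d(np.arange(n), idx)  # rows the resample missed ("out-of-bag")
    model = RandomForestClassifier(n_estimators=100, random_state=b)
    model.fit(X[idx], y[idx])
    oob_scores.append(accuracy_score(y[oob], model.predict(X[oob])))

print("bootstrap out-of-sample accuracy estimate:", np.mean(oob_scores))

# The production model is then fit on all available data, as in Strategy B,
# with the bootstrap average above serving as its performance estimate.
final_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```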
