Clarification regarding ML model deployment and fitting it to the entire data set or just the training data

Question

My question is: after I have performed CV and tuned my model, is there a standard practice among data scientists for fitting the final model to the entire data set versus only the training data? Could fitting to the whole data set introduce some overfitting? Please correct me if my logic is wrong, as these are just my initial thoughts.

While I think it's probably OK not to fit the final model to ALL the data for small data sets (since that means you are only excluding a small portion?), if your data set were large I would think you'd want to use all of your data, since otherwise it would almost be wasteful? And we don't have to worry about any extra "overfitting" from fitting to the entire data set, because we are not going to go back and change our tuned parameters anyway, since that would be "peeking"?

This is my first question on this beautiful website :D Thanks!

score 1 · Answer 1 · answered Mar 19 '22 at 11:19

Once you are done training, validating, and testing, you should have a pessimistic estimate of your model's performance. Here, you are essentially spending data that could have gone into modeling on evaluation instead. This answers the question "how good is my model?", i.e. metrics, confidence scores, etc.

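However, you have not used all the data for training. More training data will almost always lead to better generalization, so once the hyperparameters are fixed, it is common to refit the final model on the entire data set and treat the earlier held-out score as a conservative estimate of its performance. Several posts explain this idea, e.g. "Can increasing the amount of training data make overfitting worse?"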

Regarding small vs. large data: with a large data set you are a lot more likely to discard instances in some way (sampling, outlier removal, dropping rows with missing values, ...), because enough data remains to capture the underlying pattern. In small data sets, not training on all the data can leave your model unable to generalize well in sparse regions of the feature space. But as mentioned earlier, you definitely want to make use of all your data in both cases.
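
To make the workflow concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration and not taken from the question: the synthetic data, the toy 1-D ridge regression, the helper names (`fit_ridge`, `mse`, `cv_score`), and the candidate penalties. It tunes the penalty with CV on the training split only, scores once on the held-out test split, and then refits on all the data with the penalty left unchanged.

```python
import random

def fit_ridge(xs, ys, lam):
    """Closed-form 1-D ridge regression without intercept: w = sum(xy) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cv_score(xs, ys, lam, k=5):
    """Mean validation MSE over k folds; fold i holds out every k-th point starting at i."""
    scores = []
    for i in range(k):
        tr_x = [x for j, x in enumerate(xs) if j % k != i]
        tr_y = [y for j, y in enumerate(ys) if j % k != i]
        w = fit_ridge(tr_x, tr_y, lam)
        scores.append(mse(w, xs[i::k], ys[i::k]))
    return sum(scores) / k

# Synthetic data (hypothetical): y = 3x + noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [3.0 * x + rng.gauss(0, 0.5) for x in xs]

# Hold out the last 20% as a test set that is never used for tuning.
train_x, test_x = xs[:160], xs[160:]
train_y, test_y = ys[:160], ys[160:]

# 1) Tune the hyperparameter with CV on the training split only.
lambdas = [0.0, 0.1, 1.0, 10.0]
best_lam = min(lambdas, key=lambda lam: cv_score(train_x, train_y, lam))

# 2) Estimate performance once on the untouched test split.
w_train = fit_ridge(train_x, train_y, best_lam)
test_mse = mse(w_train, test_x, test_y)

# 3) Refit on ALL data with the hyperparameter frozen; this is the model to deploy.
final_w = fit_ridge(xs, ys, best_lam)

print(f"best lambda: {best_lam}, held-out MSE: {test_mse:.3f}, final weight: {final_w:.3f}")
```

The point of step 3 is that `best_lam` is not revisited after seeing the test score, so the refit does not "peek"; the held-out MSE from step 2 remains the reported estimate for the deployed model.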
