Refitting model on entire dataset to use in production to make prediction on latest data?

Question

I have a question regarding calibration (train) and holdout (test) periods.I have time based data. I split my data into two sets. Fit/Train (first 2/3 of data) and validation/testing (latest third). So I validated my model to see what was predicted and compared this to what actually happened in this time period. Example: Fit the 2021 and 2022 data and make prediction for 2023. Validate by comparing to what actually happenend in 2023. I want to make predictions about the future and of course I want to use the latest data (2023) which is in my holdout set. Now of course I don't want to refit my model on the entire period (test and validation) as this would make all evaluation useless. Do I just use my model, provide the entire data period (all 3 years) for prediction even though my model is fitted on part of this data?

I use ParetoNBDFitter from https://btyd.readthedocs.io/en/latest/index.html

+1 If it weren’t for the time-dependent nature of your data, a bootstrap approach would make sense to me. Because of the time component, however, it is less clear to me how to proceed. — Dave, Oct 25 '23 at 11:28

score 2 · Answer 1 · answered Oct 25 '23 at 14:21

I understand this is an open topic that won't have a final answer applicable to all cases. This question provides some interesting answers regarding the topic.

In my personal experience I try to use a time-dependent train-test split approach: Basically creates multiple train test splits that progressively uses more data and validate that my model (trained from scratch for each split) works well.

For example:

Train test split (5/10, 5/10) and evaluate
Train test split (6/10, 4/10) and evaluate ...
Train test split (9/10, 1/10) and evaluate

If this works, I am confident that the model dynamics works well for the task and can deliver a final model trained on the whole dataset. But please take this with a grain of salt, the evaluation is task-specific and there are cases where it is absolutelly crucial to keep a out-of-sample dataset validation approach.

Refitting model on entire dataset to use in production to make prediction on latest data?

1 Answers1