I am training a binary classifier on a time series in a walk-forward fashion: train on data from $t \in \{0,\dots,T-1\}$ and predict for $t = T$ (the following day), then train on $t \in \{1,\dots,T\}$ and predict for $t = T+1$, and so on.
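For concreteness, the walk-forward loop looks roughly like this (a minimal sketch with synthetic data and a logistic regression as a placeholder model; the feature matrix, window length `T`, and classifier are all assumptions, not my actual setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical data: one feature row and one binary label per day.
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

T = 60  # training-window length (assumption)
scores, labels = [], []
for start in range(len(X) - T):
    train = slice(start, start + T)  # days start .. start+T-1
    test = start + T                 # the following day
    clf = LogisticRegression().fit(X[train], y[train])
    scores.append(clf.predict_proba(X[[test]])[0, 1])
    labels.append(y[test])

scores, labels = np.array(scores), np.array(labels)
# One genuinely out-of-sample score per walked-forward day.
```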
I assessed the calibration of the model using the `predict_proba` scores and the true labels, pooled over all the test sets of the walk-forward procedure, and it does not look great for scores above 0.5.
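The assessment itself was done along these lines, with scikit-learn's `calibration_curve` on the pooled out-of-sample scores (the synthetic scores and labels below are stand-ins, deliberately miscalibrated in the upper range):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(1)
# Hypothetical pooled walk-forward scores and true labels.
scores = rng.uniform(size=2000)
# Labels drawn so the scores are overconfident for high values.
labels = (rng.uniform(size=2000) < scores ** 1.5).astype(int)

# Quantile bins keep the per-bin sample counts roughly equal.
prob_true, prob_pred = calibration_curve(
    labels, scores, n_bins=10, strategy="quantile"
)
for p_hat, p_obs in zip(prob_pred, prob_true):
    print(f"mean score {p_hat:.2f} -> observed frequency {p_obs:.2f}")
```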

Wondering if I could improve it with post-hoc calibration, I used the test sets up to the last-but-tenth as the calibration set, and the remaining 25 sets (out of 210) as test sets to assess the calibration procedure. I got this:
Overall, the log loss decreased:

Spline calibrated log loss = 0.55 ; uncalibrated log loss = 0.71

as did the Brier score:

Spline calibrated Brier score = 0.18 ; uncalibrated Brier score = 0.24
but, visually, it still does not look great... I certainly need more statistics for the higher bins.
In general, is this way of calibrating a model trained in a walk-forward fashion sound, or not? How should one define a calibration set in the case of walk-forward training?
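For reference, the temporal split I used can be sketched like this. Since spline calibration is not in scikit-learn, the sketch uses isotonic regression as a stand-in calibrator; the synthetic scores, the one-prediction-per-test-set simplification, and the clipping constants are all assumptions:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss, log_loss

rng = np.random.default_rng(2)
# Hypothetical pooled walk-forward output: 210 test sets,
# simplified to one score/label each.
scores = rng.uniform(size=210)
labels = (rng.uniform(size=210) < scores ** 1.5).astype(int)

# Temporal split: earlier test sets fit the calibrator,
# the last 25 assess it (no shuffling, order preserved).
n_holdout = 25
s_cal, y_cal = scores[:-n_holdout], labels[:-n_holdout]
s_test, y_test = scores[-n_holdout:], labels[-n_holdout:]

# Isotonic regression as a stand-in for the spline calibrator;
# y_min/y_max keep log loss finite.
iso = IsotonicRegression(out_of_bounds="clip", y_min=1e-6, y_max=1 - 1e-6)
iso.fit(s_cal, y_cal)
p_test = iso.predict(s_test)

print("uncalibrated log loss:", log_loss(y_test, s_test, labels=[0, 1]))
print("calibrated   log loss:", log_loss(y_test, p_test, labels=[0, 1]))
print("uncalibrated Brier:", brier_score_loss(y_test, s_test))
print("calibrated   Brier:", brier_score_loss(y_test, p_test))
```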
EDIT
Plotting the calibration curve as well shows that the model has no clue what to do for high scores:

