I am training a binary classifier on a time series in a walk-forward fashion: train on data from $t \in \{0,\dots,T-1\}$ and predict for $t = T$ (the following day), then train on $t \in \{1,\dots,T\}$ and predict for $t = T+1$, and so on.
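For concreteness, the walk-forward loop looks roughly like this (a minimal sketch with synthetic data and a logistic regression as a placeholder model; the feature matrix, window length `T`, and classifier are all assumptions, not my actual setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical data: one feature row and one binary label per day.
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

T = 60  # training-window length (assumption)
scores, labels = [], []
for start in range(len(X) - T):
    train = slice(start, start + T)  # days start .. start+T-1
    test = start + T                 # the following day
    clf = LogisticRegression().fit(X[train], y[train])
    scores.append(clf.predict_proba(X[[test]])[0, 1])
    labels.append(y[test])

scores, labels = np.array(scores), np.array(labels)
# One genuinely out-of-sample score per walked-forward day.
```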
I assessed the calibration of the model using the `predict_proba` scores and the true labels, pooled over all the test sets of the walk-forward procedure, and it does not look great for scores above 0.5.
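The assessment itself was done along these lines, with scikit-learn's `calibration_curve` on the pooled out-of-sample scores (the synthetic scores and labels below are stand-ins, deliberately miscalibrated in the upper range):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(1)
# Hypothetical pooled walk-forward scores and true labels.
scores = rng.uniform(size=2000)
# Labels drawn so the scores are overconfident for high values.
labels = (rng.uniform(size=2000) < scores ** 1.5).astype(int)

# Quantile bins keep the per-bin sample counts roughly equal.
prob_true, prob_pred = calibration_curve(
    labels, scores, n_bins=10, strategy="quantile"
)
for p_hat, p_obs in zip(prob_pred, prob_true):
    print(f"mean score {p_hat:.2f} -> observed frequency {p_obs:.2f}")
```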

Wondering if I could improve it with post-hoc calibration, I used the test sets up to the last-but-tenth as the calibration set, and the remaining 25 sets (out of 210) as test sets to assess the calibration procedure. I got this:
Overall, the log loss decreased:

Spline calibrated log loss = 0.55 ; uncalibrated log loss = 0.71

as did the Brier score:

Spline calibrated Brier score = 0.18 ; uncalibrated Brier score = 0.24
but, visually, it still does not look great... I certainly need more statistics for the higher bins.
In general, is this way of calibrating a model trained in a walk-forward fashion sound, or not? How should one define a calibration set in the case of walk-forward training?
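For reference, the temporal split I used can be sketched like this. Since spline calibration is not in scikit-learn, the sketch uses isotonic regression as a stand-in calibrator; the synthetic scores, the one-prediction-per-test-set simplification, and the clipping constants are all assumptions:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss, log_loss

rng = np.random.default_rng(2)
# Hypothetical pooled walk-forward output: 210 test sets,
# simplified to one score/label each.
scores = rng.uniform(size=210)
labels = (rng.uniform(size=210) < scores ** 1.5).astype(int)

# Temporal split: earlier test sets fit the calibrator,
# the last 25 assess it (no shuffling, order preserved).
n_holdout = 25
s_cal, y_cal = scores[:-n_holdout], labels[:-n_holdout]
s_test, y_test = scores[-n_holdout:], labels[-n_holdout:]

# Isotonic regression as a stand-in for the spline calibrator;
# y_min/y_max keep log loss finite.
iso = IsotonicRegression(out_of_bounds="clip", y_min=1e-6, y_max=1 - 1e-6)
iso.fit(s_cal, y_cal)
p_test = iso.predict(s_test)

print("uncalibrated log loss:", log_loss(y_test, s_test, labels=[0, 1]))
print("calibrated   log loss:", log_loss(y_test, p_test, labels=[0, 1]))
print("uncalibrated Brier:", brier_score_loss(y_test, s_test))
print("calibrated   Brier:", brier_score_loss(y_test, p_test))
```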
EDIT
Plotting the calibration curve as well shows that the model has no clue what to do for high scores:

