Brier score decomposition relies on a foundational result in statistics called the partition of sums of squares. The basic idea is that given a sum of squares over any series, we can break that series into arbitrary groups, calculate the sum of squares within each group, calculate the sum of squares between the groups using only the group means, and the total sum of squares will equal the sum of the within-group sums of squares plus the between-group sum of squares.
$$ \text{Total SS} = \text{Within-Group SS} + \text{Between-Group SS} \tag{1} $$
Wikipedia has a proof of this theorem. There's a neat little cancellation of the cross term that makes everything work out, and that cancellation only occurs for sums of squares in particular. Conceptually, it's related to the fact that when we add independent random variables together, the variance of the sum is equal to the sum of the variances.
Although this theorem is often applied to variance and discussed in the context of linear regression, the theorem itself is simple algebra and can be used in a wide variety of contexts. We also have total freedom in how we break the series into groups. For example, for a clustering algorithm I might want to talk about "Within Group Variance" and "Between Group Variance". For ANOVA I might want to talk about the "proportion of variance explained." In each case, we simply decompose the total sum of squares into a within-group contribution and a between-group contribution and assign an interpretation to each part. The trick is to break the series into meaningful groups so that a meaningful interpretation can be attached to each one.
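To make the algebra concrete, here's a minimal sketch in plain NumPy (hypothetical data, not from the original discussion) that checks the identity in equation (1): the total sum of squares around the grand mean equals the within-group sum of squares plus the between-group sum of squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical series split into three arbitrary groups.
groups = [rng.normal(loc=mu, scale=1.0, size=50) for mu in (0.0, 1.0, 3.0)]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total sum of squares around the grand mean.
total_ss = ((all_values - grand_mean) ** 2).sum()

# Within-group SS: deviations of each value from its own group's mean.
within_ss = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Between-group SS: group means vs. the grand mean, weighted by group size.
between_ss = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# The cross term cancels, so the identity holds up to floating point error.
assert np.isclose(total_ss, within_ss + between_ss)
```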
For Brier scores, it's assumed that we've broken our forecasts into groups so that we only have a small number of distinct forecasts. In practice this is almost always done with a bucketing strategy. For example, I might divide my forecasts into 5 groups: everything between 0 and 0.2 is treated as 0.1, everything between 0.2 and 0.4 is treated as 0.3, and so on. The number of buckets you can get away with depends on how much data you have, but in practice 5 is usually OK when sample sizes are small, and 10 is usually more than enough to calculate the calibration metrics. You can see scikit-learn's calibration_curve() for a practical example.

In fact, one interpretation of the calibration/reliability component of a Brier score decomposition is that it's simply the MSE of the calibration curve, weighted by the number of forecasts in each bucket.
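Here's a sketch of that interpretation (Python with hypothetical, deliberately miscalibrated forecasts; the bucket counts are recomputed with np.histogram since calibration_curve() doesn't return them, assuming uniform bins):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Hypothetical forecasts and outcomes: true probabilities plus some miscalibration.
p_true = rng.uniform(0, 1, size=2000)
y = rng.binomial(1, p_true)                     # observed 0/1 outcomes
forecasts = np.clip(p_true * 0.8 + 0.1, 0, 1)   # deliberately squeezed toward 0.5

n_bins = 5
prob_true, prob_pred = calibration_curve(y, forecasts, n_bins=n_bins, strategy="uniform")

# Bucket counts, assuming the same uniform bins calibration_curve uses.
counts, _ = np.histogram(forecasts, bins=np.linspace(0, 1, n_bins + 1))
counts = counts[counts > 0]  # calibration_curve drops empty buckets

# Reliability = count-weighted MSE between the calibration curve and the diagonal.
reliability = np.sum(counts * (prob_pred - prob_true) ** 2) / counts.sum()
print(f"reliability (calibration error): {reliability:.4f}")
```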
Once you've divided your forecasts up into a finite number of buckets, you can calculate the Brier score and its components:
$$ \underbrace{\frac{1}{N}\sum\limits_{t=1}^{N}(f_{t}-o_{t})^{2}}_{\text{Brier Score}}
= \underbrace{\frac{1}{N}\sum\limits_{k=1}^{K}n_{k}(f_{k}-\bar{o}_{k})^{2}}_{\text{Calibration/Reliability}}
\underbrace{
{} - \underbrace{\frac{1}{N}\sum\limits_{k=1}^{K}n_{k}(\bar{o}_{k}-\bar{o})^{2}}_{\text{Resolution}}
+ \underbrace{\bar{o}\left(1-\bar{o}\right)}_{\text{Uncertainty}}
}_{\text{Refinement}} \tag{2} $$
Here $N$ is the number of forecasts, $o_t \in \{0, 1\}$ is the outcome of forecast $t$, $K$ is the number of buckets, $n_k$ is the number of forecasts in bucket $k$, $f_k$ is the forecast value assigned to bucket $k$, $\bar{o}_k$ is the observed frequency of positives in bucket $k$, and $\bar{o}$ is the overall prevalence.
We start by breaking the Brier score into two components, calibration and refinement:
- The first term, called calibration, calibration error, or reliability, indicates how close the assigned probabilities are to the observed frequency of positives in each bucket. For example, for a bucket where the forecasted probability is 10%, we would want about 10% of the examples in that bucket to be positives. If 50% are positive, or only 2%, then our predictions were poorly calibrated.
- The second term, refinement, indicates how good the predictions are after controlling for poor calibration. In this way it's similar to AUC, in that a model with high predictive power will still get a low (lower is better) refinement score regardless of how poorly calibrated it is.
Sometimes it's useful to further decompose the refinement into two separate terms: resolution and uncertainty.
- Resolution indicates the forecaster's ability to go out on a limb and make non-trivial predictions. If every forecast is exactly equal to the prevalence, meaning the model is basically just spitting out the same constant value for every example, then resolution will be low. Only if the forecaster actually tries to adjust their forecast up or down based on knowledge of each case will they get a good resolution score.
- Uncertainty is the only term which does not include the forecast probability at all. It represents the inherent uncertainty of the actual outcomes. If outcomes are balanced and the overall prevalence is 50%, then uncertainty will be at its maximum of 0.25. If there is an imbalance between positives and negatives, uncertainty will be lower. For example, if only 10% of outcomes are positives, then uncertainty will be $0.1 (1 - 0.1) = 0.09$. You might notice this is exactly the same formula as the variance of a Bernoulli trial.
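Putting the pieces together, here's a minimal sketch in plain NumPy (hypothetical data; `brier_decomposition` is a helper introduced here for illustration, not a library function) that buckets forecasts, computes each term in equation (2), and checks that reliability minus resolution plus uncertainty adds back up to the Brier score.

```python
import numpy as np

def brier_decomposition(forecasts, outcomes, n_buckets=10):
    """Decompose the binary Brier score as in equation (2).

    Forecasts are bucketed uniformly and each forecast is replaced by its
    bucket's mean forecast, so brier == reliability - resolution + uncertainty
    holds exactly (up to floating point error).
    """
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    N = len(forecasts)

    o_bar = outcomes.mean()                      # overall prevalence
    uncertainty = o_bar * (1.0 - o_bar)          # variance of a Bernoulli trial

    # Assign each forecast to one of n_buckets uniform buckets on [0, 1].
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    bucket = np.clip(np.digitize(forecasts, edges[1:-1]), 0, n_buckets - 1)

    bucketed = np.empty_like(forecasts)
    reliability = resolution = 0.0
    for k in range(n_buckets):
        mask = bucket == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        f_k = forecasts[mask].mean()             # forecast value for this bucket
        o_k = outcomes[mask].mean()              # observed frequency in this bucket
        bucketed[mask] = f_k
        reliability += n_k * (f_k - o_k) ** 2
        resolution += n_k * (o_k - o_bar) ** 2
    reliability /= N
    resolution /= N

    brier = np.mean((bucketed - outcomes) ** 2)
    return brier, reliability, resolution, uncertainty


# Hypothetical forecasts/outcomes, same flavor as the calibration_curve sketch above.
rng = np.random.default_rng(0)
p_true = rng.uniform(0, 1, size=2000)
y = rng.binomial(1, p_true)
f = np.clip(p_true * 0.8 + 0.1, 0, 1)

bs, rel, res, unc = brier_decomposition(f, y, n_buckets=10)
print(f"Brier={bs:.4f}  reliability={rel:.4f}  resolution={res:.4f}  uncertainty={unc:.4f}")
assert np.isclose(bs, rel - res + unc)
```

Note that the identity is exact here because each forecast has been replaced by its bucket's value; with raw, un-bucketed forecasts the decomposition only holds approximately.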
The point of Brier score decomposition is to diagnose some of the common ways that a model or a forecaster can go wrong. If they're just straight-up terrible, they'll get a bad (high) Brier score overall. If they're pretty good overall but tend to be too conservative or aggressive, or if they incorrectly believe events that only happen 95% of the time are a "sure thing" or that events that happen 5% of the time "can never happen," or if they never put out a score less than 10% because "you never know" even though the actual rate for the bottom bucket is 1%, then they'll be poorly calibrated. If they play it safe and always make a forecast exactly at or near the overall prevalence, then they'll get a poor resolution score.
If you have a model which you believe to be poorly calibrated, you can sometimes do a post hoc correction to recalibrate the output. This is useful for models like neural networks, whose outputs aren't guaranteed to be well-calibrated probabilities, or models like naïve Bayes, which are known to push probabilities too far toward 0 or 1 when their independence assumptions are violated. You should be careful with this though, as you could simply be papering over a fundamental mismatch between your model and the actual data.
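As a sketch of what that post hoc correction can look like, here's one way to recalibrate a naïve Bayes model with scikit-learn's CalibratedClassifierCV (the synthetic dataset and parameters are purely illustrative assumptions, not from the original discussion):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical data with redundant (correlated) features, which violates naive
# Bayes's independence assumption and tends to make its probabilities too extreme.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=4,
                           n_redundant=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Uncalibrated naive Bayes baseline.
raw = GaussianNB().fit(X_train, y_train)

# Post hoc recalibration: learn an isotonic mapping from raw scores to
# probabilities via cross-validation on the training data.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

print("Brier score, raw:         ", brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1]))
print("Brier score, recalibrated:", brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1]))
```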