In a regression problem, it is common to break down the total sum of squares (related to the total variance in $y$) into components.
$$ y_i-\bar{y} = (y_i - \hat{y_i} + \hat{y_i} - \bar{y}) = (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y}) $$
$$( y_i-\bar{y})^2 = \Big[ (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y}) \Big]^2 = (y_i - \hat{y_i})^2 + (\hat{y_i} - \bar{y})^2 + 2(y_i - \hat{y_i})(\hat{y_i} - \bar{y}) $$
$$SSTotal := \sum_i ( y_i-\bar{y})^2 = \sum_i(y_i - \hat{y_i})^2 + \sum_i(\hat{y_i} - \bar{y})^2 + 2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]$$
In a linear regression with an intercept and with the $\hat y_i$ estimated by ordinary least squares, $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]=0$. Therefore, in such a setting, it is common to refer to $\sum_i(y_i - \hat{y_i})^2$ as the residual or unexplained sum of squares, with the remaining $\sum_i(\hat{y_i} - \bar{y})^2$ considered the "explained" sum of squares. This leads to the interpretation of $R^2$ as the proportion of variance explained.
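To spell out why that cross term vanishes for an OLS fit with an intercept (the normal equations force the residuals to sum to zero and to be orthogonal to the fitted values):

$$\sum_i (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) = \underbrace{\sum_i (y_i - \hat{y_i})\,\hat{y_i}}_{=\,0} - \bar{y}\underbrace{\sum_i (y_i - \hat{y_i})}_{=\,0} = 0$$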
$$\begin{aligned}
\text{Proportion of variance explained} &= \dfrac{\sum_i(\hat{y_i} - \bar{y})^2 / N}{\sum_i ( y_i-\bar{y})^2 / N} \\
&= \dfrac{SSExplained}{SSTotal} \\
&= \dfrac{SSTotal - SSUnexplained - 2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]}{SSTotal} \\
&= \dfrac{SSTotal - SSUnexplained}{SSTotal} \quad \big(\star\big) \\
&= 1 - \dfrac{SSUnexplained}{SSTotal} \\
&= 1 - \dfrac{ \sum_i\left( y_i-\hat y_i \right)^2 }{ \sum_i\left( y_i-\bar y \right)^2 } = R^2
\end{aligned}$$
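As a quick numerical check of this bookkeeping, here is a minimal sketch (simulated data of my own, separate from the examples at the end) confirming that an OLS fit with an intercept satisfies $SSTotal = SSExplained + SSUnexplained$ and reproduces the usual $R^2$:

# Sketch: OLS with an intercept satisfies the clean decomposition
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
y <- rnorm(N, 2 - x, 1)
L <- lm(y ~ x)
yhat <- fitted(L)
ss_total       <- sum((y - mean(y))^2)
ss_unexplained <- sum((y - yhat)^2)
ss_explained   <- sum((yhat - mean(y))^2)
ss_total - (ss_explained + ss_unexplained)  # ~0, up to floating point
1 - ss_unexplained / ss_total               # same as summary(L)$r.squared
summary(L)$r.squared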
However, step $\left(\star\right)$ relied on $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]=0$, which is not true for every regression model. This need not hold for regularized linear regression estimation, for instance, and I give a few other examples at the end.
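For instance, here is a minimal ridge-regression sketch (using MASS::lm.ridge with an arbitrary penalty chosen purely for illustration, and assuming coef() returns the intercept and slope on the original data scale) where the cross term comes out nonzero:

# Sketch: ridge-penalized linear regression with an intercept
library(MASS)
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
y <- rnorm(N, 2 - x, 1)
ridge_fit <- MASS::lm.ridge(y ~ x, lambda = 10)  # lambda chosen arbitrarily for illustration
yhat <- drop(cbind(1, x) %*% coef(ridge_fit))    # intercept and slope back on the original scale
sum((y - yhat) * (yhat - mean(y)))               # generally nonzero for the shrunken fit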
A reasonable interpretation of this $R^2$ is as a measure of how much the square loss decreases. Start by defining $\frac{\sum_i\left( y_i-\bar y \right)^2 - \sum_i\left( y_i-\hat y_i \right)^2 }{ \sum_i\left( y_i-\bar y \right)^2 }= 1 - \left(\frac{ \sum_i\left( y_i-\hat y_i \right)^2 }{ \sum_i\left( y_i-\bar y \right)^2 }\right)$ as the quantity of interest. The expression $\frac{\sum_i\left( y_i-\bar y \right)^2 - \sum_i\left( y_i-\hat y_i \right)^2 }{ \sum_i\left( y_i-\bar y \right)^2 }$ has the form $\frac{\text{Starting Value} - \text{Ending Value}}{\text{Starting Value}}$. If we started with $\$200$ and wound up with $\$150$, we would say there was a $25\%$ reduction in the amount of money. By the same logic, I say that if we start with $SSTotal = 200$ and wind up with $SSUnexplained = 150$, there was a $25\%$ reduction in the sum of squares.
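A sketch of that proportional-reduction arithmetic, using the same made-up numbers as the money analogy:

# Proportional reduction in square loss: (start - end) / start
ss_total       <- 200  # squared loss when predicting with the mean
ss_unexplained <- 150  # squared loss when predicting with the model
(ss_total - ss_unexplained) / ss_total  # 0.25, a 25% reduction
1 - ss_unexplained / ss_total           # the same number, written in the R^2 form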
Interpreting $\sum_i\left( y_i-\hat y_i \right)^2$ as the residual or unexplained sum of squares makes total sense to me. This is what the regression model misses or "cannot explain," so whatever is left over, $\sum_i(\hat{y_i} - \bar{y})^2 + 2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]$, should be what is explained. Yet if $\sum_i(\hat{y_i} - \bar{y})^2$ is the explained sum of squares and $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]\ne 0$, then the total sum of squares equals something other than the sum of the explained and unexplained sums of squares.
This leaves me unsure about what to make of $\sum_i(\hat{y_i} - \bar{y})^2$ on its own. When $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big] = 0$, I get it. When $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]\ne 0$, I am stuck. Does $\sum_i(\hat{y_i} - \bar{y})^2$ always have an interpretation as an "explained" sum of squares? If so, what do we make of the $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]$ term, given that the $R^2$ derivation above sure seems like a description of how much of the total variance is explained (the amount by which the variance is reduced) without relying on $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]=0$, only on $\frac{\sum_i\left( y_i-\bar y \right)^2 - \sum_i\left( y_i-\hat y_i \right)^2 }{ \sum_i\left( y_i-\bar y \right)^2 }= 1 - \left(\frac{ \sum_i\left( y_i-\hat y_i \right)^2 }{ \sum_i\left( y_i-\bar y \right)^2 }\right)=:R^2$?
EXAMPLES
Below, I give some examples where $\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]\ne 0$. There are a few GLMs (all with intercepts) estimated through the usual maximum likelihood techniques, a linear model (with an intercept) whose coefficients are estimated using a technique other than OLS, and a support vector regression. Each time, $\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]$ is different enough from zero to convince me that it is not just a floating-point arithmetic issue, especially given how close to zero $\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]$ is in the OLS linear regression case at the end.
# Log-link in a GLM that minimizes the sum of squared residuals
#
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
Ey <- exp(4 - x)
y <- rnorm(N, Ey, 1)
G <- glm(y ~ x, family = gaussian(link = "log"))
yhat <- predict(G, type = "response")
sum((y - yhat) * (yhat - mean(y)))
#
# I get 20.45738
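To make the bookkeeping concrete, here is a short continuation (reusing y and yhat from the log-link GLM above) showing that the algebraic identity always holds, but the two candidate "proportions explained" disagree once the cross term is nonzero:

# Continuing with y and yhat from the log-link GLM above
ss_total       <- sum((y - mean(y))^2)
ss_unexplained <- sum((y - yhat)^2)
ss_explained   <- sum((yhat - mean(y))^2)
cross          <- sum((y - yhat) * (yhat - mean(y)))
ss_total - (ss_unexplained + ss_explained + 2 * cross)  # ~0: the identity always holds
1 - ss_unexplained / ss_total                           # R^2 as proportional reduction in square loss
ss_explained / ss_total                                 # differs whenever the cross term is nonzero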
# Logistic regression
#
# Sum of squared residuals is related to the Brier score and is legitimate
# to calculate for a logistic regression
#
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
Ey <- 1/(1 + exp(2 - x))
y <- rbinom(N, 1, Ey)
G <- glm(y ~ x, family = binomial)
yhat <- predict(G, type = "response")
sum((y - yhat) * (yhat - mean(y)))
#
# I get 0.02225644, so not too far off from zero, but far enough that I
# have trouble attributing it to floating-point error

# Minimize the sum of absolute residuals instead of the sum of squared residuals
# (a Laplace-likelihood linear model, or just some other extremum estimator
# of the linear regression coefficients, as opposed to OLS estimation)
#
library(quantreg)
library(VGAM)
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
Ey <- 2 - x
y <- VGAM::rlaplace(N, Ey, 1)
G <- quantreg::rq(y ~ x, tau = 0.5)
yhat <- predict(G, type = "response")
sum((y - yhat) * (yhat - mean(y)))
#
# I get +4.027605

# Support vector regression
#
library(e1071)
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
Ey <- 2 - x
y <- VGAM::rlaplace(N, Ey, 1)
G <- e1071::svm(y ~ x)
yhat <- predict(G, type = "response")
sum((y - yhat) * (yhat - mean(y)))
#
# I get +7.01538

# But an OLS-estimated linear model gives that sum as basically zero
#
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
Ey <- 2 - x
y <- rnorm(N, Ey, 1)
L <- lm(y ~ x)
yhat <- predict(L, type = "response")
sum((y - yhat) * (yhat - mean(y)))
#
# I get -1.850733e-15
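For completeness, a short continuation (reusing y and yhat from the OLS fit above) showing the two orthogonality facts that force the cross term to zero in this last case:

# Continuing with y and yhat from the OLS fit above
sum(y - yhat)           # residuals sum to zero because an intercept is included
sum((y - yhat) * yhat)  # residuals are orthogonal to the fitted values
# The cross term equals sum((y - yhat) * yhat) - mean(y) * sum(y - yhat), hence ~0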