
In a regression problem, it is common to break the total sum of squares (which is related to the total variance of $y$) down into components.

$$ y_i-\bar{y} = (y_i - \hat{y_i} + \hat{y_i} - \bar{y}) = (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y}) $$

$$( y_i-\bar{y})^2 = \Big[ (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y}) \Big]^2 = (y_i - \hat{y_i})^2 + (\hat{y_i} - \bar{y})^2 + 2(y_i - \hat{y_i})(\hat{y_i} - \bar{y}) $$

$$SSTotal := \sum_i ( y_i-\bar{y})^2 = \sum_i(y_i - \hat{y_i})^2 + \sum_i(\hat{y_i} - \bar{y})^2 + 2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]$$

In a linear regression (with an intercept) whose $\hat y_i$ are estimated by ordinary least squares, $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]=0$. Therefore, in such a setting, it is common to refer to $\sum_i(y_i - \hat{y_i})^2$ as the residual or unexplained sum of squares, with the remaining $\sum_i(\hat{y_i} - \bar{y})^2$ considered the "explained" sum of squares. This leads to the interpretation of $R^2$ as the proportion of variance explained.
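As a quick sanity check of this decomposition for an OLS fit with an intercept (a minimal sketch in the same style as the examples at the end):

# Check that SSTotal = SSExplained + SSUnexplained for OLS with an intercept
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
y <- rnorm(N, 2 - x, 1)
fit <- lm(y ~ x)                      # OLS with an intercept
yhat <- fitted(fit)
SSTotal       <- sum((y - mean(y))^2)
SSUnexplained <- sum((y - yhat)^2)
SSExplained   <- sum((yhat - mean(y))^2)
cross         <- 2 * sum((y - yhat) * (yhat - mean(y)))
c(SSTotal, SSUnexplained + SSExplained, cross)  # first two agree; cross term ~ 0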

$$
\begin{aligned}
\text{Proportion of variance explained} &= \dfrac{\sum_i(\hat{y_i} - \bar{y})^2/N}{\sum_i ( y_i-\bar{y})^2/N}\\
&= \dfrac{SSExplained}{SSTotal}\\
&= \dfrac{SSTotal - SSUnexplained - 2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]}{SSTotal}\\
&= \dfrac{SSTotal - SSUnexplained}{SSTotal} \quad \big(\star\big)\\
&= 1 - \dfrac{SSUnexplained}{SSTotal}\\
&= 1 - \dfrac{ \sum_i\left( y_i-\hat y_i \right)^2 }{ \sum_i\left( y_i-\bar y \right)^2 } = R^2
\end{aligned}
$$

However, step $\left(\star\right)$ relied on $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]=0$, which is not true for every regression model. It need not hold for regularized linear regression, for instance, and I give a few other examples at the end.
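As one concrete illustration, here is a hand-rolled ridge-type fit (written just for this illustration, with no particular package assumed and the intercept left unpenalized) in which the residuals still sum to zero, yet the cross term does not vanish:

# Ridge-type estimator fit "by hand": minimize the sum of squared residuals
# plus lambda times the squared slope, leaving the intercept unpenalized
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
y <- rnorm(N, 2 - x, 1)
Z <- cbind(1, x)                    # design matrix with an intercept column
lambda <- 25
D <- diag(c(0, 1))                  # penalize the slope only, not the intercept
b <- solve(t(Z) %*% Z + lambda * D, t(Z) %*% y)
yhat <- drop(Z %*% b)
sum(y - yhat)                       # ~0: residuals still sum to zero
sum((y - yhat) * (yhat - mean(y)))  # nonzero: the cross term does not vanish

In this setup the first-order condition for the slope gives $\sum_i (y_i - \hat y_i) x_i = \lambda \hat\beta_1$, so the cross term equals $\lambda \hat\beta_1^2$ exactly and is strictly positive whenever the penalized slope is nonzero.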

A reasonable interpretation of this $R^2$ is as a measure of how much the square loss changes. Start by defining $\frac{\sum_i\left( y_i-\bar y \right)^2 - \sum_i\left( y_i-\hat y_i \right)^2 }{ \sum_i\left( y_i-\bar y \right)^2 }= 1 - \left(\frac{ \sum_i\left( y_i-\hat y_i \right)^2 }{ \sum_i\left( y_i-\bar y \right)^2 }\right)$ as the value of interest. The $\frac{\sum_i\left( y_i-\bar y \right)^2 - \sum_i\left( y_i-\hat y_i \right)^2 }{ \sum_i\left( y_i-\bar y \right)^2 }$ expression is of the form $\frac{\text{Starting Value} - \text{Ending Value}}{\text{Starting Value}}$. If we started with $\$200$ and wound up with $\$150$, we would say there was a $25\%$ reduction in the money. Thus, I say that if we start with $SSTotal = 200$ and wind up with $SSUnexplained = 150$, there was a $25\%$ reduction in the sum of squares.
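In symbols, writing $L(m) = \sum_i (y_i - m_i)^2$ for the in-sample square loss of predictions $m$ (a notation introduced only for this paragraph), this reads as a proportional reduction in loss relative to the baseline that always predicts $\bar y$:

$$ R^2 = \frac{L(\bar y) - L(\hat y)}{L(\bar y)} = 1 - \frac{\sum_i\left( y_i-\hat y_i \right)^2}{\sum_i\left( y_i-\bar y \right)^2}, $$

and the money example is just $L(\bar y) = 200$ and $L(\hat y) = 150$, giving $(200 - 150)/200 = 25\%$.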

Treating $\sum_i\left( y_i-\hat y_i \right)^2$ as the residual or unexplained sum of squares makes total sense to me: it is what the regression model misses or "cannot explain," so whatever is left over, namely $\sum_i(\hat{y_i} - \bar{y})^2 + 2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]$, should be what is explained. Yet if $\sum_i(\hat{y_i} - \bar{y})^2$ is the explained sum of squares and $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]\ne 0$, then the total sum of squares equals something other than the sum of the explained and unexplained sums of squares.

This leaves me unsure what to make of $\sum_i(\hat{y_i} - \bar{y})^2$ on its own. When $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big] = 0$, I get it; when $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]\ne 0$, I am stuck. Does $\sum_i(\hat{y_i} - \bar{y})^2$ always have an interpretation as an "explained" sum of squares? If so, what do we make of the $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]$ term, given that the $R^2$ derivation above sure seems to describe how much of the total variance is explained (the amount by which the variance is reduced) without relying on $2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]=0$, only on $\frac{\sum_i\left( y_i-\bar y \right)^2 - \sum_i\left( y_i-\hat y_i \right)^2 }{ \sum_i\left( y_i-\bar y \right)^2 }= 1 - \left(\frac{ \sum_i\left( y_i-\hat y_i \right)^2 }{ \sum_i\left( y_i-\bar y \right)^2 }\right)=:R^2\,?$

EXAMPLES

Below, I give some examples where $\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]\ne 0$. There are a few GLMs (all with intercepts) estimated by the usual maximum-likelihood techniques, a linear model (with an intercept) whose coefficients are estimated by a technique other than OLS, and a support vector regression. In every case, $\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]$ is far enough from zero to convince me that it is not just a floating-point arithmetic issue, especially given how close to zero it is in the OLS linear regression case at the end.

# Log-link in a GLM that minimizes the sum of squared residuals
#
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
Ey <- exp(4 - x)
y <- rnorm(N, Ey, 1)
G <- glm(y ~ x, family = gaussian(link = "log"))
yhat <- predict(G, type = "response")
sum((y - yhat) * (yhat - mean(y))) 
#
# I get 20.45738

# Logistic regression
#
# The sum of squared residuals is related to the Brier score and is legitimate
# to calculate for a logistic regression.
#
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
Ey <- 1/(1 + exp(2 - x))
y <- rbinom(N, 1, Ey)
G <- glm(y ~ x, family = binomial)
yhat <- predict(G, type = "response")
sum((y - yhat) * (yhat - mean(y)))
#
# I get 0.02225644, so not too far off from zero, but far enough that I
# have trouble attributing it to floating point errors

# Minimize the sum of absolute residuals instead of the sum of squared residuals
# (a Laplace-likelihood linear model, or just some other extremum estimator of
# the linear regression coefficients, as opposed to OLS estimation)
#
library(quantreg)
library(VGAM)
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
Ey <- 2 - x
y <- VGAM::rlaplace(N, Ey, 1)
G <- quantreg::rq(y ~ x, tau = 0.5)
yhat <- predict(G, type = "response")
sum((y - yhat) * (yhat - mean(y)))
#
# I get +4.027605

# Support vector regression
#
library(e1071)
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
Ey <- 2 - x
y <- VGAM::rlaplace(N, Ey, 1)
G <- e1071::svm(y ~ x)
yhat <- predict(G, type = "response")
sum((y - yhat) * (yhat - mean(y)))
#
# I get +7.01538

# But an OLS-estimated linear model gives that sum as basically zero
#
set.seed(1)
N <- 100
x <- rbeta(N, 1, 1)
Ey <- 2 - x
y <- rnorm(N, Ey, 1)
L <- lm(y ~ x)
yhat <- predict(L, type = "response")
sum((y - yhat) * (yhat - mean(y)))
#
# I get -1.850733e-15

Dave
  • I find it appalling how many machine learnists do not account for intercepts appropriately in their models. If we focus on penalized regression, it makes little sense to apply the same penalty to an intercept as if it were a covariate term. In fact, many didactic texts advise analysts to center the response and covariates, which ensures orthogonality of the cross-product term in the variance decomposition. But... maybe this is overlooked. – AdamO Nov 01 '23 at 22:18
  • @AdamO I’m not seeing the relevance of your comment. Is your point that the regularized models do have $SSTotal = SSUnexplained + SSExplained?$ – Dave Nov 01 '23 at 22:24
  • As you correctly point out, the interpretation hinges on the $(y_i-\hat y_i)(\hat y_i - \bar y)$ terms summing to 0. If the intercept term is a true free parameter, then you will guarantee that $\hat y$ averages to $\bar y$ each time and the problem is solved. The question you reference is deeply linked. The only scenario I can think of is a regularization that drops the intercept for some reason. You may have to provide an example if this isn't the single case where this matters. – AdamO Nov 01 '23 at 22:55
  • @AdamO I have posted a few examples. – Dave Nov 02 '23 at 00:17
  • your GLM example is wrong. Inspect the default output of prediction when link=log in a Gaussian GLM. It's on the linear predictor scale!! – AdamO Nov 02 '23 at 00:22
  • Anyway, consider in the case of model misspecification that there is now a bias component. So $(Y-\bar Y)^2 = \text{Variance} + \text{Bias}^2$. And the variance still has the desired decomposition. – AdamO Nov 02 '23 at 00:29
  • @AdamO Even with those errors (I think) corrected, there is a definite lack of $\sum_i\left(y_i -\hat y_i\right)\left(\hat y_i - \bar y\right) = 0$. Maybe the logistic regression can be attributed to floating point errors, but the others are quite clearly nonzero, showing that $\sum_i\left(y_i -\hat y_i\right)\left(\hat y_i - \bar y\right) \ne 0$ in general. $//$ I really have no idea what you're trying to show with the bias-variance decomposition but am eager to learn. – Dave Nov 02 '23 at 00:49

1 Answer


This answer is about the OLS estimator in the classical linear regression model and might be obsolete after the OP's edit. However, I think it could be a useful starting point for further analysis.


Let $P$ denote the orthogonal projector onto the column space of the design matrix $X \in \mathbb R^{n \times k}$, i.e. $P = X\left(X^\top X\right)^{-1}X^\top$. Note that $P$ is symmetric and idempotent.

We have $$ \sum_{i=1}^n(y_i-\hat y_i) \cdot \hat y_i = (y - P y)^\top(P y) = y^\top P y - y^\top P^\top Py = y^\top P y - y^\top P y = 0. $$ We also have $$ \sum_{i=1}^n(y_i-\hat y_i) = (y - P y)^\top \mathbf 1_n = y^\top \mathbf 1_n - y^\top P^\top \mathbf 1_n = y^\top \mathbf 1_n - y^\top (P \mathbf 1_n). $$ If $\mathbf 1_n$ is in the column space of $X$, e.g. when the linear regression model contains an intercept, we have $P \mathbf 1_n = \mathbf 1_n$ and thus $\sum_{i=1}^n(y_i-\hat y_i) = 0$.

Therefore, $$ \sum_{i=1}^n (y_i - \hat y_i) \cdot (\hat y_i - \bar y) = \sum_{i=1}^n (y_i - \hat y_i) \cdot \hat y_i - \bar y \cdot \sum_{i=1}^n (y_i - \hat y_i) = 0 $$ holds if $\mathbf 1_n$ is in the column space of $X$, which I think is the rule rather than the exception.
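A small numerical check of these identities (a sketch with simulated data and arbitrary variable names):

# Numerical check of the projection identities above
set.seed(1)
n <- 100
X <- cbind(1, rbeta(n, 1, 1))             # design matrix whose first column is the intercept
y <- rnorm(n, 2 - X[, 2], 1)
P <- X %*% solve(t(X) %*% X) %*% t(X)     # orthogonal projector onto the column space of X
yhat <- drop(P %*% y)
max(abs(P %*% rep(1, n) - 1))             # ~0: P 1_n = 1_n because the intercept is in col(X)
sum((y - yhat) * yhat)                    # ~0: residuals are orthogonal to the fitted values
sum((y - yhat) * (yhat - mean(y)))        # ~0: hence the cross term vanishes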

statmerkur