QUICK
There is an interesting comment about what makes sense as a definition of $SSR$. However, given the definitions used in the code, if $SSR > SST$, then $\overset{N}{\underset{i=1}{\sum}}\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]<0$, since $SST := \overset{N}{\underset{i=1}{\sum}} ( y_i-\bar{y})^2 = SSE + SSR + 2\overset{N}{\underset{i=1}{\sum}}\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]$ and $SSE\ge 0$.
LONGER
Your predicted and observed values might look like they correlate when you look at a plot, but that does not mean they agree. If they agree, they should be about equal and conform to the line $y=\hat y$ (slope of one, intercept of zero). Let's take a look.
library(ggplot2)
d <- data.frame(
yobs = c(29.08,21.8371611111111,41.1785861111111,
60.5846,42.8531777777778,35.6931861111111,15.1174416666667,
10.9228777777778,17.6561777777778,29.2195694444444,
4.48469166666667,24.2387083333333,57.5354805555556,29.4075305555556,
26.7835888888889,28.9258111111111,37.1471972222222,
30.5934277777778,9.22973333333333,57.0615833333333,25.5308722222222,
40.429725,11.9677777777778,24.6323805555556,43.5893833333333,
25.0586194444444,21.5084305555556,28.5317944444444,
17.2729027777778,63.3144833333333,18.7004027777778,15.7129944444444,
15.6565138888889,27.4428777777778,55.2504027777778,
33.6584277777778,10.0764861111111,0.956327777777778,
30.4974416666667,40.2348166666667,12.0094138888889,16.0595388888889,
6.70388888888889,61.6930861111111,45.5002555555556,
34.9412638888889),
ypred = c(37.9778265746194,20.4344267726767,
24.2583278821139,81.3820676947289,35.9664230956281,48.2550410428931,
13.1322244321762,11.2277223100893,17.3847974374533,
36.2654061390013,13.6891124226893,36.93587791295,42.4778772806932,
60.4805857896792,50.8097811774078,31.2983753184525,
39.4901787588643,36.0489111859141,5.16132056902304,67.6280256177873,
46.6873141264554,56.9305336644725,17.1904930898903,
17.8447406631152,81.8167881348895,21.6446504197869,17.2125579607197,
27.8854475743327,25.6223558489715,39.1097052984601,
14.3303635195841,8.3085889213573,14.7616830600331,29.6236752760362,
36.4710794579997,32.1294471109381,21.9208933069802,
8.17174771983545,30.3954470923862,25.2201086957305,13.7007923212405,
16.2708330581924,11.7006605896811,71.8768937208489,
77.2434241984382,30.0205384313346))
ggplot(d, aes(x = ypred, y = yobs)) +
  geom_point() +
  geom_abline(    # identity line: y = yhat
    slope = 1,
    intercept = 0,
    col = 'red'
  )

You're right; that identity line seems to fit the data decently. Now let's do a regression with the true and predicted values and plot the regression line.
# reuse the data frame d defined above
L <- lm(yobs ~ ypred, data = d)
ggplot(d, aes(x = ypred, y = yobs)) +
  geom_point() +
  geom_abline(    # OLS regression line of yobs on ypred
    slope = summary(L)$coef[2, 1],
    intercept = summary(L)$coef[1, 1],
    col = 'blue'
  )

That fit is not amazing, but it does look better to me, particularly when you consider that it is the vertical distance, not the perpendicular distance, that counts. In particular, for the points on the far right of the plot, the blue (regression) line still passes near them within the image, while in the first plot the red (identity) line has run off the chart by that point.
Consequently, the points in your data frame do not really conform to the identity line.
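If you want to quantify that departure, one minimal check (a sketch reusing the `L` regression object fit above; nothing here is specific to your data) is to compare the fitted coefficients with an intercept of zero and a slope of one.

coef(L)     # fitted (intercept, slope); agreement with the identity line would put these near (0, 1)
confint(L)  # approximate 95% intervals for the intercept and slope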
This is why, when you get into complicated situations, how you calculate $R^2$ matters. In the case of OLS linear regressions with an intercept, two common calculations agree.
$$
R^2 =\left(\text{corr}\left(\hat y, y\right)\right)^2\\
R^2=1-\left(\dfrac{
\overset{N}{\underset{i=1}{\sum}}\left(
y_i-\hat y_i
\right)^2
}{
\overset{N}{\underset{i=1}{\sum}}\left(
y_i-\bar y
\right)^2
}\right) = 1-\dfrac{SSE}{SST}
$$
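As a quick sanity check of that equivalence (a sketch that reuses the OLS fit `L` from above), the squared correlation between the fitted values and the observations, the $1 - SSE/SST$ formula applied to that fit, and the $R^2$ reported by `summary()` should all print the same number.

cor(fitted(L), d$yobs)^2                              # squared correlation of fitted vs observed
1 - sum(resid(L)^2) / sum((d$yobs - mean(d$yobs))^2)  # 1 - SSE/SST for the OLS fit
summary(L)$r.squared                                  # R^2 reported by lm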
When you get into more complicated situations, these do not agree.
SST <- sum((mean(d$yobs) - d$yobs)^2)   # total sum of squares
SSE <- sum((d$yobs - d$ypred)^2)        # sum of squared errors of the predictions
SSR <- sum((d$ypred - mean(d$yobs))^2)  # "regression" sum of squares, as defined in the code
cor(d$yobs, d$ypred) # 0.7665833
1 - SSE/SST          # 0.2931399
As you inferred from looking at the plot and thinking that the blue regression line fit the points decently, there is a fairly strong correlation between the predictions and true values: $\approx 0.77$, giving a squared correlation of roughly $0.59$. However, when you calculate according to the equation that divides SSE by SST, you get a much weaker result of $\approx 0.29$, suggesting that your observed and predicted values do not agree to the extent that the plot might at first suggest.
In the extreme, you can get silly results: $y=(1,2,3)$ and $\hat y = (101, 102, 103)$ have perfect correlation yet disagree terribly. I give plots here that have perfect squared correlation between predictions and observations, yet the predictions are terrible. Consequently, I do not believe the correlation between predicted and true values to be a useful measure of model performance (though it might give insight into how to correct your predictions, such as subtracting $100$ from every prediction in the example above, so I would not totally write off $\left(\text{corr}\left(\hat y, y\right)\right)^2$). I would go with $1-SSE/SST$, which gives the example above (and the linked plots) an awful value, below zero, flagging that the predictions do not align with the true values.
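To make that extreme example concrete, here is a minimal sketch (the `_toy` variable names are mine):

y_toy    <- c(1, 2, 3)
yhat_toy <- c(101, 102, 103)
cor(y_toy, yhat_toy)^2                                        # 1: perfect squared correlation
1 - sum((y_toy - yhat_toy)^2) / sum((y_toy - mean(y_toy))^2)  # -14999: flags the terrible predictions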
Overall, there is no issue with your numbers or calculations.
To address what it means for $SSR$ to exceed $SST$, let's look at the decomposition of the total sum of squares, which I have copied from another answer of mine.
$$ y_i-\bar{y} = (y_i - \hat{y_i} + \hat{y_i} - \bar{y}) = (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y}) $$
$$( y_i-\bar{y})^2 = \Big[ (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y}) \Big]^2 =
(y_i - \hat{y_i})^2 + (\hat{y_i} - \bar{y})^2 + 2(y_i - \hat{y_i})(\hat{y_i} - \bar{y})
$$
$$SSTotal := \overset{N}{\underset{i=1}{\sum}} ( y_i-\bar{y})^2 = \underbrace{\overset{N}{\underset{i=1}{\sum}}(y_i - \hat{y_i})^2}_{SSE} + \underbrace{\overset{N}{\underset{i=1}{\sum}}(\hat{y_i} - \bar{y})^2}_{SSR} + \underbrace{2\overset{N}{\underset{i=1}{\sum}}\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]}_{Other}$$
We know that $SSE\ge0$. Thus, if $SSR>SST$, then we must have $Other < 0$ for the two sides of the equation to be equal. The code below confirms what the algebra says must happen: $Other$ is negative, and $SST$ equals the sum of $SSE$, $SSR$, and the $Other$ term.
Other <- 2 * sum((d$yobs - d$ypred) * (d$ypred - mean(d$yobs)))
Other             # -15576.2
SSE + SSR + Other # 11600.41
SST               # 11600.41