
So I understand what the deviance is: the deviance is simply the residual sum of squares. However, what I don't really get is the decomposition of the total sum of squares, that is,
$$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2.$$

I understand the proof of this identity; what I don't really understand is what $\hat{y}_i - \bar{y}$ represents. I know that $y_i - \bar{y}$ is the difference between an observed value of $y$ and the sample mean of the observed values as a whole. I also understand that the distinction between $y_i$ and $\hat{y}_i$ is the distinction between the observed value of $y$ and the value on the line of best fit for the model of $y$. What I don't get is the analogous distinction between $\hat{y}_i$ and $\bar{y}$. My lecturer described $y_i - \bar{y}$ as "overall variability in the data," $y_i - \hat{y}_i$ as "left-over variability," and $\hat{y}_i - \bar{y}$ as "variability explained by our model." But how is this the variability explained by the model? Is it just the difference between the predicted value of $y$ at a particular point and the sample mean?

2 Answers


Here is what those sums of squares actually are for an example:

[Figure: a slide showing the data points, the fitted regression line (green), and the squared deviations visualized as squares: SST in black, SSE in blue, SSR in red.]

The question here is: why should $\color{red}{\text{SSR}} = \sum_{i=1}^n (\hat{y}_i-\bar{y})^2$ be interpreted as the "variability explained by the model"?

Well, the "model" that you assume, or are "tentatively entertaining" here, is a linear relation of the response $Y$ to the independent variable $X$. To be precise, the model is the assumption that your data $(x_i,y_i)$, $i=1,\ldots,n$ are actually observations from the independent Gaussian random variables $$ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\,, i=1,\ldots,n $$ with $\varepsilon_i\sim N(0,\sigma^2)$. This model makes the linear dependence of the response to the independent variable explicit. Whether this model is the true model or not, if it is correct in some sense or not is irrelevant for now. Regardless, if this model is assumed and fitted to the data, then it "explains" the data as the green line in the above slide. The left-over "unexplained" variability are the blue squares $\color{blue}{\text{SSE}}$.

However, if NO model is assumed, then the only way to explain the data is by its overall mean $\bar{y}$. This is equivalent to assuming an empty or vacuous model: $$ Y_i = \bar{Y} + \varepsilon_i'\,, $$ with $\varepsilon_i'\sim N(0,(\sigma')^2)$. Here we make no use of the knowledge we have of $X$. Now the "unexplained" variability is the black squares, $\text{SST}$.

So what we are actually doing here is making the following comparison: how much unexplained variability is left when the linear model is assumed, compared with the unexplained variability when no model (the empty model) is assumed. The difference between the two, $$ \color{red}{\text{SSR}} = \text{SST} - \color{blue}{\text{SSE}}\,, $$ is the variability "explained" by the linear model. It is how much we value the importance of assuming the linear model over not assuming any model at all.
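As a sanity check, here is a minimal numerical sketch (assuming simulated data and an ordinary least-squares fit via numpy.polyfit; the numbers are made up) that computes the three sums of squares and verifies that SST = SSE + SSR:

```python
# Illustrative sketch: verify SST = SSE + SSR on simulated data.
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from the assumed model Y_i = b0 + b1*x_i + eps_i
n = 50
x = np.linspace(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# Fit the linear model by ordinary least squares
b1, b0 = np.polyfit(x, y, deg=1)     # slope, intercept
y_hat = b0 + b1 * x                  # fitted values (the linear model)
y_bar = y.mean()                     # the "empty model" prediction

sst = np.sum((y - y_bar) ** 2)       # unexplained by the empty model
sse = np.sum((y - y_hat) ** 2)       # unexplained by the linear model
ssr = np.sum((y_hat - y_bar) ** 2)   # explained by the linear model

print(f"SST = {sst:.3f}, SSE = {sse:.3f}, SSR = {ssr:.3f}")
print("SST == SSE + SSR (up to rounding):", np.isclose(sst, sse + ssr))
```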

StijnDeVuyst
  • @Tim Could you think about it like this: one of the worst linear regression models you can get is the null model, in which the intercept is equal to the sample mean. The LHS of the equation is then the TOTAL variability between this awful null model and the data points. If we create a better model, like the line of best fit, we still have some variability left, but we have explained a certain amount of it (i.e. the difference between our line of best fit and the null model). Is this an okay interpretation? – Sam Connell May 17 '21 at 12:18
  • @Sam Connell. Yes. That is how you can interpret this. – StijnDeVuyst May 17 '21 at 12:30

The deviance can take various forms, not just the sum of squares. However, if you are using the sum of squares, then the decomposition is similar to the one in the law of total variance:

  • $\sum_i (y_i - \hat{y}_i)^2$ is proportional to the unexplained variance, i.e. how much the observations vary around the predictions. It's known elsewhere as the calibration.
  • $\sum_i (\hat{y}_i - \bar{y})^2$ is proportional to the explained variance, i.e. how much the model's predictions vary around the global mean. In other words, how much does using the model change your predictions over just using the global mean? It's known elsewhere as the refinement.

The size of these parts of the variance, relative to each other, bears on how informative the model is: at the extremes, an explained variance of zero ($R^2 = 0$) means that the model is just predicting $\bar{y}$ everywhere, whereas an unexplained variance of zero ($R^2 = 1$) means that the model's predictions perfectly agree with the observations.
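To make this concrete, here is a minimal sketch (assuming Python/numpy and made-up numbers) that computes $R^2$ both as the explained fraction $\text{SSR}/\text{SST}$ and as $1 - \text{SSE}/\text{SST}$, including the two extreme cases described above:

```python
# Illustrative sketch: R^2 as the explained fraction of the total variability.
import numpy as np

def r_squared(y, y_hat):
    """Return (SSR/SST, 1 - SSE/SST) for observations y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    y_bar = y.mean()
    sst = np.sum((y - y_bar) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    ssr = np.sum((y_hat - y_bar) ** 2)
    return ssr / sst, 1.0 - sse / sst

y = [1.0, 2.0, 3.0, 4.0]

# Predicting the global mean everywhere: no variability explained, R^2 = 0.
print(r_squared(y, [2.5, 2.5, 2.5, 2.5]))   # (0.0, 0.0)

# Perfect predictions: all variability explained, R^2 = 1.
print(r_squared(y, y))                       # (1.0, 1.0)
```

For a least-squares fit with an intercept the two forms coincide, which is exactly the decomposition discussed above.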

  • (Same question as the comment on the previous answer: is the null-model / line-of-best-fit interpretation okay?) – Sam Connell May 17 '21 at 12:18
  • Yeah, that's right. – Accidental Statistician May 17 '21 at 17:56