There is a statement that got my attention recently, which is "ANOVA is just linear regression".

I was watching this video that seemed to explain the relationship between the two topics.

At one point the teacher explains the breakdown of the total sum of squares into $\text{SSE}_{Reg}$ (i.e. the variability explained by the model) and $\text{SSE}_{Res}$ (i.e. the residual, unexplained variability), and states that both quantities follow a chi-squared distribution, the former with $1$ degree of freedom and the latter with $n-2$. I want to fully understand why.
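For context, the decomposition as I understand it from the video is the following (the explicit scaling by $\sigma^2$ is my own addition, since that is what I take "follows a chi-squared distribution" to mean precisely):

$$\underbrace{\sum_{i=1}^n (y_i - \bar{y})^2}_{\text{Total}} = \underbrace{\sum_{i=1}^n (\hat{y_i} - \bar{y})^2}_{\text{SSE}_{Reg}} + \underbrace{\sum_{i=1}^n (y_i - \hat{y_i})^2}_{\text{SSE}_{Res}},$$

and the claim is that, under $H_0: b_1 = 0$ with normal errors, $\text{SSE}_{Reg}/\sigma^2 \sim \chi^2_{1}$ and $\text{SSE}_{Res}/\sigma^2 \sim \chi^2_{n-2}$.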

I have looked at other resources, and everyone states the degrees of freedom, but nobody really explains why. I have some ideas, but I am not really sure and I would like to validate them here.

So, using the same notation in the video:

  • $\hat{y_i}$ is the model prediction for observation $i$, given by $\hat{y_i} = b_0 + b_1x_i$
  • $\bar{y}$ is the sample mean of the observed responses
  • $y_i$ is the $i$-th observed data point

Now:

  • $\text{SSE}_{Reg}$

    • Formula: $\sum_{i=1}^n (\hat{y_i} - \bar{y})^2$
    • DFs: $1$
    • Reasoning: In this case $\bar{y}$ is a constant, so it doesn't count towards the degrees of freedom. The only varying quantity is $\hat{y_i}$, which in turn comes from $\hat{y_i} = b_0 + b_1x_i$.
      Does this expression have $1$ degree of freedom because we are assuming the null hypothesis $H_0: b_1 = 0$? That is, under the null, $\hat{y_i} = b_0$, and $b_0$ is the single degree of freedom everyone is talking about?
  • $\text{SSE}_{Res}$

    • Formula: $\sum_{i=1}^n (\hat{y_i} - {y_i})^2$
    • DFs: $n-2$
    • Reasoning: In this case we have all the $y_i$, which can vary, giving a total of $n$. Is it correct to state that, in general, if a quantity is computed from $p$ variables subject to one constraint, only $p-1$ of them are actually free? So here I would get $n - 1$? Still, that is not $n-2$...
      On the other hand we have $\hat{y_i}$, which had $1$ degree of freedom (for what we said above). How do we get to $n-2$? Is it because, once $\hat{y_i}$ is estimated, its degree of freedom is subtracted, getting to $n-2$? (See the simulation sketch after this list, where I try to check the stated values numerically.)
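To sanity-check the stated degrees of freedom numerically, here is a small simulation sketch I wrote (my own code, not from the video; I assume normal errors and set $b_1 = 0$ so that $H_0$ holds, and the sample size, noise level, and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 30, 2.0, 20_000
x = np.linspace(0.0, 1.0, n)

sse_reg = np.empty(reps)
sse_res = np.empty(reps)
for r in range(reps):
    # Under H0 the slope is 0, so the data are just an intercept plus noise
    y = 5.0 + rng.normal(0.0, sigma, n)
    # Ordinary least squares estimates of b1 and b0
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x
    sse_reg[r] = np.sum((y_hat - y.mean()) ** 2)
    sse_res[r] = np.sum((y - y_hat) ** 2)

# A chi-squared variable with k degrees of freedom has mean k,
# so these averages should sit near 1 and n - 2 = 28 respectively.
print(np.mean(sse_reg) / sigma**2)
print(np.mean(sse_res) / sigma**2)
```

My understanding is that, since a $\chi^2_k$ variable has mean $k$, the two printed averages should land near $1$ and $28$ if the stated degrees of freedom are right.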

Can anyone help me on this?

rusiano
  • Ultimately you are asking why, in simple linear regression, the F-statistic used to test the significance of the model follows $F_{1, n - 2}$ under $H_0$. The more general case (that is, when the model has more than $1$ predictor) has been mathematically treated in the first part of this answer. Also check this answer to understand why the "degrees of freedom" of the residual sum of squares is $n - 2$. – Zhanxiong Mar 19 '24 at 12:18
  • This is a special case of Cochran's Theorem. – whuber Mar 19 '24 at 13:23
  • Your edited title is a wrong claim: the df is $n - 2$, not $n - 1$. – Zhanxiong Mar 19 '24 at 14:45

0 Answers