
In the context of ordinary least squares regression with observed responses $Y$ and residual errors $\hat \varepsilon$, the $R^2$ coefficient is defined as $ R^2 = 1-\frac{\|\hat \varepsilon\|^2}{\|Y - \bar Y\|^2}. $ This is non-decreasing in the addition of covariates to the model, and hence the adjusted $R^2$ coefficient defined by $ R^2_a = 1-\frac{\|\hat \varepsilon\|^2/(n-p-1)}{\|Y - \bar Y\|^2/(n-1)} $ is often preferred.
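To make the contrast concrete, here is a small numerical sketch (my own illustration, not from the question) comparing $R^2$ and $R^2_a$ as irrelevant covariates are added. The function `r2_stats` and the simulated data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)  # true model depends on x only

def r2_stats(X, y):
    """Return (R^2, adjusted R^2) for OLS of y on X, with an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])           # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid                          # ||eps_hat||^2
    ss_tot = ((y - y.mean()) ** 2).sum()            # ||Y - Ybar||^2
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))
    return r2, r2_adj

# Model with the true covariate only
r2_1, adj_1 = r2_stats(x[:, None], y)
# Same model plus five pure-noise covariates
noise = rng.normal(size=(n, 5))
r2_2, adj_2 = r2_stats(np.column_stack([x, *noise.T]), y)

print(r2_2 >= r2_1)   # plain R^2 never decreases when covariates are added
```

Note that $1 - R^2_a$ is the ratio of the usual unbiased variance estimates of the residuals and of $Y$, which is the property the adjustment buys.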

Why is this particular adjustment of $R^2$ the most commonly used? What is the underlying justification?


Notes

  • I'm mostly interested in historical and rigorous references. I haven't found a satisfactory answer on Wikipedia or in the regression books I consulted.
  • I'm not looking for the one explanation to end all explanations; my goal is to get a good grasp on different understandings of $R^2_a$.
Olivier

1 Answer


The residual sum of squares has $n-p-1$ degrees of freedom, while the total sum of squares has $n-1$ degrees of freedom. So you normalize each sum of squares by its degrees of freedom, which turns each into an unbiased variance estimate; the adjusted $R^2$ is one minus the ratio of these. This is closely related to the overall F statistic reported in standard multiple regression output.
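The connection to the F statistic can be checked numerically: the overall F statistic computed from the ANOVA decomposition equals the same quantity expressed through $R^2$ alone, since $F = \frac{R^2/p}{(1-R^2)/(n-p-1)}$. A minimal sketch (the data here are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)

# Fit OLS with an intercept
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()

r2 = 1 - ss_res / ss_tot
# F statistic from the ANOVA decomposition of sums of squares...
f_anova = ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))
# ...and the same statistic written purely in terms of R^2
f_r2 = (r2 / p) / ((1 - r2) / (n - p - 1))

print(np.isclose(f_anova, f_r2))  # True: the two expressions are identical
```

The same degrees-of-freedom pair $(p,\ n-p-1)$ that appears in the F test is what the adjusted $R^2$ uses to normalize its sums of squares.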

Aksakal
  • This is self-evident from the definition. Here I'm really asking about the fundamental reason why this adjustment is used instead of any other. Isn't there something more profound? – Olivier Feb 21 '18 at 03:39