5

I have code that calculates $R^2$ from raw sums, $$R^2 = \frac{(\sum xy - \frac1n \sum x \sum y)^2}{(\sum x^2 - \frac1n (\sum x)^2) (\sum y^2 - \frac1n (\sum y)^2)},$$ which is equivalent to $$R^2 = \frac{\operatorname{cov}(x, y)^2}{\operatorname{var}(x) \cdot \operatorname{var}(y)}.$$

I know the code is correct by benchmarking, but I have never seen this form. Can someone please explain or provide a reference? Thanks!

FWIW, the code is built for speed. It does rolling regressions and can quickly find each summation by differencing a cumulative sum.
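For concreteness, the cumulative-sum approach described above could look roughly like the following. This is a minimal sketch, assuming NumPy and a fixed window length; the function name `rolling_r2` and its parameters are illustrative, not the actual code in question.

```python
import numpy as np

def rolling_r2(x, y, window):
    """Rolling R^2 of y on x for every full window (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = window

    def window_sums(v):
        # Prepend 0 so that c[i + n] - c[i] equals sum(v[i : i + n]).
        c = np.concatenate(([0.0], np.cumsum(v)))
        return c[n:] - c[:-n]

    sx, sy = window_sums(x), window_sums(y)
    sxx, syy, sxy = window_sums(x * x), window_sums(y * y), window_sums(x * y)

    # Centered sums, exactly as in the summation formula for R^2.
    sxy_c = sxy - sx * sy / n
    sxx_c = sxx - sx * sx / n
    syy_c = syy - sy * sy / n
    return sxy_c**2 / (sxx_c * syy_c)
```

Each window costs a handful of subtractions, which is where the speed comes from; the result has `len(x) - window + 1` entries.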

  • You've never seen which form? The first, or the second? The first form is a very dangerous way to calculate $R^2$. There are much more numerically stable one-step update methods. – cardinal Oct 15 '11 at 23:24
  • @cardinal -- I haven't seen either (maybe I shouldn't have converted to var/covar -- I thought it might save an answerer some time). I am more familiar with the $R^2 = 1 - SS_{err}/SS_{tot}$ version. Why does the var/covar form work? – Richard Herron Oct 16 '11 at 00:44
  • Correlation is normalized covariance. Nothing more, nothing less. The $R^2 =1 -\frac{SSE}{SS_{total}}$ relates to regression with non-OLS conditions. Some people use r for normalized covariance and R for the extended definition. There are expected value identities that account for the OP's question. – Carl Nov 23 '16 at 06:02
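
On the stability point raised in the comments: the raw-sum form subtracts two large, nearly equal quantities whenever the mean is large relative to the spread. One commonly used alternative is a Welford-style one-pass update of the centered sums. The sketch below is an illustration under that assumption, not necessarily the method the commenter had in mind, and the name `online_r2` is hypothetical.

```python
def online_r2(xs, ys):
    """One-pass R^2 using running means and centered co-moments (sketch)."""
    n = 0
    mean_x = mean_y = 0.0
    m_xx = m_yy = m_xy = 0.0  # running sums of centered products
    for x, y in zip(xs, ys):
        n += 1
        dx = x - mean_x  # deviation from the previous mean of x
        dy = y - mean_y  # deviation from the previous mean of y
        mean_x += dx / n
        mean_y += dy / n
        # Old-mean deviation times new-mean deviation keeps the sums centered.
        m_xx += dx * (x - mean_x)
        m_yy += dy * (y - mean_y)
        m_xy += dx * (y - mean_y)
    return m_xy**2 / (m_xx * m_yy)
```

Because every term is a product of deviations from the mean, no large uncentered sums are ever formed. This version only adds observations; a rolling window would also need a matching removal step.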

1 Answer

3

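The correlation is the covariance scaled by the SDs, $r=\text{cor}(x,y)=\text{cov}(x,y)/[\text{SD}(x) \; \text{SD}(y)]$. In a simple linear regression with an intercept, $R^2 = r^2$, so squaring gives $R^2 = \text{cov}(x,y)^2/[\text{var}(x) \; \text{var}(y)]$, which is your second formula. The summation form is the same thing, because $\sum xy - \frac1n \sum x \sum y = \sum (x-\bar x)(y-\bar y)$ (and likewise for the two variance terms), and the common $1/(n-1)$ factors cancel between numerator and denominator. A reference seems unnecessary.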
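
The identity is also easy to check numerically. The following is a minimal sketch, assuming NumPy and simulated data; the seed, sample size, and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, simulated data only
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Squared correlation: cov(x, y)^2 / (var(x) * var(y)), all with ddof=1.
cov_xy = np.cov(x, y)[0, 1]
r2_cov = cov_xy**2 / (np.var(x, ddof=1) * np.var(y, ddof=1))

# Regression definition: 1 - SS_err / SS_tot for an OLS line with intercept.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
r2_reg = 1.0 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)

print(r2_cov, r2_reg)  # the two values should agree up to rounding error
```

The agreement depends on the fit being ordinary least squares with an intercept; without the intercept, the two quantities generally differ.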

Karl
  • 6,197