1

I found two different and contradictory definitions/interpretations of $R^2$. They cannot both be true at the same time, and I don't know how to tell which one is correct.

  1. $R^2$ is just the square of Pearson's correlation. Hence its value lies in $[0, 1]$
  2. $R^2$ is equal to $$1 - \dfrac{SSE}{SST}$$

and can be negative if the model is worse than a baseline that always predicts the mean. For example, in Python, sklearn's `r2_score` function can return a negative number.
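To illustrate what I mean, here is a small made-up example (the data are arbitrary, just chosen so the predictions are worse than the mean baseline):

```python
import numpy as np
from sklearn.metrics import r2_score

# Predictions that are anti-correlated with the truth,
# so they do worse than always predicting the mean of y_true
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([4.0, 3.0, 2.0, 1.0])

# SSE = 20, SST = 5, so R^2 = 1 - 20/5 = -3
print(r2_score(y_true, y_pred))  # → -3.0
```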

Are those two different $R^2$s? Why are there two contradictory definitions? Can anyone explain it comprehensively?

User1865345
  • 8,202
patpal
  • 13
  • 1
    Just because two formulas are not algebraically equivalent does not mean they are contradictory: one can be a special case of the other, for instance. For 14 additional formulas for $R^2$ see https://stats.stackexchange.com/questions/70969. (Strictly speaking, you need to square those formulas for $\rho.$) – whuber Dec 27 '22 at 22:31
  • By contradiction I mean that one definition assumes $R^2$ can be negative and the other that it ranges between 0 and 1 – patpal Dec 28 '22 at 08:14
  • Neither definition makes such an assumption. It merely turns out that, in some circumstances (where definition (1) is inapplicable), definition (2) can yield negative values. – whuber Dec 28 '22 at 14:55

2 Answers

1

If you fit both slope and intercept with linear regression, both your definitions will be true (and both give the same value of $R^2$).

If you fit only one of those (slope or intercept) and constrain the other to a fixed value (often the intercept is forced to equal 0), then the second definition is correct, but the first definition will be wrong. In this case, yes, $R^2$ can be negative: What does negative R-squared mean?
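A quick sketch of the intercept-constrained case (simulated data, chosen so that $x$ is far from 0 and regression through the origin fits badly):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(10, 1, 50)              # x far from 0, so forcing intercept=0 hurts
y = -2 * x + 30 + rng.normal(0, 1, 50)

# Regression through the origin: fit slope only, intercept forced to 0
slope = (x @ y) / (x @ x)
y_hat = slope * x

sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r2_def2 = 1 - sse / sst                      # definition 2: can be negative here
r2_def1 = np.corrcoef(y, y_hat)[0, 1] ** 2   # definition 1: always in [0, 1]

print(r2_def2, r2_def1)  # the two definitions now disagree
```

With an intercept included, the two numbers would match; without it, the first stays in $[0, 1]$ while the second goes negative.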

1

In a simple setting where you fit a linear model, using ordinary least squares, and include an intercept, multiple definitions of $R^2$ give equal values.

  1. In simple linear regression (just one slope and the intercept), the squared Pearson correlation between the $x$ and $y$ values

  2. Squared Pearson correlation between predicted and true $y$

  3. Proportion of variance explained by the regression

  4. Your second definition: a comparison of the square loss of your model to the square loss of a naïve model that always predicts the overall (pooled, marginal, unconditional) mean of $y$

(There are others, as discussed in this link given in a comment by whuber.)

In simple settings, these give the same answer, so people use them interchangeably as the definition of $R^2$. In more complex settings, however, they can differ. I get into some of those differences here and explain my support for the last of the four.
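A numerical check of the agreement, on simulated data (arbitrary coefficients, just ordinary least squares with an intercept):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3 * x + 2 + rng.normal(size=100)

# OLS with intercept; polyfit returns coefficients highest power first
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

r2_corr_xy   = np.corrcoef(x, y)[0, 1] ** 2       # 1: squared corr(x, y)
r2_corr_pred = np.corrcoef(y, y_hat)[0, 1] ** 2   # 2: squared corr(y, y_hat)
r2_var       = np.var(y_hat) / np.var(y)          # 3: share of variance explained
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r2_sse       = 1 - sse / sst                      # 4: vs. predict-the-mean model

print(r2_corr_xy, r2_corr_pred, r2_var, r2_sse)  # all four agree
```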

Dave
  • 62,186