
I'm completely confused about how to calculate R-squared for given lists of predicted and actual values.

As an example, assume that my predicted values are: [3, 8, 10, 17, 24, 27] and my actual values are [2, 8, 10, 13, 18, 20].

According to Wikipedia, I do the following:

  1. Get the mean of the actual values: y-bar = 11.8333.
  2. Compute the residual sum of squares (RSS): for each pair of values, take the difference, square it, and sum the results, i.e. (3-2)^2 + (8-8)^2 + (10-10)^2 and so on. For my data this is 102.
  3. Compute the total sum of squares (TSS): for each actual value, subtract the mean of the actual values from it, square the result, and sum all of these, i.e. (2-11.8333)^2 + (8-11.8333)^2 and so on. So TSS = 220.8333.
  4. R^2 = 1 - RSS/TSS = .53811
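The steps above can be sketched in plain Python (no libraries; the two lists are the ones from my example):

```python
predicted = [3, 8, 10, 17, 24, 27]
actual = [2, 8, 10, 13, 18, 20]

# Step 1: mean of the actual values (y-bar)
y_bar = sum(actual) / len(actual)  # 11.8333...

# Step 2: residual sum of squares (RSS)
rss = sum((p - a) ** 2 for p, a in zip(predicted, actual))  # 102

# Step 3: total sum of squares (TSS)
tss = sum((a - y_bar) ** 2 for a in actual)  # 220.8333...

# Step 4: R^2 = 1 - RSS/TSS
r_squared = 1 - rss / tss

print(round(r_squared, 5))  # 0.53811
```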

Contrast this method with one described here, which says I also need to use the average of the predicted values, and with what Excel gives via its RSQ function (0.9729).

Am I doing something wrong above? Which is the correct formula/method to use?

Jer
  • That applet and Excel are fitting a least squares line to these data, which is totally different: they aren't treating the first set as "predicted" values--they treat them as predictors. Because these are lousy predicted values, the least squares fit is a huge improvement, as reflected in the much larger value of $R^2$ it yields. – whuber Nov 30 '22 at 23:22
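A minimal sketch of what whuber describes, using the standard simple-regression slope/intercept formulas in plain Python: fit a least-squares line treating the first list as a predictor, then compute the $R^2$ of that fit. It matches Excel's RSQ value rather than the manual calculation above.

```python
predicted = [3, 8, 10, 17, 24, 27]
actual = [2, 8, 10, 13, 18, 20]
n = len(actual)

mean_p = sum(predicted) / n
mean_a = sum(actual) / n

# Simple least-squares fit: actual ~ intercept + slope * predicted
cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
var_p = sum((p - mean_p) ** 2 for p in predicted)
slope = cov / var_p
intercept = mean_a - slope * mean_p
fitted = [intercept + slope * p for p in predicted]

# R^2 of the fitted line (residuals are now taken from the fit, not from the raw "predictions")
ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
ss_tot = sum((a - mean_a) ** 2 for a in actual)
r2_fit = 1 - ss_res / ss_tot

print(round(r2_fit, 4))  # 0.9729 -- matches Excel's RSQ, not the manual 0.5381
```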

1 Answer


There are multiple ways of calculating $R^2$ that are equivalent for ordinary least squares linear regression but are not equal in other circumstances.

I like $R^2=1-\frac{ SSRes }{ SSTotal }$, for reasons I discuss here. (Fair warning: this is not an introductory topic.)

However, some people like $R^2=(corr(y,\hat y))^2$. That appears to be the formula used in Excel.

If you fit by ordinary least squares, these two equations will be equal. If you do not fit by ordinary least squares, the two need not be equal.
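As a sketch of that difference, here are both versions computed on the numbers from the question, in plain Python with no model fitting involved:

```python
predicted = [3, 8, 10, 17, 24, 27]
actual = [2, 8, 10, 13, 18, 20]
n = len(actual)

mean_a = sum(actual) / n
mean_p = sum(predicted) / n

# Version 1: R^2 = 1 - SSRes/SSTotal
ss_res = sum((p - a) ** 2 for p, a in zip(predicted, actual))
ss_tot = sum((a - mean_a) ** 2 for a in actual)
r2_resid = 1 - ss_res / ss_tot

# Version 2: R^2 = corr(y, y-hat)^2, the squared Pearson correlation
cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
var_p = sum((p - mean_p) ** 2 for p in predicted)
var_a = sum((a - mean_a) ** 2 for a in actual)
r2_corr = cov ** 2 / (var_p * var_a)

print(round(r2_resid, 4), round(r2_corr, 4))  # 0.5381 0.9729
```

The two results only coincide when the predictions come from an ordinary least squares fit; for arbitrary prediction lists like these, they can be far apart.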

The disagreement arises because your predictions are poor predictions of the actual values, hence the low $R^2$ calculated the way I prefer. However, there is a strong linear relationship between the predictions and the observations, hence the high squared correlation.

As for what equation you should use to calculate your quantity of interest, that depends on what you find interesting! Both of these equations can give useful information, depending on what you want to know.

Dave
  • I feel like I'm in over my head already, but are you saying that how the predicted values were obtained matters? Can I not just take any two lists of number and compute the R-squared? – Jer Nov 30 '22 at 23:28
  • Also, you lost me at "I like..." vs. "some other people like...". Isn't there just a definition of what this is, with no opinions involved? I mean, we don't have different opinions of how to calculate the sum of a list of numbers, right? – Jer Nov 30 '22 at 23:31
  • Sure, you can calculate $R^2$ for two equal-length lists of numbers, but using which formula? What do you want that calculation to tell you about your data? If you want to know how good the predictions are, use the formula I prefer. If you want to know if there is a linear relationship between the predictions and observations, square the correlation. (It gets even worse for out-of-sample $R^2$, as I discuss in the linked question.) – Dave Nov 30 '22 at 23:31
  • Yes, there are opinions. I disagree with how a popular Python package calculates $R^2$. The way they calculate what they claim to calculate is spot-on (I assume), so I can’t fault their implementation, but I disagree with them regarding that as being in the spirit of what $R^2$ should tell you. Ditto for how your Excel function calculates $R^2$. // You will benefit greatly from considering what you want your $R^2$ calculation to tell you about your data. Excel is correct when it tells you what it claims to tell you, but you might care about something else, and that’s okay. – Dave Nov 30 '22 at 23:33
  • Ok, thanks. It kind of sounds like these should not both be called "R-Squared" then, no? Coming from a non-stats background it just seems odd that we have this thing called "R-squared", which has multiple different definitions and formulas. Or am I completely confused about what R-Squared even is? – Jer Nov 30 '22 at 23:39
  • In other words, we have a function called "sum" which takes as input a list of numbers and is defined as adding all of them together, and there's no debate about that (I assume). But here we have a function called "r-squared" which takes as input two equal-length lists of numbers, and is defined as either A or B. Why is that? – Jer Nov 30 '22 at 23:42
  • The trouble is that there are multiple calculations that yield the same value in a simple setting, and each can be claimed to be the true $R^2$. When we then extend those ideas to more complicated work, that’s where the disagreements start. I do not see anything analogous for sums. – Dave Nov 30 '22 at 23:44
  • To add to Dave's last comment, $R^2$ has a standard meaning within the framework of ordinary least squares regression. You are working outside that framework. There appears to be one useful generalization that would work--it corresponds to your manual calculation--but by trying to apply software that assumes you are doing OLS, you are not getting the calculations you need. Indeed, even sums suffer from a similar problem: they can be defined for far more than real numbers, as in modular arithmetic; but if you ask Excel to perform sums in a finite field, it will get the wrong results. – whuber Dec 01 '22 at 00:33
  • @whuber - got it, thanks - I can appreciate that analogy (as I'm a little more well-versed in algebra than stats). Addition means one thing over the integers (4 + 4 = 8), but a different thing over the integers modulo 5 (4 + 4 = 3). This is kind of like that I guess. Thanks to both you and Dave for humoring me a bit. – Jer Dec 01 '22 at 00:38