
I've come across two ways that people calculate R-squared on a test set:

  1. Calculate the square of the correlation between predictions and actual values (in practice, I've seen people do this in R by regressing the test-set Y on their test-set predictions and reporting the R-squared from R's lm summary)
  2. Use the sum of squares formula (the formula used by https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score)

When applied to the training set of a linear regression, those two approaches are identical (see "Relationship between $R^2$ and correlation coefficient", for example), so there is no problem using one formula or the other.
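
As a quick sanity check of that training-set equivalence, here is a minimal sketch (simulated data and variable names, purely illustrative) showing that both calculations match lm's reported R-squared on the data used for fitting:

## Minimal check of the training-set equivalence (simulated data, purely illustrative)
set.seed(1)
x <- runif(200)
y <- 2 * x + rnorm(200)
fit <- lm(y ~ x)
preds <- fitted(fit)

cor(preds, y)^2                                # approach 1: squared correlation
1 - sum((y - preds)^2) / sum((y - mean(y))^2)  # approach 2: sum-of-squares formula
summary(fit)$r.squared                         # both match lm's reported R-squared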

However, on a test set, where the average prediction error might be nonzero (and you could even see extreme cases where predictions are negatively correlated with the ground truth, on top of a nonzero average prediction error), approaches 1 and 2 can produce different results. In extreme cases, approach 1 can yield a positive R-squared while approach 2 yields a negative one (see "What does negative R-squared mean?" for examples; a minimal sketch of such a case follows below). Is there a generally accepted way to refer to these two quantities, i.e. to name and disambiguate them, so that it's clear what people mean when they refer to a model's "test set R-squared"?
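
For concreteness, here is a minimal sketch of such an extreme case (made-up numbers, separate from the fuller simulation below), where the predictions track the truth but are badly miscalibrated:

## Extreme case: high squared correlation, but negative sum-of-squares R-squared
set.seed(1)
actual <- rnorm(100)
prediction <- 5 * actual + 10 + rnorm(100, sd=0.5)  # correlated, but biased and overscaled
cor(prediction, actual)^2                                          # close to 1
1 - sum((actual - prediction)^2) / sum((actual - mean(actual))^2)  # strongly negative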

Related question: How to calculate out of sample R squared? is similar to my question but doesn't exactly answer how to disambiguate the two R-squared formulas.

In case the motivation isn't clear, here is an R simulation illustrating the difference between the two approaches using fake data:

## When computing R-squared on a test set, the result can differ meaningfully based on which formula we use
## This happens because average prediction error on a test set is often nonzero, especially in real-world problems
## Which formula do we want to use?  Should we report both versions of test set R-squared?
## Or, to keep things simple, should we report test set RMSEs and MAEs instead of R-squared?

set.seed(123123)

n_obs <- 500

## This dataframe is a simulated test set.
## Imagine that the predictions come from a model trained on some other dataset
## (which, for simplicity, is not simulated here).
df <- data.frame(prediction=runif(n_obs, min=-5, max=5))

## Note that the addition of the constant in this line means that our
## average prediction error is not zero.
## This will often be the case on real-world test sets.
df$actual <- 1.25 + df$prediction + rnorm(n_obs, sd=2.5)

r2_general <- function(preds, actual) {
  ## Same formula as https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score
  return(1 - sum((actual - preds)^2) / sum((actual - mean(actual))^2))
}

## This is our average prediction error on the test set.
## It will be close to the constant in the df$actual <- ... line above.
mean(df$actual - df$prediction)

## The model's predictions are positively correlated with the actual values,
## but, importantly, the average prediction error is nonzero.
plot(df$prediction, df$actual)
cor(df$prediction, df$actual)
cor(df$prediction, df$actual)^2

regression <- lm(actual ~ prediction, data=df)

## The R-squared value in this regression output looks pretty good (around 0.55),
## but that hides the fact that the average prediction error is nonzero.
## Note that this R-squared is identical to cor(df$prediction, df$actual)^2.
summary(regression)

## This is lower than the R-squared from the regression output, because the
## average prediction error is nonzero.
## This calculation gives 0.44 instead of 0.55 -- the test set R-squared
## depends on which formula we use.
r2_general(preds=df$prediction, actual=df$actual)

## Fitting a regression on our test set creates a new set of predictions,
## called prediction_from_regression, which differ from our original predictions.
## Importantly, the average error of prediction_from_regression is zero.
df$prediction_from_regression <- predict(regression, newdata=df)

## Note that this agrees exactly with lm's R squared.
r2_general(preds=df$prediction_from_regression, actual=df$actual)
cor(df$prediction_from_regression, df$actual)^2
summary(regression)$r.squared

mean(df$actual - df$prediction)                  # Not zero
mean(df$actual - df$prediction_from_regression)  # Zero

Adrian
    As far as I'm concerned, the justification for the term $R^2$ is within the standard linear regression model with intercept, on the data used for fitting, where it cannot be negative. Whenever people use the term outside that context, they should not assume that it is clear what it means, and define their use explicitly (or even better give it another name). I know some people just use it without explicit explanation, but it may be their fault, not the reader's, if they are not understood. – Christian Hennig Mar 30 '22 at 21:17
  • Another related question+answer that could be helpful to someone reading this post: https://stats.stackexchange.com/a/551916/9330 – Adrian Mar 30 '22 at 21:35
  • @Adrian Could you please explain how my post about $R^2$ in the nonlinear case comes into play here? – Dave Mar 30 '22 at 21:52
  • @Dave yes, great question! The connection I see is that the "Other" term that you emphasize in your answer can be nonzero even in the case of a linear regression when we are evaluating that regression on the test set. In other words, we can be certain that the "Other" term in your answer is zero only if we are (a) dealing with a linear regression and (b) evaluating that linear regression on the training set. Do you agree? – Adrian Mar 30 '22 at 23:49
  • I’m with you so far and am starting to see the connection. Anything else? – Dave Mar 31 '22 at 00:17
  • @Dave I think the next logical step is that your conclusion about R-squared ("it would be incorrect to interpret as the proportion of variance explained") also applies to R-squared calculated on a test set (regardless of whether the model is linear or nonlinear). Does that seem correct to you? – Adrian Mar 31 '22 at 16:52
  • Yes, we would lack the desired $TSS = SSRes + SSReg$ decomposition in that case. – Dave Mar 31 '22 at 17:43
  • https://stats.stackexchange.com/questions/580757/why-are-my-test-data-r-squareds-identical-despite-using-different-training-data is related and might be interesting to anyone reading this question – Adrian Jul 05 '22 at 21:02

1 Answer


There seems to be a lack of consistency in what $R^2$ should mean outside of the simple settings.

The simplest case is simple linear regression, where the squared Pearson correlation between the feature and the outcome equals the squared Pearson correlation between the true and predicted outcomes. These both equal the "sum of squares formula" you mention from sklearn: $1 - \dfrac{\sum_{i=1}^{N}\left( y_i-\hat y_i \right)^2}{\sum_{i=1}^{N}\left( y_i-\bar y \right)^2}$. All three of these turn out to be equal to the proportion of variance in $y$ explained by the regression. Thus, in the simple case, there are four notions of what $R^2$ means, and they all coincide.
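
As a small numerical illustration of those equalities (simulated data, just a sketch), all four quantities agree for a simple linear regression fit:

## Simple linear regression: the four notions of R-squared coincide (simulated data)
set.seed(42)
x <- rnorm(300)
y <- 1 + 0.5 * x + rnorm(300)
fit <- lm(y ~ x)
yhat <- fitted(fit)

cor(x, y)^2                                   # squared correlation, feature vs outcome
cor(yhat, y)^2                                # squared correlation, predicted vs actual
1 - sum((y - yhat)^2) / sum((y - mean(y))^2)  # sum-of-squares formula
summary(fit)$r.squared                        # proportion of variance explained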

Moving to a more general setting, there might be more than one feature, so the correlation between the feature and the outcome no longer makes sense. However, the other three notions can all be calculated, and each has a legitimate claim to being called $R^2$. I would go with the "sum of squares" formula, since it has a nice connection to a comparison against a "must beat" model that makes sense to me, but all three notions can be defended. I also like the connection this calculation has to the reduction in error rate achieved by a classifier (and the fact that this reduction in error rate relates to a reasonable definition of the familiar $R^2$ statistic makes me like the reduction-in-error-rate statistic all the more).
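
Concretely, one such "must beat" baseline is a model that always predicts $\bar y$; the sum-of-squares formula then reads

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat y_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar y\right)^2} = 1 - \frac{\text{MSE(model)}}{\text{MSE(mean-only baseline)}},$$

so it is positive exactly when the model beats that baseline and negative when it does worse.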

The good news is that you're always allowed to define a statistic. If you have a reason to want to know the squared correlation between the true and predicted values, feel free to define and calculate such a statistic. If you have a reason to want to use the "sum of squares" formula like sklearn uses, define it and use it. If you want to modify the sklearn formula in the way that makes sense to me for out-of-sample testing, define the formula and use it. If you want to decompose the total sum of squares to get the $SSRes$, $SSReg$, and $Other$ term I discuss here in order to discuss the proportion of the variance in $y$ that is explained by the model, define it and use that formula.

But as far as there being a name that basically every statistician knows, like how I can write $\bar x$ without ambiguity, no, I do not sense that for $R^2$.

Dave