I've come across two ways that people calculate R-squared on a test set (both written out as formulas below the list):
- Calculate the square of the correlation between predictions and actual values (in practice, I've seen people do this in R by regressing the test set Y on the test set predictions and reporting the R-squared from R's lm summary)
- Use the sum of squares formula (the formula used by https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score)
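For concreteness, writing $y_i$ for the test set actuals, $\hat{y}_i$ for the predictions, and $\bar{y}$ for the mean of the actuals, the two quantities are (the labels are my own):

$$R^2_{\text{approach 1}} = \operatorname{corr}(y, \hat{y})^2, \qquad R^2_{\text{approach 2}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$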
When applied to the training set of a linear regression fit with an intercept, those two approaches are identical (see "Relationship between $R^2$ and correlation coefficient" for example), so there is no problem using one formula or the other.
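A quick sanity check of that training-set equivalence, on toy data of my own:

## Training-set check: for OLS with an intercept, the two formulas coincide
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)
summary(fit)$r.squared                            # all three of these agree
cor(fitted(fit), y)^2                             # squared correlation on the training set
1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)  # sum of squares formula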
However, on a test set the average prediction error might be nonzero (in extreme cases the predictions can even be negatively correlated with the ground truth), and then approaches 1 and 2 can produce different results. In the extreme, approach 1 could yield a positive R-squared while approach 2 yields a negative R-squared (see "What does negative R-squared mean?" for examples). Is there a generally accepted way to refer to these two quantities, i.e. to name and disambiguate them, so that it's clear what people mean when they refer to a model's "test set R-squared"?
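Before the full simulation below, a tiny toy example (the numbers are my own) shows how far apart the two quantities can be: predictions that are perfectly anti-correlated with the actuals score 1 under approach 1 and a strongly negative value under approach 2.

actual <- c(1, 2, 3, 4, 5)
preds <- c(5, 4, 3, 2, 1)
cor(preds, actual)^2                                          # approach 1: 1, looks perfect
1 - sum((actual - preds)^2) / sum((actual - mean(actual))^2)  # approach 2: -3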
Related question: "How to calculate out of sample R squared?" is similar to mine but doesn't exactly answer how to disambiguate the two R-squared formulas.
In case the motivation isn't clear, here is an R simulation illustrating the difference between the two approaches using fake data:
## When computing R-squared on a test set, the result can differ meaningfully based on which formula we use
## This happens because average prediction error on a test set is often nonzero, especially in real-world problems
## Which formula do we want to use? Should we report both versions of test set R-squared?
## Or, to keep things simple, should we report test set RMSEs and MAEs instead of R-squared?
set.seed(123123)
n_obs <- 500
## This dataframe is a simulated test set
## Imagine that the predictions come from a model trained on some other dataset
## (which, for simplicity, is not simulated here)
df <- data.frame(prediction=runif(n_obs, min=-5, max=5))
## Note that the constant added in the next line means that our
## average prediction error is not zero
## This will often be the case on real-world test sets
df$actual <- 1.25 + df$prediction + rnorm(n_obs, sd=2.5)
r2_general <- function(preds, actual) {
  ## Same formula as https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score
  return(1 - sum((actual - preds)^2) / sum((actual - mean(actual))^2))
}
## This is our average prediction error on the test set
## It will be close to the constant in the df$actual <- ... line above
mean(df$actual - df$prediction)
## The model's predictions are positively correlated with the actual values,
## but, importantly, the average prediction error is nonzero
plot(df$prediction, df$actual)
cor(df$prediction, df$actual)
cor(df$prediction, df$actual)^2
regression <- lm(actual ~ prediction, data=df)
## The R-squared value in this regression output looks pretty good (around 0.55),
## but that hides the fact that the average prediction error is nonzero
## Note that this R-squared is identical to cor(df$prediction, df$actual)^2
summary(regression)
## This is lower than the R-squared from the regression output, because the average prediction error is nonzero
## This calculation gives 0.44 instead of 0.55 -- the test set R-squared depends on which formula we use
r2_general(preds=df$prediction, actual=df$actual)
df$prediction_from_regression <- predict(regression, newdata=df)
## Fitting a regression on our test set creates a new set of predictions,
## called prediction_from_regression, which differ from our original predictions
## Importantly, the average error of prediction_from_regression is zero
## Note that r2_general on these new predictions agrees exactly with lm's R-squared
r2_general(preds=df$prediction_from_regression, actual=df$actual)
cor(df$prediction_from_regression, df$actual)^2
summary(regression)$r.squared
mean(df$actual - df$prediction) # Not zero
mean(df$actual - df$prediction_from_regression) # Zero
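## Given the ambiguity, one option (my own sketch, not an established convention)
## is to report both quantities side by side under explicit, unambiguous names
test_set_r2 <- function(preds, actual) {
  c(r2_squared_correlation = cor(preds, actual)^2,
    r2_sum_of_squares = r2_general(preds = preds, actual = actual))
}
test_set_r2(df$prediction, df$actual)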