It is. For instance, the Python machine learning package sklearn has the sklearn.metrics.r2_score function that can be applied to either in-sample or out-of-sample data. Others have thought about an out-of-sample $R^2$ (1),(2). An out-of-sample $R^2$ even has a relationship to the common PRESS statistic.
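To make that concrete, here is a minimal sketch (the synthetic data and variable names are my own) of the same r2_score call applied to in-sample and to out-of-sample predictions; the gap between the two values relates to the overfitting point below.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Illustrative synthetic data: a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

r2_in = r2_score(y_train, model.predict(X_train))  # in-sample R^2
r2_out = r2_score(y_test, model.predict(X_test))   # out-of-sample R^2
print(f"in-sample R^2:     {r2_in:.3f}")
print(f"out-of-sample R^2: {r2_out:.3f}")
```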
As far as why an in-sample measure is used at all, there are a few reasons. One is that the difference between an in-sample and an out-of-sample metric can shed light on overfitting issues. Another is that out-of-sample testing is a more advanced topic that, for better or for worse, isn't covered in introductory statistics classes. A third is that, for simple models, the usual in-sample $R^2$ probably isn't all that optimistically biased, and the small optimistic bias might be offset (in some sense) by the ease of interpreting the value as the squared correlation between the predictions and the true values in the OLS linear regression setting that is covered in introductory courses. I suspect that even adjusted $R^2$, which aims to account for the optimistic bias by penalizing the model for having many parameters, is not covered in the introductory classes.
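For reference, with $n$ observations and $p$ predictors, the usual adjusted $R^2$ is

$$R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1},$$

which is pulled down relative to the plain $R^2$ as $p$ grows for a fixed sample size.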
Finally, it isn't totally clear what an out-of-sample $R^2$ should even be. For instance, I totally disagree with the sklearn implementation (even though my formula and the sklearn formula should not differ by much in practice; a large disagreement between them would itself be a signal of some kind of data drift, which is enormously problematic for separate reasons). With that in mind, what exactly would you calculate as your out-of-sample $R^2$, and how would you explain that calculation to stakeholders who only took the introductory class and only know the usual $R^2$ formula?
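To illustrate one way the definitions can diverge (a sketch of one common alternative, not necessarily the one formula everyone should settle on): sklearn.metrics.r2_score centers its denominator on the mean of whatever y values it receives, so applied to a test set, the baseline "model" gets to use the test targets; the version below instead benchmarks against the training mean, the constant prediction actually available at training time.

```python
import numpy as np

def r2_oos_training_mean(y_train, y_test, y_pred):
    """Out-of-sample R^2 benchmarked against the training-set mean.

    sklearn.metrics.r2_score would use np.mean(y_test) in the denominator;
    this variant (an illustrative choice, not sklearn's definition) uses
    np.mean(y_train) as the baseline prediction instead.
    """
    y_test, y_pred = np.asarray(y_test), np.asarray(y_pred)
    ss_res = np.sum((y_test - y_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_test - np.mean(y_train)) ** 2)  # baseline: train mean
    return 1.0 - ss_res / ss_tot
```

When the training and test means are close, the two versions nearly agree, which is why a large disagreement between them would be the data-drift signal mentioned above.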
Overall, I can see why in-sample $R^2$ values are calculated, even if I absolutely see value in more sophisticated forms of model evaluation.