
I am running a supervised regression with cross-validation and wish to use $R^2$ as my performance metric. I am using Leave-P-Out cross-validation with P=2, which gives me approximately 4500 folds, and I obtain an $R^2$ statistic for each fold.

I want to find a sensible way to summarize the explained variance over the entire model. Initially, I had assumed the average over all folds would be sensible. However, the issue I have is that $R^2$ is unbounded below. In some folds I get a reasonable level of performance ($R^2 = 0.6$), and in others I get enormously negative values ($R^2 = -1900$), no doubt because each fold is tested on only 2 instances. As a result, the average $R^2$ over all 4500 folds is a negative number.
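For reference, a simplified sketch of the kind of setup I mean (a placeholder estimator and synthetic data, not my actual model or data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeavePOut, cross_val_score

# Synthetic stand-in for my data: 95 instances -> C(95, 2) = 4465 two-instance folds
rng = np.random.default_rng(0)
X = rng.normal(size=(95, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=5.0, size=95)

# One R^2 score per two-instance test fold
scores = cross_val_score(LinearRegression(), X, y, cv=LeavePOut(2), scoring="r2")

print(scores.max())   # some folds look reasonable
print(scores.min())   # others are enormously negative
print(scores.mean())  # the average is dragged below zero
```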

My question is: can I sensibly justify representing this in some other way, to demonstrate that "when the model works, $R^2 = X$"?

Given that a model predicting the expected value of the target would generate $R^2 = 0$, is there any practical difference between $R^2 = 0$ and $R^2 = -1000$? If not, would it be appropriate to treat all negative $R^2$ scores as 0, and thus avoid washing away the positive values where the model did work?

  • Why do you consider it a "wrong" result that your average $R^2$ is $<0$? It could be that your model is a poor one. After all, $R^2<0$ means that you would be better off predicting the mean value of $y$ every time, no matter what values the predictors have. – Dave Sep 28 '21 at 14:25
  • @Dave It's not that I consider it wrong; I know it's a weak model. My question is more about how to represent performance when it does work, or whether that is even a useful question. – cookie1986 Sep 28 '21 at 14:28

1 Answer


Yes, I say that it is wrong to consider $R^2 = -1000$ to be the same as $R^2 = 0$. In the latter case, model performance is no worse than naïvely guessing the overall mean every time, while the former indicates that all of your fancy modeling cannot even do as well as predicting average(a:a) (to use some Excel syntax) every time. That is, there is a way to get better performance while spending less to get it.

If your cross-validation shows that such performance is common and/or severe enough that the average is dragged down, you simply have evidence that your model does a poor job of predicting. This is disappointing, sure, but the whole reason we do validation is to catch this kind of poor performance. One thought could be to consider the median performance across folds, if you are concerned about a few severe "outlier" folds ruining everything.
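As a quick sketch (the per-fold scores here are just toy values to show the effect):

```python
import numpy as np

# Toy per-fold R^2 values: mostly decent folds, plus a couple of extreme negative ones
scores = np.array([0.6, 0.55, 0.4, 0.5, -3.2, -1900.0])

print(np.mean(scores))    # about -317: dominated by the one extreme fold
print(np.median(scores))  # 0.45: far more robust to the extreme folds
```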

Finally, watch out for which calculation you are using for your out-of-sample $R^2$. While I disagree with the usual sklearn implementation and believe my proposed calculation has stronger motivation as a statistic or measure of performance, I concede that both calculations are likely to give similar answers in most circumstances. However, when your holdout set is just two points, there is a lot of room for the holdout mean to differ markedly from the training mean. Since the mean minimizes the sum of squared deviations, the sklearn denominator cannot be larger than my denominator (and the numerators are the same), so the sklearn implementation is a lower bound on the calculation I have proposed, and your performance might improve, perhaps dramatically, if you use the $R^2$ calculation I prefer.

(Whether or not my calculation or any of these calculations should be called $R^2$ is a different story, and I am open to using different notation for these different statistics.)

$$ R^2_{\text{out-of-sample, Dave}}= 1-\left(\dfrac{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\bar y_{\text{in-sample}} \right)^2 }\right) $$

$$ R^2_{\text{out-of-sample, scikit-learn}}= 1-\left(\dfrac{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\bar y_{\text{out-of-sample}} \right)^2 }\right) $$
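As a rough sketch of how the two differ on a two-point holdout (the function names and toy numbers are mine, not anyone's official API):

```python
import numpy as np

def r2_train_mean(y_true, y_pred, y_train_mean):
    """Out-of-sample R^2 with the training-sample mean in the denominator (my preferred version)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_train_mean) ** 2)
    return 1 - ss_res / ss_tot

def r2_holdout_mean(y_true, y_pred):
    """Out-of-sample R^2 with the holdout-sample mean in the denominator (what sklearn's r2_score does)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Toy two-point holdout whose mean sits far from the (hypothetical) training mean
y_true = np.array([10.0, 10.5])
y_pred = np.array([9.0, 11.0])
y_train_mean = 5.0

print(r2_holdout_mean(y_true, y_pred))               # -9.0
print(r2_train_mean(y_true, y_pred, y_train_mean))   # about 0.98
```

The same two predictions score heavily negative under the holdout-mean denominator but close to 1 under the training-mean denominator, which is exactly the situation a two-point test fold can create.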

Dave