
I've built two models using Support Vector Machines, one with a 'linear' kernel and the other with an 'rbf' kernel. The R² score on the test data is roughly the same in both cases, around 0.82, but the score on the training data is 0.84 for the 'linear' kernel and around 0.94 for the 'rbf' kernel.
I understand that overfitting to the training set is possible, but shouldn't that yield a lower R² score on the test set? In my case, which model should be considered better?

EDIT: The models are fitted using GridSearchCV from sklearn, with 5-fold cross-validation.
The MSE for the 'linear' kernel is 6e-3 on the training set and 8e-3 on the test set.
The MSE for the 'rbf' kernel is 1e-3 on the training set and 6e-3 on the test set.
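
For context, something along these lines reproduces the setup described above (a minimal sketch; the data, parameter grids, and split are placeholders, since the actual code isn't shown):

```python
# Minimal sketch of the setup described above. The data (make_regression) and
# the parameter grids are placeholders, not the actual ones used.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_squared_error

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grids = {
    "linear": {"C": [0.1, 1, 10]},
    "rbf": {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]},
}

for kernel, grid in param_grids.items():
    search = GridSearchCV(SVR(kernel=kernel), grid, cv=5)  # 5-fold CV as described
    search.fit(X_train, y_train)
    best = search.best_estimator_
    for name, X_, y_ in (("train", X_train, y_train), ("test", X_test, y_test)):
        pred = best.predict(X_)
        print(f"{kernel} {name}: R2={r2_score(y_, pred):.3f}, "
              f"MSE={mean_squared_error(y_, pred):.2e}")
```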


2 Answers


I’ve written on Stack Exchange multiple times that $R^2$ is just a monotonic transformation of the $\operatorname{MSE}$, so having higher/lower $\operatorname{MSE}$ is equivalent to lower/higher $R^2$ (when $R^2$ is calculated the way I believe $R^2$ should be calculated).

$$ R^2=1-\left(\dfrac{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\bar y \right)^2 }\right) $$
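
Dividing numerator and denominator by $N$ makes this explicit: on a fixed data set the denominator is a constant, so $R^2$ is a strictly decreasing function of the $\operatorname{MSE}$.

$$ R^2 = 1 - \frac{\tfrac{1}{N}\sum_{i=1}^{N}\left( y_i-\hat y_i \right)^2}{\tfrac{1}{N}\sum_{i=1}^{N}\left( y_i-\bar y \right)^2} = 1 - \frac{\operatorname{MSE}}{\widehat{\operatorname{Var}}(y)} $$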

This, however, assumes a calculation on the same data set. When you calculate on different data sets, the denominator terms are different, meaning that the train and test $R^2$ values are not the same monotonic transformation of the $\operatorname{MSE}$. Consequently, it does not make sense to compare the two values any more than it makes sense to compare $R^2$ and $\operatorname{RMSE}$. Yes, both measure square loss, but they do so in different ways.

However, being interested in $R^2$ tells me that you are interested in square loss. Fortunately for you, then, it makes perfect sense to compare the train and test $\operatorname{MSE}$ values. For instance, when you use the “rbf” kernel, the test $\operatorname{MSE}$ is six times the train $\operatorname{MSE}$, suggesting a considerably worse out-of-sample fit than in-sample fit, which might indicate overfitting to coincidences in the training data rather than modeling the true phenomenon.
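
Using the numbers from the question, the test-to-train $\operatorname{MSE}$ ratios work out to:

$$ \left.\frac{\operatorname{MSE}_{\text{test}}}{\operatorname{MSE}_{\text{train}}}\right|_{\text{linear}} = \frac{8\times 10^{-3}}{6\times 10^{-3}} \approx 1.3, \qquad \left.\frac{\operatorname{MSE}_{\text{test}}}{\operatorname{MSE}_{\text{train}}}\right|_{\text{rbf}} = \frac{6\times 10^{-3}}{1\times 10^{-3}} = 6. $$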

– Dave

You are correct that a lower $R^2$ (or other performance metric) on a test set than on the training set is a sign of overfitting. This comparison is made per model, however: the fact that two different models score roughly the same on a test set doesn't tell you whether one of them is overfit. In your case, the learner with the RBF kernel shows signs of overfitting because its $R^2$ on the training set is quite a bit higher than on the test set, so the learner with a linear kernel is likely the better model. However, you should probably look at multiple performance metrics, such as RMSE, as well, and perhaps perform k-fold cross-validation to get a better idea of each model's performance before making a final decision.
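
A sketch of what such a comparison might look like with sklearn's `cross_validate` (the data and the fixed hyperparameters here are placeholders for illustration, not the questioner's actual setup):

```python
# Sketch: compare both kernels on several metrics with 5-fold cross-validation.
# The data and the fixed hyperparameters are placeholders for illustration.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

for kernel in ("linear", "rbf"):
    scores = cross_validate(
        SVR(kernel=kernel, C=1.0), X, y, cv=5,
        scoring=("r2", "neg_root_mean_squared_error"),
        return_train_score=True,
    )
    print(f"{kernel}: "
          f"train R2={scores['train_r2'].mean():.3f}, "
          f"test R2={scores['test_r2'].mean():.3f}, "
          f"test RMSE={-scores['test_neg_root_mean_squared_error'].mean():.3f}")
```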

  • I did not mention in the post, but this was done with GridSearchCV to find the optimal hyperparameters, with its default 5-fold cross-validation. Also, in the edit above I mentioned the MSE for both models. Are those values indicative of something? Also, is overfitting a problem if the model predicts equally well or a little better than another "not overfitted" model? – Sjotroll Jan 13 '23 at 14:56
  • @Sjotroll Those are really separate questions that deserve their own post. – Dave Jan 13 '23 at 15:23