
I've built two models using Support Vector Machines, one with a 'linear' kernel and the other with an 'rbf' kernel. The R² score on the test data is roughly the same in both cases, around 0.82, but the score on the training data is 0.84 for the 'linear' kernel and around 0.94 for the 'rbf' kernel.
I understand that overfitting to the training set is possible, but shouldn't that yield a lower R² score on the test set? In my case, which model should be considered better?

EDIT: The models are fitted using GridSearchCV from sklearn, with 5-fold cross-validation.
The MSE for the 'linear' kernel is 6e-3 on the training set and 8e-3 on the test set.
The MSE for the 'rbf' kernel is 1e-3 on the training set and 6e-3 on the test set.
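
For context, something along these lines reproduces the setup described above (a minimal sketch; the data, parameter grids, and split are placeholders, since the actual code isn't shown):

```python
# Minimal sketch of the setup described above. The data (make_regression) and
# the parameter grids are placeholders, not the actual ones used.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_squared_error

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grids = {
    "linear": {"C": [0.1, 1, 10]},
    "rbf": {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]},
}

for kernel, grid in param_grids.items():
    search = GridSearchCV(SVR(kernel=kernel), grid, cv=5)  # 5-fold CV as described
    search.fit(X_train, y_train)
    best = search.best_estimator_
    for name, X_, y_ in (("train", X_train, y_train), ("test", X_test, y_test)):
        pred = best.predict(X_)
        print(f"{kernel} {name}: R2={r2_score(y_, pred):.3f}, "
              f"MSE={mean_squared_error(y_, pred):.2e}")
```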


2 Answers


I’ve written on Stack Exchange multiple times that $R^2$ is just a monotonic transformation of the $\operatorname{MSE}$, so having higher/lower $\operatorname{MSE}$ is equivalent to lower/higher $R^2$ (when $R^2$ is calculated the way I believe $R^2$ should be calculated).

$$ R^2=1-\left(\dfrac{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\bar y \right)^2 }\right) $$
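
Dividing numerator and denominator by $N$ makes this explicit: on a fixed data set the denominator is a constant, so $R^2$ is a strictly decreasing function of the $\operatorname{MSE}$.

$$ R^2 = 1 - \frac{\tfrac{1}{N}\sum_{i=1}^{N}\left( y_i-\hat y_i \right)^2}{\tfrac{1}{N}\sum_{i=1}^{N}\left( y_i-\bar y \right)^2} = 1 - \frac{\operatorname{MSE}}{\widehat{\operatorname{Var}}(y)} $$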

This, however, assumes a calculation on the same data set. When you calculate on different data sets, the denominator terms are different, meaning that the train and test $R^2$ values are not the same monotonic transformation of the $\operatorname{MSE}$. Consequently, it does not make sense to compare the two values any more than it makes sense to compare $R^2$ and $\operatorname{RMSE}$. Yes, both measure square loss, but they do so in different ways.

However, being interested in $R^2$ tells me that you are interested in square loss. Fortunately for you, then, it makes perfect sense to compare the train and test $\operatorname{MSE}$ values. For instance, when you use the “rbf” kernel, the test $\operatorname{MSE}$ is six times the train $\operatorname{MSE}$, suggesting a considerably worse out-of-sample fit than in-sample fit, which might indicate overfitting to coincidences in the training data rather than modeling the true phenomenon.
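
Using the numbers from the question, the test-to-train $\operatorname{MSE}$ ratios work out to:

$$ \left.\frac{\operatorname{MSE}_{\text{test}}}{\operatorname{MSE}_{\text{train}}}\right|_{\text{linear}} = \frac{8\times 10^{-3}}{6\times 10^{-3}} \approx 1.3, \qquad \left.\frac{\operatorname{MSE}_{\text{test}}}{\operatorname{MSE}_{\text{train}}}\right|_{\text{rbf}} = \frac{6\times 10^{-3}}{1\times 10^{-3}} = 6. $$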

– Dave

You are correct that a lower $R^2$ (or other performance metric) on a test set than on the training set is a sign of overfitting. This comparison is made per model, however: the fact that two different models score roughly the same on a test set doesn't tell you whether one of them is overfit. In your case, the learner with the RBF kernel shows signs of overfitting because its $R^2$ on the training set is quite a bit higher than on the test set, so the learner with a linear kernel is likely the better model. However, you should probably look at multiple performance metrics, such as RMSE, as well, and perhaps perform k-fold cross-validation to get a better idea of each model's performance before making a final decision.
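
A sketch of what such a comparison might look like with sklearn's `cross_validate` (the data and the fixed hyperparameters here are placeholders for illustration, not the questioner's actual setup):

```python
# Sketch: compare both kernels on several metrics with 5-fold cross-validation.
# The data and the fixed hyperparameters are placeholders for illustration.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

for kernel in ("linear", "rbf"):
    scores = cross_validate(
        SVR(kernel=kernel, C=1.0), X, y, cv=5,
        scoring=("r2", "neg_root_mean_squared_error"),
        return_train_score=True,
    )
    print(f"{kernel}: "
          f"train R2={scores['train_r2'].mean():.3f}, "
          f"test R2={scores['test_r2'].mean():.3f}, "
          f"test RMSE={-scores['test_neg_root_mean_squared_error'].mean():.3f}")
```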

  • I did not mention in the post, but this was done with GridSearchCV to find the optimal hyperparameters, with its default 5-fold cross-validation. Also, in the edit above I mentioned the MSE for both models. Are those values indicative of something? Also, is overfitting a problem if the model predicts equally well or a little better than another "not overfitted" model? – Sjotroll Jan 13 '23 at 14:56
  • @Sjotroll Those are really separate questions that deserve their own post. – Dave Jan 13 '23 at 15:23