
I am doing linear regression on a dataset. I divided the data into training (70%) and testing (30%) sets. Here are the metrics for the training and testing data:

Training data: $R^2$ is 0.85 and RMSE is 2339

Testing data: $R^2$ is 0.67 and RMSE is 2238

Based on RMSE, the model does better on the testing data, as it has the lower RMSE. But based on $R^2$, the model performs better on the training data. I was expecting that if the model gives a low RMSE it should also have a high $R^2$, and vice versa. Could anyone please explain these conflicting results?

Edit based on whuber's comment: The variance of the responses in the training set is higher, so I think that explains the higher value of $R^2$ on the training set. Based on this, can we say that $R^2$ can sometimes be misleading and that it is better to stick with RMSE?
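
For illustration, here is a minimal sketch (assuming the reported values follow the usual definition $R^2 = 1 - \mathrm{RMSE}^2/\operatorname{Var}(y)$, with the population variance of each set's own responses) that backs out the implied spread of the responses from the numbers above:

import numpy as np

def implied_sd(rmse, r2):
    # Var(y) = RMSE^2 / (1 - R^2), so sd(y) = RMSE / sqrt(1 - R^2)
    return rmse / np.sqrt(1 - r2)

print(implied_sd(2339, 0.85))  # ~6039 for the training responses
print(implied_sd(2238, 0.67))  # ~3896 for the testing responses

A similar RMSE therefore translates into a very different $R^2$ simply because the responses are much more spread out in the training set.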

Edit based on Dave's comment: $R^2=1-\frac{\sum(y_{test}-\hat{y}_{test})^2}{\sum(y_{test}-\bar{y}_{test})^2}$. The Python code for this:

from sklearn.metrics import r2_score

y_pred_test = model.predict(x_test)  # model is the model fitted on the training data
r2_score(y_test, y_pred_test)
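
As a sanity check, the formula above can also be computed by hand (a small sketch, assuming y_test and y_pred_test are NumPy arrays from the code above); it matches r2_score, which also centres on the test mean:

import numpy as np

# Direct computation of the displayed formula, centred on the test mean
ss_res = np.sum((y_test - y_pred_test) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2_manual = 1 - ss_res / ss_tot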
– Kim

  • Pay attention to the variances of the responses in the two datasets. – whuber Oct 06 '21 at 13:10
  • How do you calculate out-of-sample $R^2$? – Dave Oct 06 '21 at 13:11
  • Yes @whuber, the variances of responses are higher in the training set. I also edited the question based on this information. – Kim Oct 06 '21 at 13:27
  • @Dave, I compare predicted values of responses in the testing set with the actual values of responses in the testing set to calculate it. – Kim Oct 06 '21 at 13:32
  • Please write out the equation. – Dave Oct 06 '21 at 13:33
  • @Dave, yup made the required edits. – Kim Oct 06 '21 at 13:48
  • Based on your edit: if you care about prediction accuracy, then choose the relevant metric, such as RMSE, to track in cross validation. As you found, $R^2$ can be extremely misleading and should only be treated as a reference, not as the be-all and end-all. – Tylerr Oct 06 '21 at 14:02
  • You should be using $\bar y_{train}$ in the denominator. Remember that $R^2$ is a comparison of model performance to performance of a model that naïvely guesses the mean every time. It only makes sense for that mean to be the in-sample (training) mean. – Dave Oct 06 '21 at 14:14
  • Thanks @Dave. That makes sense! – Kim Oct 06 '21 at 15:12

1 Answer


You should be using $\bar y_{train}$ in the denominator, not $\bar y_{test}$. Remember that $R^2$ is a comparison of model performance to performance of a model that naïvely guesses the mean every time. It only makes sense for that mean to be the in-sample (training) mean.

You seem to be implementing this with the Python function sklearn.metrics.r2_score, which uses $\bar y_{test}$ in the denominator. I disagree with that choice and find such a formula unmotivated, as it can give exactly this kind of silly result.
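
As a rough sketch of the suggested fix (not code from the answer, and assuming the question's y_train, y_test, and y_pred_test arrays), out-of-sample $R^2$ with the training mean as the baseline can be computed manually and compared with sklearn's value:

import numpy as np
from sklearn.metrics import r2_score

# Baseline model: always predict the *training* mean
ss_res = np.sum((y_test - y_pred_test) ** 2)
ss_base = np.sum((y_test - np.mean(y_train)) ** 2)
r2_vs_train_mean = 1 - ss_res / ss_base

# sklearn's r2_score instead centres on the test mean
r2_vs_test_mean = r2_score(y_test, y_pred_test)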

– Dave