
When calculating the r^2 of some model on some test set, we're effectively comparing the MSE of that model's predictions to the MSE of a naive, baseline model that always predicts the sample mean of the target variable in the test set.

But that naive model does something that the actual model can't do: it "peeks" into the test data (to see the sample mean of its target variable).

Since the test data's sample mean may differ substantially from the training data's sample mean, it seems natural to define another statistic in which the naive model uses the training data's sample mean instead of the test data's.

Effectively, this definition just replaces $r^2 = 1- {MSE \over {\sigma^2_{y_{test}}}} $ with $r^2 = 1- {MSE \over {\sigma^2_{y_{test}}+({\bar y_{test}} - {\bar y_{train}})^2}} $, giving slightly higher values (derivation below).
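To make the difference concrete, here is a minimal sketch in plain Python (the data and predictions are hypothetical toy numbers, not from this post) comparing the usual test-mean baseline with the "no peeking" train-mean baseline:

```python
# Compare r^2 with a test-mean baseline against the version whose
# baseline predicts the *training* mean (so it never peeks at the test set).

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def r2_test_mean(y_test, y_pred):
    """Standard r^2 on the test set: baseline predicts the test-set mean."""
    mean_test = sum(y_test) / len(y_test)
    return 1 - mse(y_test, y_pred) / mse(y_test, [mean_test] * len(y_test))

def r2_train_mean(y_test, y_pred, mean_train):
    """Variant: baseline predicts the training-set mean instead."""
    return 1 - mse(y_test, y_pred) / mse(y_test, [mean_train] * len(y_test))

y_test = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
mean_train = 2.0  # differs from the test mean of 2.5

print(r2_test_mean(y_test, y_pred))               # test-mean baseline
print(r2_train_mean(y_test, y_pred, mean_train))  # train-mean baseline, slightly higher
```

Whenever the train and test means differ, the train-mean denominator is larger, so this variant comes out slightly higher, as claimed above.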

Is there a name for this statistic? Is the main point behind it ("don't let your 'baseline' peek into the test data") covered by some other statistic?

The derivation is as follows: $$ r^2_{\text{out-of-sample}} = 1- \dfrac{MSE}{{1 \over N_{test}} \sum_{i \in \text{Test Group}} (y_i - \bar{y}_{train})^2} $$ Where $MSE = {1 \over N_{test}} \sum_{i\in\text{Test Group}}(y_i-\hat{y}_i)^2$ (as in @Dave's answer below).
The denominator is equal to ${1 \over N_{test}} \sum_{i \in \text{Test Group}} (y_i-\bar{y}_{test}+\bar{y}_{test}-\bar{y}_{train})^2$, which can be expanded to ${1 \over N_{test}} \sum_{i \in \text{Test Group}} \left((y_i-\bar{y}_{test})^2+(\bar{y}_{test}-\bar{y}_{train})^2+2(y_i-\bar{y}_{test})(\bar{y}_{test}-\bar{y}_{train})\right)$, whose first term averages to $\sigma^2_{y_{test}}$ and whose last term vanishes because $\sum_{i \in \text{Test Group}} (y_i-\bar{y}_{test}) = 0$.
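The decomposition of the denominator can be checked numerically; here is a quick sketch on hypothetical numbers:

```python
# Check: (1/N) * sum((y_i - mean_train)^2)
#        == var(y_test) + (mean_test - mean_train)^2,
# because the cross term sums to zero.

y_test = [1.0, 2.0, 3.0, 4.0]
mean_train = 1.7  # arbitrary; any value works
n = len(y_test)

mean_test = sum(y_test) / n
var_test = sum((y - mean_test) ** 2 for y in y_test) / n  # population variance

lhs = sum((y - mean_train) ** 2 for y in y_test) / n
rhs = var_test + (mean_test - mean_train) ** 2

print(abs(lhs - rhs) < 1e-12)  # True
```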

We also note that $r^2_{\text{out-of-sample}} = \dfrac{r^2\,\sigma^2_{y_{test}}+(\bar{y}_{test}-\bar{y}_{train})^2}{\sigma^2_{y_{test}}+(\bar{y}_{test}-\bar{y}_{train})^2}$, obtained by substituting $MSE = (1-r^2)\,\sigma^2_{y_{test}}$ into the definition above.

1 Answer


You’re thinking about this the right way, but shame on whatever source told you to use a baseline model that always guesses the pooled mean of the test set.

The idea behind $R^2$ is that the most naïve but reasonable model for the conditional mean is the mean of all observations of the response variable. When you move over to the test set, the most naïve (but sensible) guess for the out-of-sample data is still the pooled mean of all the response values you have observed, i.e., the training-set mean.

(I say a “sensible” guess because you could do something totally silly and always guess zero or twelve or a billion zillion, but we have some information about the response variable.)

So I would call your idea the canonical out-of-sample $R^2$. I do not totally follow your equation, however. I would do out-of-sample $R^2$ as:

$$ 1- \dfrac{\underset{i\in\text{Test Group}}{\sum}\bigg(y_i-\hat{y}_i\bigg)^2} {\underset{i\in\text{Test Group}}{\sum} \bigg(y_i - \bar{y}_{train}\bigg)^2} $$
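A minimal sketch of this formula in plain Python (the numbers below are hypothetical toy values, not from the thread):

```python
# Out-of-sample R^2: sum of squared residuals over squared deviations
# of the test targets from the *training* mean.

def r2_out_of_sample(y_test, y_pred, mean_train):
    ss_res = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    ss_base = sum((y - mean_train) ** 2 for y in y_test)
    return 1 - ss_res / ss_base

print(r2_out_of_sample([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], 2.0))
```

Note that the $1/N_{test}$ factors in the question's version cancel, so this ratio of sums gives the same value.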

One note: the familiar “proportion of variance explained” interpretation of $R^2$ fails to hold in many situations.

EDIT

A discussion of how to calculate out-of-sample $R^2$ can be found at this link.

Dave
  • Thank you for your response! No source told me about this baseline - I'm just phrasing something that would result in the common r^2 metric. Is this "out-of-sample r^2" a known metric, that I could read about more somewhere? – Itamar Mushkin Aug 23 '20 at 06:51
  • Your formula is the same as mine; I'll add the derivation. – Itamar Mushkin Aug 23 '20 at 06:53
  • I think the penultimate sentence is not helpful. E.g. out of sample testing has been a classic in the time series forecasting literature before machine learning got involved there. – Richard Hardy Jul 11 '22 at 07:02