
It is known that $R^2$ should not be compared between two regressions where one uses features $X_1,\dots ,X_n$ to predict $Y$ and the other uses those same features to predict $\log(Y)$.

However, that argument relies on modeling $\mathbb E[\log(Y)\vert X=x]$, i.e., on taking the logarithm before taking the expected value.

If we consider two GLMs with Gaussian conditional distributions, one linearly modeling $\mathbb E[Y\vert X=x]$ and another linearly modeling $\log\left(\mathbb E[Y\vert X=x]\right)$, is the comparison of $R^2$ values still illegitimate?
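To see why the order of logarithm and expectation matters, here is a small Monte Carlo sketch (a hypothetical log-normal example, not from the post): if $\log(Y)\sim\mathcal N(\mu,\sigma^2)$, then $\mathbb E[\log Y]=\mu$ while $\log\mathbb E[Y]=\mu+\sigma^2/2$, so a model of $\mathbb E[\log Y]$ and a model of $\log\mathbb E[Y]$ are genuinely different targets.

```python
import math
import random

# Assumed example: Y is log-normal with log(Y) ~ Normal(mu, sigma^2).
random.seed(1)
mu, sigma = 1.0, 0.8
y = [math.exp(random.gauss(mu, sigma)) for _ in range(200_000)]

mean_log_y = sum(math.log(yi) for yi in y) / len(y)  # estimates mu
log_mean_y = math.log(sum(y) / len(y))               # estimates mu + sigma^2 / 2

print(round(mean_log_y, 2))  # close to 1.0
print(round(log_mean_y, 2))  # close to 1.0 + 0.8**2 / 2 = 1.32
```

By Jensen's inequality the second quantity exceeds the first whenever $Y$ has positive variance, which is the gap between OLS on $\log(Y)$ and a log-link GLM.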

Dave
  • I wasn't planning to post a self-answer, but as I wrote the question and included my suspicions, I began to realize that I believed my suspicions to be right. Nonetheless, I welcome any other thoughts on this matter and am willing (eager, even) to accept an answer by another member. – Dave Oct 04 '22 at 17:53

1 Answer


LEGITIMATE!

We are assessing the performance of two models of $\mathbb E[Y\vert X=x]$:

  1. $\mathbb E[Y\vert X=x] = X\hat\beta_{ols}$

  2. $\mathbb E[Y\vert X=x] = e^{X\hat\beta_{glm}} \iff \log\left(\mathbb E[Y\vert X=x]\right) = X\hat\beta_{glm}$

We know that $R^2$ in the second case lacks the convenient "proportion of variance explained" interpretation, but if we view $R^2$ as a (decreasing) function of the square loss, the comparison is fine whenever square loss is the criterion we care about (keeping in mind the usual caveats about $R^2$).

This is no different from evaluating any other two models of the same data and comparing them on square loss (or some function of square loss like $RMSE$ or $R^2$).
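As a concrete sketch of that comparison (pure stdlib, with synthetic data whose true conditional mean is exponential in $x$; the Gauss-Newton fitter below is a stand-in for the IRLS a GLM routine would use, not anything from the post):

```python
import math
import random

def r2(y, yhat):
    """R^2 computed as 1 - SSE/SST, always on the original scale of y."""
    ybar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - sse / sst

def ols(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return ybar - (sxy / sxx) * xbar, sxy / sxx

def loglink_fit(x, y, iters=25):
    """Gaussian GLM with log link: minimize sum (y - exp(b0 + b1*x))^2,
    fitted here by Gauss-Newton with a warm start from OLS on log(y)."""
    b0, b1 = ols(x, [math.log(max(yi, 1e-6)) for yi in y])
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        r = [yi - mi for yi, mi in zip(y, mu)]
        # J^T J and J^T r, where the Jacobian rows are mu_i * (1, x_i)
        j00 = sum(mi * mi for mi in mu)
        j01 = sum(mi * mi * xi for mi, xi in zip(mu, x))
        j11 = sum((mi * xi) ** 2 for mi, xi in zip(mu, x))
        g0 = sum(mi * ri for mi, ri in zip(mu, r))
        g1 = sum(mi * xi * ri for mi, xi, ri in zip(mu, x, r))
        det = j00 * j11 - j01 * j01
        b0 += (j11 * g0 - j01 * g1) / det
        b1 += (j00 * g1 - j01 * g0) / det
    return b0, b1

# Synthetic data: E[Y|X=x] = exp(0.5 + 0.3 x), plus Gaussian noise.
random.seed(0)
x = [i / 10 for i in range(100)]
y = [math.exp(0.5 + 0.3 * xi) + random.gauss(0, 0.3) for xi in x]

b0_lin, b1_lin = ols(x, y)
pred_lin = [b0_lin + b1_lin * xi for xi in x]

b0_log, b1_log = loglink_fit(x, y)
pred_log = [math.exp(b0_log + b1_log * xi) for xi in x]

# Both predictions estimate E[Y|X=x] on the SAME scale, so comparing the two
# R^2 values is just comparing the two models on square loss.
print(f"linear mean:   R^2 = {r2(y, pred_lin):.3f}")
print(f"log-link mean: R^2 = {r2(y, pred_log):.3f}")
```

Since the true mean is exponential here, the log-link model should come out ahead; on data with a genuinely linear mean, the ordering would flip. Either way, both $R^2$ values are monotone transforms of square loss on the same $Y$, which is what makes the comparison legitimate.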

Dave