I'm working on a rental price prediction project and I want to make sure I'm evaluating things correctly. After fitting some models and computing R-squared on the training and test sets, the gap between the two seemed too large: roughly 0.76 on training but only 0.68 on testing. This looks like overfitting, and I tried a number of techniques to reduce it, but nothing improved things much. I later re-inspected the data and found that my target y (prices) is somewhat skewed, so I applied a log transformation to it, and I got better scores by doing so: the R-squared on training is now around 0.78 and on testing 0.75.
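For context, here is a minimal sketch of what I'm doing (the feature matrix `X`, the prices `y`, and the Ridge model are just placeholders for my actual setup):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# X, y = feature matrix and rental prices (placeholders for my data)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Original target: train R^2 ~0.76, test R^2 ~0.68 in my case
model = Ridge().fit(X_train, y_train)
print(r2_score(y_train, model.predict(X_train)))
print(r2_score(y_test, model.predict(X_test)))

# Log-transformed target: train R^2 ~0.78, test R^2 ~0.75
model_log = Ridge().fit(X_train, np.log1p(y_train))
print(r2_score(np.log1p(y_train), model_log.predict(X_train)))
print(r2_score(np.log1p(y_test), model_log.predict(X_test)))
```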
I know from this question: "How to compute the R-squared for a transformed response variable?" that I can't directly compare R-squared between two models with different dependent variables, but my point is that I seem to have reduced overfitting a little by doing this.
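If it helps, I assume I could at least put both models on the same scale by back-transforming the log-model's predictions and scoring them against the raw prices (a sketch reusing `model_log`, `X_test`, and `y_test` from above):

```python
# Back-transform predictions to the original price scale so both
# models are scored against the same y_test. Note: exponentiating
# predictions made on the log scale tends to underestimate the
# conditional mean of the price (retransformation bias).
pred_prices = np.expm1(model_log.predict(X_test))
print(r2_score(y_test, pred_prices))
```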
I just want to make sure I'm on the right track; any suggestions are appreciated.
Edit: Someone pointed out that I shouldn't decide to transform the data just by looking at the marginal distribution of y, but then why do people on Kaggle do exactly that, e.g. https://www.kaggle.com/code/apapiu/regularized-linear-models/notebook?