
Lately I lost access to SPSS, and instead of using Python or R, I have been performing my analyses in a free program called Jamovi.

The thing is, this software doesn't have the non-linear regression models I liked to use in SPSS, so I tend to fit only linear regression models.

Because of this, I have gotten used to applying log transforms to the data before fitting the model. I noticed that most of the time, after log-transforming a predictor in the linear regression model, the $R^2$ increased.

I know that a log transform can reduce the skewness of the data, but I wonder why it always seems to improve the linear regression model (using only the log-transformed variable as a predictor, not the original variable).

Is this a true improvement or is it some kind of statistical artifact? Why does this happen? Should I use it every time?

gabriel
  • Do you mean that the $R^2$ of the $y\sim X$ model is lower than the $R^2$ of the $\log(y)\sim X$ model? – Dave Oct 17 '23 at 12:33
  • No, the other way around: I take the log of the predictor, not the predicted variable. – gabriel Oct 17 '23 at 12:34
  • Some useful links: https://stats.stackexchange.com/questions/18844/when-and-why-should-you-take-the-log-of-a-distribution-of-numbers , https://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va , https://stats.stackexchange.com/questions/27951/when-are-log-scales-appropriate – mkt Oct 17 '23 at 13:03

2 Answers


There's no reason why this would always improve your $R^2$; most likely it happens because you are forcing a linear fit where it isn't ideal, and your transformation 'linearizes' the predictor somewhat.

Here's a quick counterexample where the data generating mechanism is in fact linear:

set.seed(1)
x <- runif(1e2)
y <- x + rnorm(1e2, 0, 0.1)

summary(fit <- lm(y ~ x))$r.squared       # 0.897
summary(lfit <- lm(y ~ log(x)))$r.squared # 0.736

The first 100,000 seeds all show a better $R^2$ for the untransformed predictor.
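Conversely (to sketch the other direction rather than assert it): if the data generating mechanism really is logarithmic in the predictor, the transformed model comes out ahead. This mirrors the simulation above; the lower bound of 0.1 and the noise level are arbitrary choices made only to keep log(x) moderate.

set.seed(1)
x <- runif(1e2, 0.1, 1)            # keep x away from 0 so log(x) stays moderate
y <- log(x) + rnorm(1e2, 0, 0.1)   # data generating mechanism is logarithmic in x

summary(lm(y ~ x))$r.squared       # lower: a straight line in x misses the curvature
summary(lm(y ~ log(x)))$r.squared  # higher: log(x) is the correct scale here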

PBulls

It doesn't. Here is a very simple example where $R^2$ is virtually identical (but a TINY bit higher before the transform). This is R code; anything after a # is a comment.

set.seed(1234)

x <- rnorm(1000, 10, 1)
y <- 3*x + rnorm(1000, 0, 5)

m1 <- lm(y ~ x)
m2 <- lm(y ~ log(x))

summary(m1)  # Multiple R-squared: 0.3083
summary(m2)  # Multiple R-squared: 0.3027

And if I run the code again (without the first line, so with a different seed), the untransformed version has an $R^2$ of 0.30 and the transformed one 0.27.
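To check that this isn't just a quirk of two particular seeds, here is a rough sketch along the same lines (the same simulation as above, simply repeated; the 1000 replications are an arbitrary choice) that counts how often each model gives the higher $R^2$:

r2_diff <- replicate(1000, {
  x <- rnorm(1000, 10, 1)
  y <- 3*x + rnorm(1000, 0, 5)
  summary(lm(y ~ x))$r.squared - summary(lm(y ~ log(x)))$r.squared
})
mean(r2_diff > 0)  # share of simulations where the untransformed model has the higher R^2
summary(r2_diff)   # typical size of the difference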

Peter Flom
  • Oh, thank you. I guess the data I use just happens to benefit from that transform whenever I use it. – gabriel Oct 17 '23 at 12:42
  • You're welcome. Since you are new here, I'll say that, if an answer meets your needs, it is common to "accept" it by clicking the check mark. – Peter Flom Oct 17 '23 at 12:46
  • But it still raises the question of whether most data tends to benefit from the transformation anyway. If I used it blindly, it might be that most of the time the transformation would lead to an improvement in $R^2$, especially if I deal with skewed data. – gabriel Oct 17 '23 at 12:46
  • @gabriel It's not necessarily a benefit, it's just different. I wouldn't base a transformation decision on differences in $R^2$. – mkt Oct 17 '23 at 12:54
  • "Most data" is too vague to be evaluated. But it's certainly possible that, if your data are very right-skewed, this could help. Also, if your variable is skewed, your errors are often skewed, and this violates an assumption. – Peter Flom Oct 17 '23 at 12:55