4

I have two linear models, one with a log-transformed response variable, one without. I am trying to decide which is the better model before refining with stepwise selection etc.

Model 1 - Non-transformed model: Higher Adjusted R-Squared.

[Model 1 Summary] [Model 1 Plots]

Model 2 - Log-transformed response variable: Lower Adjusted R-Squared, better Q-Q plot (Closer to normal).

[Model 2 Summary] [Model 2 Plots]

What is more important, R-Squared value or getting closer to normality?

  • 2
    Welcome to Cross Validated! The $R^2$ values before and after the transformations are not comparable, so the post-transformation value being lower tells you little. Why do you want to $\log$-transform your data, just to achieve normal residuals? – Dave Mar 29 '24 at 15:38
  • Thank you! Yep, I wanted to log-transform my response variable to get rid of the heavy tails on my Q-Q plot. I'm stuck deciding whether or not it was worth it. – Andrew Maclay Mar 29 '24 at 15:41
  • 3
    Your QQ-plot of the first model doesn't show heavy tails, it shows tails that are too light for normality. – Christian Hennig Mar 29 '24 at 18:47
  • It looks as if the number of bathrooms may be the main driver of your models, which is why you get four clumps of fitted values. Something simpler like 40000+170000*numberofbathrooms would not be far out. – Henry Mar 30 '24 at 00:21
  • 3
    Frame challenge: consider the possibility that in many situations the answer might be "neither is of much import". The correct model might have quite low $R^2$ (a noisy process does not of itself imply any deficiency in the model) and mildly non-normal residuals (especially with light tails like you have) might be of little consequence in large samples (aside perhaps some small loss of power relative to a more suitable parametric model for the noise). There's no obvious reason to focus much thought on either in general. – Glen_b Mar 30 '24 at 00:21

4 Answers

10

You have some good answers already. I'd just like to add that, aside from the particulars of your case, the question doesn't really make sense.

The question of normality is about the assumptions of the model and what happens if they are violated.

$R^2$ is a measure of how much variation in the dependent variable is accounted for by the model.

You could compare, say, the assumption of normality with the assumption of homoscedasticity. Or you could compare, say, $R^2$ with RMSE. But I don't see how you can compare across the two categories.
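To make the "same category" comparison concrete: $R^2$ and RMSE are both computed from the same residuals of the same fit, so it is coherent to weigh one against the other. A minimal sketch with made-up data (not the OP's), fitting OLS by hand with numpy:

```python
import numpy as np

# Synthetic data: a hypothetical example, not the OP's housing data
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 + 1.5 * x + rng.normal(0, 2.0, 200)

# Ordinary least squares fit
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Both metrics summarize fit from the same residuals, so they are comparable
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)  # share of variance explained
rmse = np.sqrt(np.mean(resid**2))                      # typical error, in units of y
```

A normality diagnostic, by contrast, is a statement about the error distribution, not about fit quality, so it lives in a different category from either number above.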

Peter Flom
6
  1. Issues with normality in the non-transformed model don't seem to be critical. Note that although formally a normal model is assumed, nothing in reality is really normally distributed, and the relevant question is not whether residuals are normal or not, but rather whether assumptions are violated in ways that can harm the analysis. As the residual distribution looks to have lighter tails than the normal, if anything, and isn't terribly skewed, I don't see any problem with this.

  2. The second model seems to have an issue with homoscedasticity (equal error variances), as is apparent in the Scale-Location plot. I don't see such an issue with the first model.

  3. Depending on how exactly you want to use the model, it may be a good thing for interpretation and prediction to have a model that predicts the response directly rather than its log. (You can in principle run cross-validation to explore prediction quality, and in order to do this in a way that can compare different "versions" of the response, you could transform back predictions from the log-model to the original scale by taking exponentials. If you do this, chances are that the model based on the untransformed response will be quite a bit better, although in some situations predicting the log may be as good as or better than predicting the original response.)

  4. Regarding "what's more important, $R^2$ or normality", the key issue here is that it is important how normality is violated and how this plays out. In any case I'd trust cross-validation more than $R^2$. To what extent $R^2$ is informative depends on what exactly you want to do, for example prediction outside the original value range (about which $R^2$ doesn't say much). I don't see a problem with the normality assumption here; if anything the issue with homoscedasticity in the log-model looks worse.

  5. Note also that technically, assumptions for inference are invalidated by choosing analyses depending on what you see in the data. So if you first think that you want to model untransformed data, and then you see some plots and change your mind and decide to transform the response, technically model assumptions are already violated, because they don't allow for such a data-dependent decision process. If there are serious issues with the first model, it may be worth paying that price anyway, but I'd usually recommend sticking to the first model unless something really serious turns up.

  6. The somewhat outlying observation 32 in the log model together with the overall impression of the plots makes me think that normality actually doesn't even look better in that model than in the first one.
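The cross-validation comparison suggested in point 3 could be sketched as follows, with synthetic data standing in for the OP's. The `np.exp` back-transform puts the log-model's predictions on the original response scale, so both model versions are scored against the same outcome:

```python
import numpy as np

# Made-up positive, right-skewed response; not the OP's data
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(1, 5, n)
y = np.exp(0.5 + 0.4 * x + rng.normal(0, 0.3, n))

def cv_rmse(x, y, log_response, k=5):
    """K-fold CV RMSE, always scored on the ORIGINAL response scale."""
    idx = rng.permutation(y.size)
    errs = []
    for f in np.array_split(idx, k):
        train = np.setdiff1d(idx, f)
        X_tr = np.column_stack([np.ones(train.size), x[train]])
        X_te = np.column_stack([np.ones(f.size), x[f]])
        target = np.log(y[train]) if log_response else y[train]
        beta, *_ = np.linalg.lstsq(X_tr, target, rcond=None)
        pred = X_te @ beta
        if log_response:
            pred = np.exp(pred)  # back-transform to the original scale of y
        errs.append(np.sqrt(np.mean((y[f] - pred) ** 2)))
    return float(np.mean(errs))

rmse_raw = cv_rmse(x, y, log_response=False)
rmse_log = cv_rmse(x, y, log_response=True)
```

Whichever version yields the lower out-of-sample RMSE on the original scale is the better predictor of the response itself, regardless of what the in-sample $R^2$ values say.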

  • 4
    +1 That fifth piece seems to get forgotten a lot, that fiddling with your modeling based on seeing unsatisfactory results distorts downstream inferences. – Dave Mar 29 '24 at 16:00
  • 2
    And the OP is compounding the issue which @Dave refers to by planning to "refine the model with stepwise". – mdewey Mar 29 '24 at 16:25
  • I don't agree with item 5. Seeing that one has wild outliers may prompt the use of Spearman instead of Pearson correlation. Seeing a binary Y may prompt the use of logistic instead of OLS regression. These are valid choices. – rolando2 Mar 29 '24 at 20:36
  • 2
    @rolando2 I don't say you shouldn't do it in such cases. However I also say that the theory behind all standard inference does not allow for conditioning on data-dependent choices. I start by saying model assumptions are not perfectly fulfilled anyway, so the fact that this violates model assumptions is not always enough of a reason not to do it, but it is a problem, as people who have investigated it know. For some reasoning and literature see this: https://jdssv.org/index.php/jdssv/article/view/73 – Christian Hennig Mar 29 '24 at 22:38
  • 1
    @ChristianHennig, that is indeed a nice paper and in fact you should add that in the answer, to the 5th point (and not in a comment). – User1865345 Mar 30 '24 at 08:52
  • @ChristianHennig Thank you; I will check out the article. – rolando2 Mar 31 '24 at 14:11
6

You seem to be under the impression that normality is an assumption for linear regression. Linear regression is just a model, and it can be fit by ordinary least squares, as you did. Normality is not an assumption required for the coefficients to be unbiased or consistent, nor is it an assumption required for valid inference. It is an assumption required for exact inference using the t-statistics computed from the usual ordinary least squares standard error estimates. But who says exact inference is necessary, and who says you have to use ordinary least squares standard error estimates?

By the central limit theorem, the coefficients are normally distributed in large samples under mild conditions on the error distribution, which does not have to be normal. In addition, there are robust standard errors you can compute that maintain valid inference even when other assumptions about the model, like homoscedasticity, are violated. The sandwich standard error and bootstrap standard errors are such examples.
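As a rough illustration of the sandwich idea, here is a hand-rolled HC0 sandwich estimator on synthetic heteroscedastic data (a sketch for intuition only; statistical packages provide robust standard errors directly):

```python
import numpy as np

# Synthetic data with heteroscedastic, non-normal (uniform) errors:
# the error spread grows with x, violating the constant-variance assumption
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0, 4, n)
y = 1.0 + 2.0 * x + rng.uniform(-1, 1, n) * (0.5 + x)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta

XtX_inv = np.linalg.inv(X.T @ X)

# Classical OLS standard errors (assume a single constant error variance)
sigma2 = e @ e / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 sandwich: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}
meat = X.T @ (X * e[:, None] ** 2)
se_sandwich = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
```

The sandwich estimator weights each observation's contribution by its own squared residual, so it stays valid under heteroscedasticity, where the classical formula does not.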

I highly recommend this post, which describes what each assumption means and what it gets you.

Noah
4

If you violate the normality assumption, you can still fit a linear regression model, but your conclusions won't be correctly interpreted, since the model tells you what happens conditional on the assumptions of a simple linear model holding true.

Now, $R^2$ has received a lot of criticism in the literature, for example in this link. Also, your $R^2$ values before and after the transformation are not really comparable, since the models fit different outcomes. In the end, $R^2$ is not the best measure by which to conclude a good fit (it is not even always seen as a goodness-of-fit measure).
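To see why the two $R^2$ values aren't comparable: each is defined relative to the total sum of squares of a different outcome ($y$ versus $\log y$), so the denominators measure different quantities. A small illustration with made-up data:

```python
import numpy as np

# Hypothetical positive response, not the OP's data
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(1, 5, n)
y = np.exp(1.0 + 0.5 * x + rng.normal(0, 0.4, n))

X = np.column_stack([np.ones(n), x])

def r2(target, pred):
    # R^2 is always relative to the variance of THIS target
    return 1 - np.sum((target - pred) ** 2) / np.sum((target - target.mean()) ** 2)

# R^2 of the log-model on the log scale (what the model summary reports)
beta_log, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
r2_log_scale = r2(np.log(y), X @ beta_log)

# The same fitted model scored on the original scale after back-transforming
r2_original_scale = r2(y, np.exp(X @ beta_log))
```

The two numbers differ even though they come from one and the same fitted model, because "variance explained" refers to the variance of $\log y$ in one case and of $y$ in the other.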

In the end, I think it is much better practice to fit a model of which the assumptions are not violated, rather than a model with violated assumptions, but that has some diagnostic measure that gives better values.

  • 1
    That linked article takes such a narrow view of $R^2$ and of modeling in general that I find it easy to dismiss. For instance, Gneiting & Resin argue in their 2023 "Regression diagnostics meets forecast evaluation" that the classical $R^2 = 1-RSS/TSS$ is extremely reasonable, as I have argued on here many times. (Maybe they read my Cross Validated posts!) $//$ Standard linear regression inferences based on $iid$ Gaussian errors are fairly robust to non-Gaussian distributions, especially when the sample size is a few hundred like it is here. – Dave Mar 29 '24 at 15:57
  • 1
    I completely agree! $R^2$ isn't really as bad as the article says, in my opinion, but I put the link there because it does discuss some reasons to watch out for it. – Mathemagician777 Mar 29 '24 at 16:03