I am using the well-known diamonds dataset and fitting the following model by linear regression: $\ln(\text{price}) = \beta_0 + \beta_1 \cdot \ln(\text{x}) + \beta_2 \cdot \text{clarity}$

where the predictor x (carat) is numerical and clarity is categorical (so $\beta_2$ stands for one coefficient per dummy).
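For concreteness, here is a minimal sketch of how such a model can be set up, assuming the numerical predictor is carat as stated. It does not use the real diamonds data: the data are synthetic, and the clarity levels, "true" coefficients and noise level are all hypothetical values chosen only for illustration. The first level serves as the baseline, so each remaining level gets its own dummy coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in for the diamonds data (NOT the real dataset).
levels = np.array(["I1", "SI2", "SI1", "VS2"])  # illustrative subset of grades
carat = rng.uniform(0.2, 3.0, size=n)
clarity = rng.choice(levels, size=n)

# Hypothetical coefficients, used only to generate the synthetic response.
true_intercept, true_slope = 8.0, 1.8
true_shift = {"I1": 0.0, "SI2": 0.4, "SI1": 0.6, "VS2": 0.7}
log_price = (
    true_intercept
    + true_slope * np.log(carat)
    + np.array([true_shift[c] for c in clarity])
    + rng.normal(0.0, 0.2, size=n)
)

# Design matrix: intercept, ln(carat), one dummy per non-baseline clarity level.
dummies = np.column_stack([(clarity == lv).astype(float) for lv in levels[1:]])
X = np.column_stack([np.ones(n), np.log(carat), dummies])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, log_price, rcond=None)
residuals = log_price - X @ coef

# R^2 = 1 - SS_res / SS_tot (valid here because the model has an intercept).
r_squared = 1.0 - residuals.var() / log_price.var()

print("coefficients:", np.round(coef, 3))
print("R^2:", round(float(r_squared), 4))
```

The `residuals` array is what the diagnostic plots and statistics in the rest of the question would be computed from.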

The model reaches an $R^2$ of about 96% and the fit seems quite good, but the residuals are not normal. My question is: in light of the following information, is this a problem, and if so, how should I address it?

The plots:

y_test vs y_pred:


Residual vs y_test:


Residual vs predictor (x):


Residual histogram:


QQ plot:

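As a rough illustration of what a normal QQ plot compares, the sketch below builds the plot coordinates by hand: sorted standardized residuals against theoretical normal quantiles. The residuals here are synthetic (a scaled t distribution with 8 degrees of freedom, chosen as an assumption to mimic mildly heavy tails), and the function name `qq_points` and the plotting positions `(i - 0.5) / n` are illustrative choices, not something taken from the question.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical normal quantiles, sorted standardized residuals)."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    # Plotting positions strictly inside (0, 1).
    probs = (np.arange(1, n + 1) - 0.5) / n
    std_normal = NormalDist()
    theo = np.array([std_normal.inv_cdf(p) for p in probs])
    standardized = (r - r.mean()) / r.std(ddof=1)
    return theo, standardized

# Synthetic, mildly heavy-tailed residuals (assumption, for illustration only).
rng = np.random.default_rng(1)
resid = 0.2 * rng.standard_t(df=8, size=20_000)

theo, samp = qq_points(resid)

# Points near the 45-degree line give a correlation close to 1;
# heavy tails show up as the extreme points bending away from the line.
qq_corr = np.corrcoef(theo, samp)[0, 1]
tail_gap = samp[-1] - theo[-1]

print("QQ correlation:", round(float(qq_corr), 4))
print("largest residual minus its normal quantile:", round(float(tail_gap), 2))
```

Plotting `samp` against `theo` gives the usual QQ plot; a positive `tail_gap` means the largest residual sits above the reference line, which is the heavy-tail signature.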

Besides that, residual stats:

mean = 0.000654, std = 0.196600, skew = -0.264115, kurtosis = 1.586213
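A minimal sketch of how such summary statistics can be computed is given below. It assumes the reported kurtosis is excess kurtosis (0 for a normal distribution), which the question does not state; the helper name `residual_summary` is hypothetical, and it uses simple moment estimators rather than the bias-corrected versions some libraries report. Synthetic normal and heavy-tailed samples stand in for real residuals.

```python
import numpy as np

def residual_summary(residuals):
    """Mean, std, skewness and excess kurtosis from simple sample moments."""
    r = np.asarray(residuals, dtype=float)
    centered = r - r.mean()
    m2 = np.mean(centered**2)
    m3 = np.mean(centered**3)
    m4 = np.mean(centered**4)
    return {
        "mean": float(r.mean()),
        "std": float(r.std(ddof=1)),
        "skew": float(m3 / m2**1.5),
        "excess_kurtosis": float(m4 / m2**2 - 3.0),  # 0 for a normal distribution
    }

rng = np.random.default_rng(2)

# Synthetic comparison (assumption): exactly normal vs. mildly heavy-tailed.
normal_stats = residual_summary(rng.normal(0.0, 0.2, size=200_000))
heavy_stats = residual_summary(0.2 * rng.standard_t(df=8, size=200_000))

print("normal:", normal_stats)
print("heavy :", heavy_stats)
```

Under that reading, a positive excess kurtosis indicates tails heavier than normal, while the small negative skew indicates a slight left asymmetry.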

Should I be worried? Should the non-normality of my residuals be addressed?

  • Personally, I don't see many problems with your model, and the residuals look normally distributed to me. – Alberto Jul 12 '22 at 18:13
  • Visually, yes. But when I perform tests on them (shapiro, deangelo, etc) it always comes back non-normal. By a large margin too, the statistic is always huge and the p-value is always 0.0, meaning it's not even "on the fence". – Javier Ventajas Hernández Jul 12 '22 at 22:27
  • 1. You have so much data that you can't see the relative density in different parts of the plot; I recommend smaller points or using transparency (making the points partly transparent, if you're using software that will allow you to do that) or both. 2. "when I perform tests on them (shapiro, deangelo, etc) it always comes back non-normal" --- sure, they'll all reject, but that's useless information. Of course this distribution is not actually normal. With a huge sample size naturally you can see that it isn't exactly normal. ... – Glen_b Jul 12 '22 at 23:56
  • ... The relevant question is "how much will that matter" (which is not in any sense a question of significance; in large samples you can detect trivial deviations from normality). It's more like an effect size issue, but the properties you're interested in are not equally sensitive to every kind of deviation from normality either. ... (By the way, your plots look fine to me too; you may have issues here somewhere, but normality of residuals is not likely to be among them. Your distribution is heavy-tailed, but that doesn't seem of itself to be an issue.) – Glen_b Jul 12 '22 at 23:56
  • There are features in your plots that would make me wonder if there's something (perhaps a missing but relevant variable, perhaps something else) that you might need to account for in your model. It's not necessarily the case, but for example that line of high points in the left half of the residual-vs-x plot is somewhat curious, enough to make me wonder if there's something relevant that's not in the model. – Glen_b Jul 13 '22 at 00:02
  • Not AFAIK. The other features are highly collinear with the two predictors I used. – Javier Ventajas Hernández Jul 13 '22 at 07:19
  • @JavierVentajasHernández Tests of normality such as Shapiro-Wilk will pick up on small deviations from normality. With a large dataset, you will usually find that the tests show the residuals to be non-normal. I would rely more on graphs like the Q-Q plot (or P-P plot), which looks acceptable. Your model seems to be doing reasonably well. – Spur Economics Jul 13 '22 at 07:44
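The point made in the comments, that with a large sample a normality test rejects even trivial deviations, can be sketched with a small simulation. The sketch uses the Jarque-Bera test, swapped in for the tests named in the thread because it can be written with NumPy alone; its p-value comes from the chi-square distribution with 2 degrees of freedom, whose survival function is exactly `exp(-stat / 2)`. The data are synthetic (a t distribution with 30 degrees of freedom, an assumption chosen because it is very close to normal), and the same draw is tested at two sample sizes.

```python
import numpy as np

def jarque_bera(x):
    """Jarque-Bera statistic and its asymptotic chi-square(2) p-value."""
    x = np.asarray(x, dtype=float)
    n = x.size
    c = x - x.mean()
    m2 = np.mean(c**2)
    skew = np.mean(c**3) / m2**1.5
    excess_kurt = np.mean(c**4) / m2**2 - 3.0
    stat = n / 6.0 * (skew**2 + excess_kurt**2 / 4.0)
    p_value = float(np.exp(-stat / 2.0))  # chi-square(2) survival function
    return float(stat), p_value

rng = np.random.default_rng(3)

# Nearly normal data: t with 30 degrees of freedom (assumption, illustration only).
big = rng.standard_t(df=30, size=200_000)
small = big[:200]  # same distribution, far fewer observations

jb_small, p_small = jarque_bera(small)
jb_large, p_large = jarque_bera(big)

print(f"n = {small.size:>7}: JB = {jb_small:10.2f}, p = {p_small:.3g}")
print(f"n = {big.size:>7}: JB = {jb_large:10.2f}, p = {p_large:.3g}")
```

The size of the deviation from normality is the same in both cases; only the sample size changes. The statistic scales with `n`, so the p-value collapses toward zero for the large sample, which says little about whether the deviation matters for the regression.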