
I am trying to create a regression model for prediction. I need to generate prediction/confidence intervals for my model.

I am trying to decide whether to use a quantile regression or linear regression model. If I use a linear regression model I need to ensure my prediction residuals are normally distributed to compute valid confidence intervals.

I'm not experienced in testing for data normality. My sample size is too high to generate a meaningful p-value using the Shapiro test, so I have plotted the residuals on a histogram and produced a QQ-plot.

I am not sure how to interpret these results and would appreciate any input. The shape looks broadly normal, but the peak is very narrow and high, and the tails are pretty long.

What do you think? Is my data normally distributed?

[Images: histogram of the residuals; QQ plot of the residuals]

Archie
  • 205
  • 3
    No, the variable you've plotted (residuals?) is far from normally distributed. Also see this: How to interpret a QQ plot – dipetkov Aug 12 '22 at 11:55
  • 3
    What this looks like is the Laplace distribution. Interestingly, a) there is a link between the Laplace distribution and minimizing mean absolute deviations (MAD); b) quantile regression minimizes the MAD to estimate the median. So perhaps you'll get a more helpful answer if you provide information about the analyses you've done to generate these plots. – dipetkov Aug 12 '22 at 12:25
  • 1
    @dipetkov Those comments could be a good answer, as they get to the nub of the matter. – Nick Cox Aug 12 '22 at 12:29

1 Answer


This is an extended comment rather than an answer.

The answer to your actual question is: No, both the histogram and the QQ plot indicate the distribution (of the residuals? of the response? of some other variable?) is not Normal.
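To make the reading of a QQ plot a bit more concrete, here is a minimal sketch using only the Python standard library. It is not the asker's data: the "residuals" are simulated from a heavy-tailed (Laplace) stand-in, and the helper names `qq_points` and `sample_minus_theoretical` are made up for this illustration. It compares standardized sample quantiles with standard Normal quantiles at a few positions.

```python
import random
import statistics
from statistics import NormalDist


def qq_points(values):
    """Return (theoretical, sample) quantile pairs for a Normal QQ plot."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    standardized = sorted((v - mean) / sd for v in values)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, standardized))


def sample_minus_theoretical(points, prob):
    """Gap between the sample and the Normal quantile near probability `prob`."""
    theo, samp = points[int(prob * len(points))]
    return samp - theo


# ASSUMPTION: simulated heavy-tailed residuals, not the data from the question.
# The difference of two independent exponentials is Laplace distributed.
rng = random.Random(0)
residuals = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(5000)]

points = qq_points(residuals)
upper_quartile_gap = sample_minus_theoretical(points, 0.75)
upper_tail_gap = sample_minus_theoretical(points, 0.995)
lower_tail_gap = sample_minus_theoretical(points, 0.005)

print(f"gap at 0.75 quantile:  {upper_quartile_gap:+.3f}")
print(f"gap at 0.995 quantile: {upper_tail_gap:+.3f}")
print(f"gap at 0.005 quantile: {lower_tail_gap:+.3f}")
```

For Normal data all three gaps would be close to zero. For a sharply peaked, heavy-tailed sample the points sit closer to the center than the reference line in the middle of the plot and further out at both ends, which gives the S-shaped departure from the line.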

The histogram calls to mind the Laplace distribution, which is symmetric like the Normal but with heavier tails. Compared with a Normal of the same variance, it puts more probability very close to the center/mean and in the tails, and less in between, so the peak is sharp.

[Image: Laplace distribution]

Source: Wikipedia
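As a rough numeric companion to the picture, the sketch below compares a Normal and a Laplace density that are matched to have the same variance. The functions `laplace_pdf` and `laplace_sf` are written out by hand for this illustration (they are not taken from any library), and the choice of unit variance is an arbitrary assumption.

```python
import math
from statistics import NormalDist


def laplace_pdf(x, loc=0.0, scale=1.0):
    """Density of the Laplace distribution with location `loc` and scale `scale`."""
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)


def laplace_sf(x, loc=0.0, scale=1.0):
    """Upper tail probability P(X > x) of the Laplace distribution."""
    z = (x - loc) / scale
    return 0.5 * math.exp(-z) if z >= 0 else 1.0 - 0.5 * math.exp(z)


sigma = 1.0                  # ASSUMPTION: both distributions have unit variance
normal = NormalDist(mu=0.0, sigma=sigma)
b = sigma / math.sqrt(2.0)   # Laplace variance is 2 * b**2, so this matches sigma

peak_normal = normal.pdf(0.0)
peak_laplace = laplace_pdf(0.0, scale=b)

shoulder_normal = normal.pdf(sigma)            # density one SD from the center
shoulder_laplace = laplace_pdf(sigma, scale=b)

tail_normal = 1.0 - normal.cdf(3.0 * sigma)    # P(X > 3 SD)
tail_laplace = laplace_sf(3.0 * sigma, scale=b)

print(f"density at center: Normal {peak_normal:.4f}, Laplace {peak_laplace:.4f}")
print(f"density at 1 SD:   Normal {shoulder_normal:.4f}, Laplace {shoulder_laplace:.4f}")
print(f"P(X > 3 SD):       Normal {tail_normal:.5f}, Laplace {tail_laplace:.5f}")
```

With the variances matched, the Laplace density is higher at the center and in the far tail, and lower around one standard deviation, which is the "tall narrow peak, long tails" shape described in the question.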

Is this observation relevant to you? It's hard to say without knowing any details. Keep in mind that quantile regression (QR) makes no assumptions about the error distribution, while linear regression (LR) assumes that the errors are Normal. Since QR makes fewer assumptions than LR, it would seem to be the "safer" choice. The flip side is that QR is not as efficient at estimating the median as LR is at estimating the mean. As usual, if we make more assumptions and those assumptions are reasonably satisfied, we get better estimates of the model parameters.
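The efficiency point can be illustrated with a small simulation. This is only a sketch under strong simplifications: there are no predictors, so LR reduces to the sample mean and median regression reduces to the sample median. The sample size, number of replications, seed, and function names are all arbitrary choices made for this example.

```python
import math
import random
import statistics


def draw_normal(rng, n, sigma=1.0):
    """Draw n Normal(0, sigma) errors."""
    return [rng.gauss(0.0, sigma) for _ in range(n)]


def draw_laplace(rng, n, sigma=1.0):
    """Draw n Laplace errors with standard deviation sigma."""
    b = sigma / math.sqrt(2.0)  # Laplace variance is 2 * b**2
    # The difference of two independent exponentials with mean b is Laplace(0, b).
    return [rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b) for _ in range(n)]


def sampling_variances(draw, n=101, reps=2000, seed=0):
    """Variance of the sample mean and of the sample median over repeated samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = draw(rng, n)
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pvariance(means), statistics.pvariance(medians)


var_mean_normal, var_median_normal = sampling_variances(draw_normal)
var_mean_laplace, var_median_laplace = sampling_variances(draw_laplace)

print(f"Normal errors:  var(mean) = {var_mean_normal:.5f}, var(median) = {var_median_normal:.5f}")
print(f"Laplace errors: var(mean) = {var_mean_laplace:.5f}, var(median) = {var_median_laplace:.5f}")
```

When the errors really are Normal, the median is the noisier estimate, which is the cost of making fewer assumptions. With Laplace errors the ordering reverses, so which estimator is "better" depends on how well the assumptions match the data.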

Finally, comparing QQ plots may not be the best way to choose between quantile and linear regression.

What are the assumptions for applying a quantile regression model?
When is quantile regression worse than OLS?
What are the advantages of linear regression over quantile regression?

Appendix

There is an interesting (but probably not relevant) connection between the Laplace distribution, mean absolute deviation from the median and quantile regression.

For the 0.5 quantile, QR minimizes the sum of absolute errors $\sum_i|e_i| = \sum_i|y_i - x_i\beta|$ (equivalently, the mean absolute error), and the solution is the (conditional) median. The mean absolute deviation from the median is also the maximum likelihood estimate of the Laplace scale parameter.
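A quick numeric check of this connection, as a sketch on simulated data: the Laplace log-likelihood is maximized by taking the sample median as the location and the mean absolute deviation from the median as the scale. The names `laplace_mle` and `laplace_nll`, the true scale of 2, and the seed are assumptions made for this example.

```python
import math
import random
import statistics


def laplace_mle(sample):
    """Closed-form Laplace MLE: (median, mean absolute deviation from the median)."""
    loc = statistics.median(sample)
    scale = statistics.fmean([abs(x - loc) for x in sample])
    return loc, scale


def laplace_nll(sample, loc, scale):
    """Negative log-likelihood of a Laplace(loc, scale) model."""
    n = len(sample)
    return n * math.log(2.0 * scale) + sum(abs(x - loc) for x in sample) / scale


# ASSUMPTION: simulated Laplace(0, 2) data, not the data from the question.
rng = random.Random(1)
true_scale = 2.0
sample = [
    rng.expovariate(1.0 / true_scale) - rng.expovariate(1.0 / true_scale)
    for _ in range(5000)
]

loc_hat, scale_hat = laplace_mle(sample)
best_nll = laplace_nll(sample, loc_hat, scale_hat)

# Moving either parameter away from the closed-form estimate should not
# lower the negative log-likelihood.
perturbed = [
    laplace_nll(sample, loc_hat + 0.1, scale_hat),
    laplace_nll(sample, loc_hat - 0.1, scale_hat),
    laplace_nll(sample, loc_hat, scale_hat * 1.1),
    laplace_nll(sample, loc_hat, scale_hat * 0.9),
]

print(f"location estimate (median): {loc_hat:.3f}")
print(f"scale estimate (MAD from median): {scale_hat:.3f}")
```

The location part is the same minimization that median regression performs: for a fixed scale, maximizing the Laplace likelihood means minimizing $\sum_i|y_i - \mu|$.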

dipetkov
  • 9,805
  • Thank you very much, this is a really helpful response! – Archie Aug 13 '22 at 16:38
  • My histogram and QQ plot are plotted using the residuals from the test set of a linear regression model; however, the shape of both is identical when plotted with the residuals of my quantile model instead.

    I don't quite understand the significance of the connection between the MAD from median and the Laplace scale parameter? Does this support the case for using a quantile approach to minimise errors for this kind of distribution?

    Many thanks.

    – Archie Aug 13 '22 at 16:54
  • Do you simulate data or use a "real" dataset? If you simulate data, what kind of errors do you use? To be honest, I don't think you should choose between linear regression or quantile regression based on this bit of (fun?) theory. The mean and the median have different properties as a measure of centrality. – dipetkov Aug 13 '22 at 17:03
  • Quantile regression doesn't make any distributional assumptions about the errors. On the other hand, it is less efficient, i.e., it requires more data. – dipetkov Aug 13 '22 at 17:04
  • This is a real dataset and the model has a real application, for which I need a sensible prediction interval to create plausible min / max bounds. – Archie Aug 13 '22 at 20:06
  • The reason I am trying quantile regression is that I can rather conveniently fit a model to the 5th and 95th quantiles and use these to produce a plausible prediction interval. The errors I am using are normalised RMSE. Thanks for your help. – Archie Aug 13 '22 at 20:07
  • I'm not sure what you mean by normalized RMSE. In the question you say you plot the residuals. – dipetkov Aug 13 '22 at 20:18
  • Sorry, I am plotting residuals. I am using normalised RMSE to compare model performance. – Archie Aug 13 '22 at 20:40
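
The idea raised in the comments, fitting one model to the 5th quantile and one to the 95th and using the pair as a 90% interval, can be sketched as follows. This is a toy illustration, not a recommended implementation: the data are simulated with one predictor, and the fit is a brute-force search that relies on the fact that, for a straight line, some optimal quantile fit passes through two of the data points. Real software solves the same problem far more efficiently. All function names here are hypothetical.

```python
import itertools
import random


def pinball_loss(residuals, tau):
    """Quantile (pinball) loss, the objective minimized by quantile regression."""
    return sum(tau * r if r >= 0 else (tau - 1.0) * r for r in residuals)


def line_loss(x, y, intercept, slope, tau):
    """Pinball loss of the line intercept + slope * x at quantile level tau."""
    return pinball_loss(
        [yk - (intercept + slope * xk) for xk, yk in zip(x, y)], tau
    )


def fit_quantile_line(x, y, tau):
    """Brute-force quantile fit: try every line through two data points."""
    best = None
    for i, j in itertools.combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a valid fit
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = line_loss(x, y, intercept, slope, tau)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# ASSUMPTION: simulated data with Laplace-like noise, not the asker's dataset.
rng = random.Random(42)
n = 50
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + (rng.expovariate(1.0) - rng.expovariate(1.0)) for xi in x]

lo_intercept, lo_slope = fit_quantile_line(x, y, 0.05)
hi_intercept, hi_slope = fit_quantile_line(x, y, 0.95)

# In-sample share of points between the two fitted lines (small tolerance
# because the fitted lines pass exactly through data points).
tol = 1e-9
inside = sum(
    lo_intercept + lo_slope * xi - tol <= yi <= hi_intercept + hi_slope * xi + tol
    for xi, yi in zip(x, y)
)
coverage = inside / n
print(f"in-sample coverage of the 5%-95% band: {coverage:.2f}")
```

The in-sample coverage is close to 90% by construction, so it says little about how the interval behaves on new data. Checking coverage on a held-out test set is the more informative diagnostic.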