
When I am developing a predictive statistical model, why do I need to ensure the error is normally distributed? (I have a very small statistical background, so I apologize in advance if this is a very, very basic question).

Johanna
  • Could you please cite where you read that this is important? I dispute the claim and want to see the context. – Dave Sep 09 '20 at 15:39
    Closely related threads, which provide many answers, include https://stats.stackexchange.com/questions/16381, https://stats.stackexchange.com/questions/148803, https://stats.stackexchange.com/questions/86835, https://stats.stackexchange.com/questions/395011, etc. – whuber Sep 09 '20 at 15:46
    Predictive model has somewhat different nuance than regression. Why close? – BigBendRegion Sep 09 '20 at 16:01
  • I second what @BigBendRegion said, but will nevertheless read the linked question and the other questions indicated in the comments closely. Thank you :) – Johanna Sep 09 '20 at 16:43
    The suggested answers really are not adequate, since the focus of predictive modeling is different, so I will provide answers in replies. – BigBendRegion Sep 09 '20 at 18:52
    The conditional distributions of the target variable do matter a great deal for predictive modeling. In the process of checking for normality, you may find very obvious indications of non-normality, indicating that alternative models and/or methods are needed.

    Examples:

    – BigBendRegion Sep 09 '20 at 18:53
  • The data are very discrete. In the most extreme case, the data have only two possible values, in which case you should be using logistic regression for your predictive model. Similarly, with only a small number of ordinal values, you should use ordinal regression, and with only a small number of nominal values, you should use multinomial regression.
  • – BigBendRegion Sep 09 '20 at 18:53
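    A minimal sketch of the binary case, using scikit-learn's `LogisticRegression` on synthetic data (the data-generating model here is invented purely for illustration):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))

    # Synthetic binary target: the outcome has only two possible values,
    # so logistic regression models P(y = 1 | X) rather than a normal mean.
    p = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
    y = rng.binomial(1, p)

    model = LogisticRegression().fit(X, y)
    probs = model.predict_proba(X)[:, 1]  # predictions are probabilities in [0, 1]
    ```

    The predictions are probabilities rather than unconstrained real numbers, which is exactly what a two-valued target calls for.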
  • The data are censored. You might realize, in the process of investigating normality, that there is an upper bound. In some cases the upper bound is not really data, just an indication that the true data value is higher. In this case, ordinary predictive models must not be used because of gross biases. Censored data models must be used instead.
  • – BigBendRegion Sep 09 '20 at 18:54
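    To see the bias censoring causes, here is a small NumPy simulation (all numbers invented for illustration): fitting ordinary least squares to values recorded at an upper bound attenuates the slope, which is the bias a censored-data model (e.g., a Tobit model) is designed to avoid.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=500)
    y_true = 2.0 * x + rng.normal(size=500)  # true slope is 2

    # Censoring: values above the cap are recorded at the cap,
    # so the recorded number is not the true data value.
    cap = 1.0
    y_obs = np.minimum(y_true, cap)

    slope_true = np.polyfit(x, y_true, 1)[0]  # close to 2
    slope_cens = np.polyfit(x, y_obs, 1)[0]   # attenuated toward zero
    ```

    An ordinary fit to the censored values systematically understates the slope, no matter how large the sample grows.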
  • In the process of investigating normality (e.g., using q-q plots) it may become apparent that there are occasional extreme outlier observations (part of the process you are studying) that will grossly affect ordinary predictive models. In such cases it would be prudent to use a predictive model that minimizes something other than squared errors, such as median regression, or (the negative of) a likelihood function that assumes heavy-tailed distributions. Similarly, you should evaluate predictive ability in such cases using something other than squared errors.
  • – BigBendRegion Sep 09 '20 at 18:54
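    A sketch of the outlier case, comparing least squares with median regression via statsmodels' `QuantReg` (synthetic heavy-tailed data with a handful of planted outliers, chosen only for illustration):

    ```python
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    x = rng.uniform(-2, 2, 200)
    y = 1.5 * x + rng.standard_t(df=2, size=200)  # heavy-tailed errors, true slope 1.5
    y[np.argsort(x)[-5:]] += 50                   # a few extreme outliers at large x

    X = sm.add_constant(x)
    ols_slope = sm.OLS(y, X).fit().params[1]            # dragged up by the outliers
    med_slope = sm.QuantReg(y, X).fit(q=0.5).params[1]  # far less affected
    ```

    Minimizing absolute rather than squared errors makes the fit insensitive to a few extreme observations, which is why median regression is a natural choice here.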