I am building a model to predict sales. My dataset contains sales (quantity) and around 60 features. The features are mostly weather variables (e.g. temperature, humidity, sunshine duration in hours) and binary calendar indicators (Monday 0 or 1, Tuesday 0 or 1, holiday 0 or 1).

I built a generalized linear model, but I am not sure which 'family' and 'link' to choose. Currently the call is model_1 <- glm(quantity ~ ., data = train_set), and my dependent variable (sales) is distributed as follows:

[Figure: distribution of pie sales]

I hope someone can tell me which family and link to choose.

  • It's difficult to say. This looks like over-dispersed count data, so you could try Poisson regression with a log link in the first instance, since it is possible that your regressors explain all of the over-dispersion. If, as seems likely, the dispersion remains too high, you could use negative binomial regression, which you can fit with glm.nb from the MASS package. However, this is a stats question rather than a programming question, and it belongs on CrossValidated rather than Stack Overflow. I have voted to migrate your question over there. – Allan Cameron Jun 14 '22 at 13:30
  • Please provide enough code so others can better understand or reproduce the problem. –  Jun 14 '22 at 13:59
  • Does this answer your question? "Why is Poisson regression used for count data?" Poisson regression with a log link is the usual starting point for this kind of analysis, although you might need to allow for dispersion beyond what a Poisson model expects, using a "quasi-Poisson" model or a negative binomial model. – EdM Jun 14 '22 at 15:23
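
The workflow suggested in the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are simulated, the columns temperature and holiday are hypothetical stand-ins for the roughly 60 real features, and the object names are made up for the example. Note that glm() without a family argument defaults to gaussian with an identity link, so the family has to be set explicitly for count data.

```r
# Sketch only: simulated stand-in data, NOT the asker's real dataset.
library(MASS)  # provides glm.nb()

set.seed(1)
n <- 500
train_set <- data.frame(
  temperature = rnorm(n, mean = 15, sd = 5),  # hypothetical weather feature
  holiday     = rbinom(n, 1, 0.1)             # hypothetical 0/1 indicator
)
# Over-dispersed counts: negative binomial draws with a log-linear mean
mu <- exp(1 + 0.05 * train_set$temperature + 0.5 * train_set$holiday)
train_set$quantity <- rnbinom(n, mu = mu, size = 2)

# Step 1: Poisson regression with a log link as the starting point
model_pois <- glm(quantity ~ ., data = train_set,
                  family = poisson(link = "log"))

# Step 2: estimate the dispersion as Pearson chi-square / residual df.
# Values well above 1 suggest over-dispersion relative to Poisson.
dispersion <- sum(residuals(model_pois, type = "pearson")^2) /
  df.residual(model_pois)

# Step 3a: quasi-Poisson keeps the same coefficients but rescales the
# standard errors by the estimated dispersion
model_qp <- glm(quantity ~ ., data = train_set,
                family = quasipoisson(link = "log"))

# Step 3b: negative binomial regression estimates an extra dispersion
# parameter (theta) by maximum likelihood
model_nb <- glm.nb(quantity ~ ., data = train_set)

print(dispersion)
print(AIC(model_pois, model_nb))  # quasi-Poisson has no likelihood-based AIC
```

If the dispersion estimate from the Poisson fit is close to 1 on the real data, the Poisson model may be adequate; otherwise the quasi-Poisson or negative binomial fit is the more defensible choice.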

0 Answers