
I have fitted a model and, given the data I have, I wasn't expecting it to be the best model ever, but my residuals are really strange. The outcome variable is the number of days going to a website in a month (so discrete, 0-30) and I have applied a standard linear regression:

model <- step(lm(day_count_jun ~ vars, data=ml), direction="forward") #forward-backward

I have several independent variables and I have chosen 5 of them (with the function step in R). The result is the same whether I include only these 5 variables or all the variables I have. The adjusted R² is 0.25. Here is the residual plot I have:

[image: residual plot]

Should I understand that one of my coefficients should have a negative sign? Besides, my residuals are not normal.

[image: histogram]

What can I do to solve this if I don't have more data?

DroppingOff

1 Answer

First, the use of step is irrelevant so I'll ignore that below.

Second, your dependent variable is integer valued and bounded above and below, so it's rather unlikely that your residuals would ever look Normal. You might therefore try a model that does not assume that they are, such as

mod.bin <- glm(cbind(day_count_jun, 30-day_count_jun) ~ vars, 
               family=binomial, data=ml)

or if you have mostly low counts, a Poisson assumption might be reasonable:

mod.pois <- glm(day_count_jun ~ vars, 
                family=poisson, data=ml)

And if you have a lot of zeros that you think are driven by a separate process, then a zero-inflated or hurdle version of one of these two models might be better. For what it's worth, that's what the histogram says to me.
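
One rough way to check for excess zeros is to compare the share of zeros in the data with the share a fitted Poisson model expects. The sketch below uses simulated data only: the data frame `sim`, the predictor `x` and the 40% extra-zero rate are made up for illustration and are not taken from the question.

```r
# Simulated data, purely illustrative (not the asker's data)
set.seed(1)
n <- 500
x <- rnorm(n)
counts <- rpois(n, lambda = exp(0.5 + 0.3 * x))
extra_zero <- rbinom(n, size = 1, prob = 0.4)   # assumed separate process forcing zeros
sim <- data.frame(x = x, y = ifelse(extra_zero == 1, 0, counts))

# Plain Poisson fit that ignores the extra zeros
mod.pois <- glm(y ~ x, family = poisson, data = sim)

# Share of zeros observed vs. the share the fitted Poisson model expects
obs_zero <- mean(sim$y == 0)
exp_zero <- mean(dpois(0, lambda = fitted(mod.pois)))
c(observed = obs_zero, expected = exp_zero)
```

If the observed share is clearly larger than the expected one, that points towards a zero-inflated or hurdle specification.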

One of those alternatives should get you a better model, or at least one whose forecasts will not be nonsensical.
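
To see what "nonsensical" can mean here, the following sketch compares predictions from a linear model and a binomial GLM on simulated data. Everything in it (the data frame `sim`, the predictor `x`, the coefficients used to simulate) is an assumption for illustration.

```r
# Simulated data, purely illustrative (not the asker's data)
set.seed(2)
n <- 500
x <- rnorm(n)
p <- plogis(-2 + 1.5 * x)                        # assumed daily visit probability
sim <- data.frame(x = x, days = rbinom(n, size = 30, prob = p))

mod.lm  <- lm(days ~ x, data = sim)
mod.bin <- glm(cbind(days, 30 - days) ~ x, family = binomial, data = sim)

# Predicted number of days at low, average and high values of x
new <- data.frame(x = c(-3, 0, 3))
pred.lm  <- predict(mod.lm, newdata = new)
pred.bin <- 30 * predict(mod.bin, newdata = new, type = "response")
rbind(lm = pred.lm, binomial = pred.bin)
```

The binomial predictions are 30 times a probability, so they stay between 0 and 30 by construction, whereas nothing stops the linear model from predicting a negative number of days or more than 30.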

Back to the residuals: even when you fit these models, the residuals will always look a bit bizarre relative to a model that assumes conditional Normality. The key is to look at the smoothing line rather than the dots.
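
A minimal sketch of such a plot in base R, again on simulated data (the data frame `sim` and the predictor `x` are made up). The dots fall into bands because the outcome is discrete; the thing to watch is whether the smoothed line stays close to zero.

```r
# Simulated data, purely illustrative (not the asker's data)
set.seed(3)
n <- 500
x <- rnorm(n)
sim <- data.frame(x = x, y = rpois(n, lambda = exp(0.5 + 0.4 * x)))
mod.pois <- glm(y ~ x, family = poisson, data = sim)

fit <- fitted(mod.pois)
res <- residuals(mod.pois, type = "pearson")
smooth <- lowess(fit, res)                       # smoothing line through the residuals

plot(fit, res, xlab = "Fitted values", ylab = "Pearson residuals")
lines(smooth, col = "red", lwd = 2)
abline(h = 0, lty = 2)                           # reference line at zero
```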

Finally, be a little careful as you add lagged variables in these models (as I see you have done from the graph). At the very least you'd want the log of day_count_april for an autoregressive model with a log link.
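
As a sketch of what a logged lag could look like with a Poisson model: the data below are simulated, and the use of `log1p()` (that is, log(1 + count)) is my own assumption to avoid taking the log of zero for months with no visits; it is only one of several ways to handle zeros.

```r
# Simulated data, purely illustrative (not the asker's data)
set.seed(4)
n <- 500
april <- rpois(n, lambda = 3)
june  <- rpois(n, lambda = exp(0.2 + 0.6 * log1p(april)))
sim <- data.frame(day_count_april = april, day_count_jun = june)

# Log link on the outcome, logged lag on the right-hand side
mod.lag <- glm(day_count_jun ~ log1p(day_count_april),
               family = poisson, data = sim)
coef(mod.lag)
```

With the lag on the log scale, its coefficient acts roughly like an elasticity: it describes how proportional changes in the earlier count relate to proportional changes in the expected June count.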

There's a good short discussion of all these things in Cameron's note on count regression.

  • Hi, thanks for your answer. I cannot use binomial because it is not a 0/1 outcome. And I cannot use Poisson because it is not counts, right? I want the number of days connected, not the number of people connecting. Any other ideas of what I should use? – DroppingOff Aug 11 '14 at 13:31
  • Good to know the residuals will always look normal, but when there is a trend they are usually telling something else. I would like to know what they are telling in my case. What is a lagged variable? – DroppingOff Aug 11 '14 at 13:35
  • Wrong about the Binomial. Example: if a person has a constant 0.1 chance of visiting a website each day then the number of times they visit in June is Binomially distributed with p=0.1 and n=30. – conjugateprior Aug 11 '14 at 13:36
  • If your dependent variable is how many days in a month a person goes to a website it's a count. – conjugateprior Aug 11 '14 at 13:38
  • 'Monthly count' is the variable. 'Lagged' means taking a variable from the previous time step. So a lagged monthly count is the count from a previous month. The distance is given by a number, e.g. Lag 0 is just June's counts themselves, Lag 1 is May (one month back), Lag 2 is April (2 months back), etc. – conjugateprior Aug 11 '14 at 13:42
  • I'm guessing it's a typo on your part but my point was that the residuals will (almost) never look normal for this sort of data. – conjugateprior Aug 11 '14 at 13:44