Let's say we have "number of accidents" as the response in a Poisson regression model. One of the predictors is "number of days." Naturally, we expect more accidents to occur over more days, so it makes sense to treat this as a rate model and consider "number of accidents per day."

So we take $$\log \frac{\text{accidents}}{\text{days}} = \log(\text{accidents}) - \log(\text{days})$$ and end up with a model like $$\log(\text{accidents}) \sim \log(\text{days}) + \ldots + (\text{other predictors})$$

Typically we would expect the coefficient on log(days) to be close to 1, at least according to my reading thus far: We expect the occurrence of interest to be proportional to the amount of time. As such, we would "offset" log(days) in our stats software, forcing the coefficient to be 1.

Suppose, however, we don't offset "days," and the parameter ends up being something nowhere near 1. What does that mean (if anything)? Does this suggest that it is inappropriate to use the offset because there isn't a constant rate of events of interest per time period?

Say, for instance, our statistics software tells us the coefficient for log(days) is 3. That suggests a multiplicative effect of $e^3(days)$ on number of accidents, if I understand Poisson regression coefficients correctly. Does this mean it would be inappropriate to treat "days" as a component of a rate? Would we instead need to investigate some other time-related component that could impact the response? Or would we offset "days" with a coefficient different from 1?

dlid

1 Answer

For your dataset, I don't think this can be answered without more context. Why does your variable days vary? If we start from a dataset of individual accidents, you presumably obtained your data by counting the number of accidents per day, per week, and so on, and that would naturally give a dataset where the accumulation period is constant. You would need to tell us why you didn't do that.

As a thought experiment, suppose your data cover a long period of time, and the oldest data were reported only as totals over some longer period (months?), while newer data are accumulated daily. Then your days variable will also be a stand-in for the (unmodeled) variable date. A coefficient larger than one in such a case would then indicate that the accident rate was higher long ago. This is just an example of the kind of analysis you need to do.

By the way, you can include days both as an offset and as a normal predictor variable in the same model; see "Can a variable be used both as an offset and an independent variable?"
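
As a rough illustration of that last point, here is a minimal pure-Python sketch on simulated data. The rate of 0.2 accidents per day, the sample size, and the helper names `poisson_draw` and `fit_poisson` are all my own assumptions, not anything taken from your data. With $\log(days)$ entered as both offset and predictor, the model is $\log \mu = \log(days) + \beta_0 + \gamma \log(days)$, so the coefficient you would get without the offset is $1 + \gamma$, and the mean is proportional to $days^{1+\gamma}$. A $\gamma$ near 0 is what a constant rate per day would look like.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw one Poisson variate with Knuth's method (fine for small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def fit_poisson(x, y, offset, iters=25):
    """Newton-Raphson fit of log(mu) = offset + b0 + b1 * x (hypothetical helper)."""
    # Start from the pooled rate, with no slope.
    b0 = math.log(sum(y) / sum(math.exp(o) for o in offset))
    b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, oi in zip(x, y, offset):
            mu = math.exp(oi + b0 + b1 * xi)
            g0 += yi - mu            # score for the intercept
            g1 += (yi - mu) * xi     # score for the slope
            h00 += mu                # entries of the 2x2 information matrix
            h01 += mu * xi
            h11 += mu * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Simulated data (assumption): a constant rate of 0.2 accidents per day.
rng = random.Random(0)
days = [rng.randint(1, 30) for _ in range(2000)]
accidents = [poisson_draw(rng, 0.2 * d) for d in days]
log_days = [math.log(d) for d in days]

# (a) log(days) as an ordinary predictor, no offset.
_, slope_free = fit_poisson(log_days, accidents, [0.0] * len(days))

# (b) log(days) as an offset AND as a predictor.
_, slope_extra = fit_poisson(log_days, accidents, log_days)

print("free coefficient on log(days):", round(slope_free, 3))
print("extra coefficient beyond the offset:", round(slope_extra, 3))
```

In practice you would use your stats software's GLM routine rather than a hand-rolled Newton loop. The point is only that the two parameterizations describe the same fit, so testing the extra coefficient against 0 is a test of whether accidents really are proportional to days.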