
How do you apply count models to data that are counts in nature but rates in reality? R can handle this to a certain extent, depending on the model, but what is the correct way to model a rate response with count models?

Data & Model

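library(tibble)  # tibble()
library(MASS)    # glm.nb()

# Toy data with the response given both as a rate and as a count
df <- tibble(dependent_rate = c(5.2, 3.4, 7.8, 9.5),
             dependent_count = c(5, 3, 7, 9),
             pred1 = c(1, 2, 3, 4),
             pred2 = c(1, 2, 1, 2),
             pred3 = c(1, 1, 2, 2))

# Model 1: rate as the response (throws a warning)
glm.nb(dependent_rate ~ pred1 + pred2 + pred3, data = df)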

Model 1 (implemented in R above) throws a warning. Ideally model 2 should be used, but it is unclear how to use dependent_count as the response variable while accounting for the rates.

[Image: definitions of model 1 and model 2]

Therefore my questions/possible solutions to this are:

  1. Apply weights to model 2. If so, how would I do this? Do I simply add weights = dependent_rate in the function call?
  2. Add an offset term to model 2. If so, how? I would like to make predictions with this model; would I need to add a column in newdata for my offset term?
Ali

1 Answer


Here is a way:

glm.nb(dependent_count ~ pred1 + pred2 + pred3 + 
       offset(log(dependent_count/dependent_rate)), data = df)

A detailed explanation is in the post "Goodness of fit and which model to choose linear regression or Poisson". This works because the default link function used by MASS::glm.nb is the log link. The explanation there is for Poisson regression, but it applies equally to negative binomial regression. Search this site for "rate regression" for similar posts.
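As a short sketch of why the offset works under a log link (the notation is an assumption introduced here, not taken from the question: Y_i is the count, t_i the known denominator, r_i = Y_i / t_i the rate, and x_i the predictors):

```latex
\begin{align}
\log \mathbb{E}[Y_i] &= x_i^\top \beta + \log t_i \\
\iff \quad \log \frac{\mathbb{E}[Y_i]}{t_i} &= x_i^\top \beta
\end{align}
```

The offset log t_i enters the linear predictor with its coefficient fixed at 1, so the regression coefficients describe the expected rate rather than the expected count. In the call above, t_i is recovered as dependent_count/dependent_rate.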

Prediction with an offset should not present special difficulties. To calculate a rate you need the count and the denominator. The denominator of a rate is usually a length of time, the area of some (administrative?) region, a population, and so on, but it should be known. So just use it in the offset, as with any other known predictor.
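A minimal sketch of prediction with an offset, under stated assumptions: the data are simulated, and the column names sim, exposure and count are hypothetical stand-ins for a data set in which the denominator is stored as its own column. Setting the denominator to 1 in newdata gives a predicted rate per unit of exposure; any other known denominator gives a predicted count.

```r
library(MASS)  # provides glm.nb()

set.seed(1)
n <- 200

# Simulated data (assumption): 'exposure' plays the role of the known denominator
sim <- data.frame(
  pred1    = runif(n, 1, 4),
  exposure = runif(n, 0.5, 2)
)
mu <- sim$exposure * exp(0.5 + 0.3 * sim$pred1)
sim$count <- rnbinom(n, size = 5, mu = mu)

# Count response with log(exposure) as offset, so coefficients describe the rate
fit <- glm.nb(count ~ pred1 + offset(log(exposure)), data = sim)

# New observation with exposure = 1: the prediction is a rate per unit exposure
new_rate  <- data.frame(pred1 = 4, exposure = 1)
pred_rate <- predict(fit, newdata = new_rate, type = "response")

# Same predictors with exposure = 10: the prediction is an expected count
new_count  <- data.frame(pred1 = 4, exposure = 10)
pred_count <- predict(fit, newdata = new_count, type = "response")
```

Because the offset is part of the model formula, predict() looks for the exposure column in newdata, so it has to be supplied there like any other predictor.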

  • Thanks for your answer. When I want to predict with new data, I need to specify the value for 'dependent_count'. This is the response which I am trying to predict and so I cannot specify the value for this in my new data. – Ali Feb 23 '21 at 16:49
  • But you must know the denominator for calculating the rate, even if it was not given in your data? – kjetil b halvorsen Feb 23 '21 at 16:52
  • I'm not sure I follow. In my example above, let's say I want to predict: pred1 = 4, pred2 = 1, pred3 = 1. In my original data, this observation does not exist and so I do not know the denominator for it. This is a new observation for which I'm trying to predict the rate. How would I know the denominator beforehand? I would know the denominator for the cases which are in my data. – Ali Feb 23 '21 at 16:55
  • This looks like the way forward? https://m-clark.github.io/posts/2020-06-15-predict-with-offset/ – Ali Feb 23 '21 at 17:05
  • I will add some to the answer! – kjetil b halvorsen Feb 23 '21 at 18:08