
I have several binomial (logit) models that predict the probability of various binary outcomes (e.g., incurring any costs for health care treatment, having a chronic disease). My case and control groups are matched on average length of time enrolled in the study, but the time any given individual is enrolled can be highly variable, from 30 days to 9 years. I want to take this into account in the models and have read several explanations of whether, when, and how it is (or isn't) appropriate to include an offset in a binomial model (e.g., Using offset in binomial model to account for increased numbers of patients), but I am ultimately confused by what the best approach is.

Taking the example of one of my binary outcomes, the regression model

glm(binary_outcome ~ enrolled_days, data = df, family = "binomial")

gives an unexponentiated coefficient of 0.0004 for enrolled_days and an exponentiated coefficient of 1.0004, which to me indicates that I should use it as an offset if I believe the probability of the outcome increases proportionally to the number of days enrolled in the study. I think this is a fairly reasonable assumption in my case.

Am I correct in coming to this conclusion? If so, would I incorporate the variable enrolled_days as

glm(binary_outcome ~ predictors + offset(enrolled_days), data = df,
    family = "binomial")

? When I do this with my data, I get a warning message that the algorithm failed to converge and that fitted probabilities of 0 or 1 occurred. I don't understand why this would happen, since I have a large sample (~55,000 people), and the average enrolled_days (and the min and max) is the same for both cases and controls.

1 Answer


You observe whether or not an event occurred, but the length of the observation interval varies. If you model the events as a Poisson process but only observe whether there were zero events or at least one (never the actual count), this can be modelled as a binary binomial regression, though not a logistic one.

Use the complementary log-log link function; in R that would be something like

    glm(Y ~ 1 + offset(log(exposure)), family = binomial(link = "cloglog"))
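
Adapting this to the question's setup, a minimal sketch (assuming enrolled_days is each person's exposure time and predictors stands in for the other covariates in df, as in the question) would be:

    # Sketch only: cloglog link, with log enrollment time as the offset.
    # `predictors` is a placeholder for the question's actual covariates.
    glm(binary_outcome ~ predictors + offset(log(enrolled_days)),
        family = binomial(link = "cloglog"), data = df)

Note that the offset enters as log(enrolled_days), not the raw number of days; the derivation below shows why.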

Let the intensity (also called the hazard) of the Poisson process be $\lambda$. Then the probability of zero events in a time interval of length $t_i$ is $e^{-\lambda t_i}$, so the (binomial) probability of 1 or more events is $p_i = 1-e^{-\lambda t_i}$. The complementary log-log link function is $$ \eta_i =\log( -\log(1-p_i)) = \log(\lambda)+\log(t_i), $$ so when you model the linear predictor on the link scale, as usual with GLMs, you have to use log(exposure) as an offset, and the estimated intercept is an estimate of $\log\lambda$.
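
As a quick check of this relationship, one can simulate such a process and verify that the fitted intercept recovers $\log\lambda$. This is a minimal sketch; the intensity lambda = 0.01 and the uniform exposure times are arbitrary illustrative choices:

    set.seed(1)
    n      <- 10000
    lambda <- 0.01                      # true intensity of the Poisson process
    t_i    <- runif(n, 30, 3000)        # varying exposure times (e.g. days enrolled)
    counts <- rpois(n, lambda * t_i)    # latent event counts, never observed directly
    y      <- as.integer(counts > 0)    # we only observe "any event" vs. "none"
    fit    <- glm(y ~ 1 + offset(log(t_i)),
                  family = binomial(link = "cloglog"))
    coef(fit)   # intercept should be close to log(lambda) = log(0.01)

The exponentiated intercept is then an estimate of the intensity $\lambda$ itself.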

This is also discussed and explained at Modelling a binary outcome when census interval varies. Some other posts here have examples of the cloglog link function; search this site for more.

    This is an illuminating answer. I searched the site and have two questions not addressed elsewhere. They are either good fodder for a revision/addition (if you agree) or another site question: 1. what's the interpretation of the final (exponentiated) parameter? a relative rate or an odds ratio? 2. How does one assess the reasonableness of the constant hazard assumption? One conceptualizes that a one-and-done kind of exposure, such as contracting HIV, is not the same as modeling independent interarrival times. Maybe I misunderstood! – AdamO Mar 25 '24 at 17:50
  • @AdamO: Thanks! I will add something for 1.; for 2. it is better to ask a new question. Some posts I found: https://stats.stackexchange.com/questions/631711/testing-the-proportional-hazards-assumption-with-a-time-varying-covariate and https://stats.stackexchange.com/questions/560975/how-to-interpret-schoenfield-residual-plot – kjetil b halvorsen Mar 26 '24 at 02:02