
I have some data where I follow some people for 365 days. I have a response variable that is either 1 or 0. The response variable is an event that may or may not happen between day 1 and day 365. If the event happens (1), I know at what day it happens. I've been doing logistic regression with success.

Now the data has changed, and some of the customers suddenly have varying exposure (less than 365 days). For example, some people are only followed for 180 days. I still know whether an event happens within the 180 days of exposure, but whether an event happens after day 180 is unknown.

My manager says that I can still do logistic regression; I just have to take the varying exposure into account by including the exposure as an explanatory variable in the model. This feels wrong to me, and I say that this is what Cox regression/survival analysis is made for, but I can't really argue for why the suggested way is wrong.

Is my manager wrong, and if so: could somebody please explain why?

Erosennin
  • Please edit the question to say more about the nature of the response variable. For example, can it switch back and forth between 0 and 1 for the same individual over time, or is it always at 1 once it reaches that value? If the latter, do you know the time of the occurrence of the switch from 0 to 1, or just that it happened some time within the 365 days of observation? – EdM Jan 16 '24 at 14:24
  • @EdM done! Can't switch back and forth. If an event (1) happens, it is always 1. Imagine this is somebody dying or defaulting. And yes, we know the time of occurrence. – Erosennin Jan 16 '24 at 20:37

1 Answer


Why is Logistic Regression Wrong?

If you had observed all subjects for the same amount of time, logistic regression would be fine. The censoring -- not knowing what happens after 180 days -- is the problem.

Think of an extreme case. Suppose we want to estimate the risk of death, but our sample consists mostly of children, whom we have observed only for a short time and who typically go on to live into adulthood. Their short observation time means they contribute an excess of 0's to our estimate of the risk of death, so that estimate is inappropriate to apply to the entire population: the children have simply not had enough time to accrue risk of death from much of anything. Adults, by contrast, have had more opportunities to die from crossing roads, smoking cigarettes, having plaque accumulate in their arteries, and so on. Observing all subjects for the same amount of time means they have had (all else being equal) the same time to accumulate risk.
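To make the excess-of-0's point concrete, here is a minimal simulation sketch. The numbers are my own assumptions, not from the question: event times are exponential with a mean of 400 days, and half of the subjects are followed for only 180 days. It compares the proportion of observed events, ignoring follow-up length entirely, against the true 365-day risk.

```r
set.seed(1)
N <- 10000
raw_tm <- rexp(N, 1/400)                       # true event times (assumed mean: 400 days)
follow <- ifelse(runif(N) < 0.5, 180, 365)     # assumed: half are followed for 180 days only
event  <- as.numeric(raw_tm <= follow)         # 1 only if the event is seen during follow-up

true_risk  <- 1 - exp(-365/400)                # true P(event within 365 days)
naive_risk <- mean(event)                      # counts every censored subject as a non-event
full_risk  <- mean(raw_tm <= 365)              # what full 365-day follow-up would have shown

round(c(true = true_risk, naive = naive_risk, full_followup = full_risk), 3)
```

The naive proportion should come out noticeably below the true 365-day risk, because everyone censored at day 180 is recorded as a 0 even though some of them would have had the event by day 365.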

How Can We Analyze Such Data?

You could use a Cox model, but you could also fit a Poisson regression with robust covariance estimation and use the log of exposure time as an offset, as is explained here. The {survival} R package has a function called pyears which tabulates person-years of exposure and event counts. The docs for pyears show an example of how to use the exposure time as an offset, but they do not show how to estimate robust covariance in that example.

Let's see how to do this with an example. First, let's simulate some data. Assume we've randomly assigned patients to two groups. Group membership is indicated by trt.

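N <- 1000
trt <- rbinom(N, 1, 0.5)                  # random assignment to two groups
raw_tm <- rexp(N, 1/(100 - 50*trt))       # event times: mean 100 days (trt=0), 50 days (trt=1)
cens_tm <- 365                            # follow-up ends at day 365
tm <- ceiling(pmin(raw_tm, cens_tm))      # observed time in whole days
event <- as.numeric(raw_tm < cens_tm)     # 1 if the event occurred during follow-up

d <- data.frame(tm, event, trt)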

Now, let's use pyears to count the outcomes in each group and the total person years contributed.

library(survival)
p <- pyears(Surv(tm/365, event) ~ trt, data = d, scale = 1, data.frame = TRUE)

p$data

  trt    pyears   n event
1   0 145.32877 519   503
2   1  67.09041 481   481

There were 503 events in the 145 person years contributed by the trt=0 group. That's a rate of 503/145.33 ~ 3.46 events per person year. This means that if I followed 10 people for a year, I would expect 10 people x 1 year x 3.46 = 34.6 events. The same holds if I had followed 5 people for 2 years, or 20 people for 6 months.
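The same arithmetic can be checked directly from the tabulated counts. This is just a sketch that retypes the numbers from the pyears output above:

```r
# Event counts and person-years copied from the pyears table above
events <- c(trt0 = 503, trt1 = 481)
pyrs   <- c(trt0 = 145.32877, trt1 = 67.09041)

rate <- events / pyrs                     # events per person-year in each group
expected_10 <- unname(rate["trt0"]) * 10  # expected events in 10 person-years, trt=0
rate_ratio  <- unname(rate["trt1"] / rate["trt0"])

round(rate, 2)
round(expected_10, 1)
round(rate_ratio, 2)
```

The rate for trt=0 is about 3.46 per person-year, and the ratio of the two rates is about 2.07.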

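Now, on to the regression:

library(lmtest)    # provides coeftest()
library(sandwich)  # provides the sandwich() robust covariance estimator

fit <- glm(event ~ trt + offset(log(pyears)),
           data = p$data,
           family = poisson)

coeftest(fit, vcov. = sandwich)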

z test of coefficients:

              Estimate Std. Error    z value  Pr(>|z|)    
(Intercept) 1.2416e+00 2.5167e-32 4.9334e+31 < 2.2e-16 ***
trt         7.2823e-01 8.2724e-16 8.8031e+14 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

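The base rate of events in the trt=0 group per 1 person year is exp(1.24) ~ 3.46 -- which is exactly what we saw. We can also see that the rate of events per 1 person year in the trt=1 group is exp(0.73) ~ 2.07 times as large as in the trt=0 group. That also makes sense from the pyears tabulation we saw above. Do not read anything into the standard errors in this output, though: with two aggregated rows and two coefficients the model is saturated, so the residuals are zero and the sandwich estimator collapses to numerically zero.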
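For standard errors that reflect sampling variability, the same Poisson model can be fit to individual-level rows, where the residuals are not all zero. The sketch below re-simulates data like the example above (the seed is my own choice, so the counts will differ from the table) and computes an HC0-style sandwich covariance by hand in base R. The hand-rolled formula is my illustration; coeftest(fit, vcov. = sandwich) on the same individual-level fit should give the same kind of answer.

```r
set.seed(42)
N <- 1000
trt    <- rbinom(N, 1, 0.5)
raw_tm <- rexp(N, 1/(100 - 50*trt))
tm     <- ceiling(pmin(raw_tm, 365))
event  <- as.numeric(raw_tm < 365)
d <- data.frame(tm, event, trt)

# One row per person; the offset is that person's own exposure in years
fit_ind <- glm(event ~ trt + offset(log(tm/365)), data = d, family = poisson)

# HC0-style sandwich for a Poisson GLM with log link:
#   bread = (X' diag(mu) X)^-1,  meat = X' diag((y - mu)^2) X
X  <- model.matrix(fit_ind)
mu <- fitted(fit_ind)
bread <- solve(crossprod(X * sqrt(mu)))
meat  <- crossprod(X * (d$event - mu))
robust_se <- sqrt(diag(bread %*% meat %*% bread))

cbind(estimate = coef(fit_ind), robust_se)
```

The point estimates are the same ones the aggregated fit would give for these data (events divided by person-years in each group), but the standard errors are now positive and usable.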

As for the 180 day cutoff: those observations are censored at 180 days. That is still informative, because we know that those individuals did not get the outcome within 180 days. Their person years and 0 events still count towards the regression.
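Here is a rough sketch of that with mixed follow-up, under assumptions of my own: a constant hazard with a mean event time of 400 days for trt=0 and 200 days for trt=1, and follow-up windows of either 180 or 365 days assigned at random. Each person contributes their observed time as the offset, which is the individual-level version of what pyears tabulates.

```r
set.seed(7)
N <- 5000
trt    <- rbinom(N, 1, 0.5)
raw_tm <- rexp(N, 1/(400 - 200*trt))               # assumed means: 400 days (trt=0), 200 days (trt=1)
follow <- sample(c(180, 365), N, replace = TRUE)   # assumed follow-up windows
tm     <- pmin(raw_tm, follow)                     # observed time: event or end of follow-up
event  <- as.numeric(raw_tm <= follow)             # censored subjects contribute 0 events

# Poisson regression with log person-years as the offset
fit_cens <- glm(event ~ trt + offset(log(tm/365)), family = poisson)

est <- exp(coef(fit_cens))
round(est, 2)    # baseline rate per person-year, then the rate ratio for trt

true_rate0 <- 365/400   # true baseline rate per person-year under these assumptions
true_ratio <- 2         # true rate ratio under these assumptions
```

Under these assumptions the estimates should land close to the true values, even though half of the sample is only observed for 180 days.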

  • Thanks for your answer! Regarding your "Why is Logistic Regression Wrong?" section, I don't understand how this explains why it is wrong to use logistic regression and account for the exposure with an explanatory variable. My manager would say just use the age of the people as an explanatory variable (or offset). What is wrong with that? And thank you very much for the thorough answer in "How Can We Analyze Such Data?", much appreciated <3 – Erosennin Jan 17 '24 at 07:08
  • @Erosennin even if adjusting for exposure time in a logistic regression were right, you still have the problem of censoring. Censored observations are not actually 0s -- had you observed them the entire 365 days, some of them might actually get the outcome. By using a logistic regression, you're not estimating the risk correctly. Simulate this yourself and see how risk predictions from a logistic regression differ from the true survival probability. – Demetri Pananos Jan 17 '24 at 15:17
  • To play devil's advocate: I could argue they are actually 0s at the point of censoring, so why would that be wrong? I know some of them would get the outcome if I followed them longer, but so would the people that I only observe for 365 days. Follow anyone longer and the risk increases. But I would account for that using the exposure. – Erosennin Jan 17 '24 at 16:21
  • I could simulate, but I would like to help my intuition understand it better, so I can explain it to my manager verbally :) – Erosennin Jan 17 '24 at 16:56
  • @Erosennin OK, so it turns out this CAN be done, however you need to let the term for the length of exposure be sufficiently flexible. This paper by Efron demonstrates how – Demetri Pananos Jan 19 '24 at 22:02
  • @Erosennin. The TL;DR is that logistic regression can be used to model the hazard function as opposed to the survival function. The hazard, especially in SaaS or tech based problems, is typically non-linear and doesn't look anything like the logistic function. But, with enough flexibility, you can successfully model the hazard and then use that output to retrieve the survival function should you need that. – Demetri Pananos Jan 20 '24 at 02:54
  • There are other gotchas to be careful of (e.g. correlation between survival time and other factors, confounding more generally speaking) but it is possible to use logistic regression in these scenarios. – Demetri Pananos Jan 20 '24 at 02:55
  • Huh, I'm surprised that it can actually be done. That was counter-intuitive for me. I appreciate that you turned 180 on this one, thanks, proof of intellectual honesty! So, modeling the exposure using a spline, should make it very flexible, right? Very good point about the gotchas, thank you! – Erosennin Jan 22 '24 at 07:56
  • the link you provided doesn't work for me, which paper is it? – Erosennin Jan 22 '24 at 07:57
  • Found it: "Logistic Regression, Survival Analysis, and the Kaplan-Meier Curve" – Erosennin Jan 22 '24 at 08:38
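
The approach discussed in the comments above (logistic regression on the hazard, with a flexible term for time) can be sketched roughly as follows. Everything specific here is an assumption for illustration: Weibull event times, follow-up windows of 180 or 360 days so that they line up with 30-day intervals, and a natural spline with 3 degrees of freedom as the flexible term. Each person contributes one row per interval at risk, and the survival curve is recovered from the fitted hazards.

```r
library(splines)

set.seed(3)
N <- 4000
raw_tm <- rweibull(N, shape = 1.5, scale = 400)    # assumed event-time distribution
follow <- sample(c(180, 360), N, replace = TRUE)   # assumed follow-up windows
tm     <- pmin(raw_tm, follow)
event  <- as.numeric(raw_tm <= follow)

width <- 30                                        # assumed interval width in days
n_int <- ceiling(tm / width)                       # intervals each person is at risk

# Person-period data: one row per person per interval at risk
pp <- data.frame(id = rep(seq_len(N), n_int), interval = sequence(n_int))
# y = 1 only in the interval where the event actually happened
pp$y <- as.numeric(pp$interval == n_int[pp$id] & event[pp$id] == 1)

# Logistic regression for the discrete-time hazard, flexible in time
fit_h <- glm(y ~ ns(interval, df = 3), data = pp, family = binomial)

grid   <- data.frame(interval = 1:(360 / width))
hazard <- predict(fit_h, newdata = grid, type = "response")
surv   <- cumprod(1 - hazard)                      # S(t) at the end of each interval

true_surv_360 <- exp(-(360/400)^1.5)               # true S(360) under the assumed Weibull
round(c(estimated = surv[12], true = true_surv_360), 3)
```

Subjects censored at day 180 simply stop contributing rows after interval 6 and never contribute a 1, which is how this setup avoids treating them as non-events for the full period.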