
I have a dataset that looks more or less like this (interval-censored outcome, multiple screening events per patient, not synchronised):

ID <- c(1,1,1,1,2,2,2,2,3,3,3)
sex <- c(rep("F", 8), rep("M", 3))
screen_time_since_diagnosis <- c(48, 400, 760, 900, 20, 360, 800, 1400, 0, 400, 600)
outcome <- c(0,0,0,1,0,0,0,0,0,0,1)

# data.frame() keeps the numeric columns numeric; cbind() would coerce everything to character first
diabetes <- data.frame(ID, sex, screen_time_since_diagnosis, outcome, stringsAsFactors = TRUE)

I am running a generalised linear model with a complementary log-log link function, like this:

model <- glm(outcome ~ sex + screen_time_since_diagnosis,
    data = diabetes, 
    family = binomial(link = "cloglog"))

However, risk calculation is not my end goal. I want to know the time at which the risk of having an event for an individual patient equals a certain threshold. In Cox PH this can be done with the quantile.survfit() method. There is a package called ciTools that allows one to calculate quantiles of the response variable of a glm object, but it doesn't work for binary outcomes; when I try to run it I get: Error in add_quantile.glm(df = newdata, model, p = 0.2) : Prediction intervals for Bernoulli response variables aren't useful.

  1. Why are prediction intervals for Bernoulli response variables not useful?
  2. Is there any other way to calculate this, to get output similar to what quantile() gives on survfit or Surv objects?

Many thanks!

Wojty

1 Answer


The first question is straightforward. A Bernoulli response can only take the values 0 and 1, so once you know the predicted probability $p$ you know its entire distribution; the variance is necessarily $p(1-p)$. A prediction interval for the response itself therefore carries no extra information, and the simulation approach of ciTools has nothing to add. What you probably want instead is a confidence interval around the predicted probability: compute Wald confidence intervals around the linear-predictor value determined by the covariate values and their coefficients, then transform back to the probability scale. The emmeans package can help with that.
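As a minimal sketch of that idea (reusing model and diabetes from the question; the covariate values in newdata are arbitrary), you can get Wald limits on the link scale from predict() and back-transform through the inverse cloglog:

newdata <- data.frame(sex = "F", screen_time_since_diagnosis = 400)

## standard errors on the link (linear-predictor) scale
pr <- predict(model, newdata = newdata, type = "link", se.fit = TRUE)

## inverse of the cloglog link: p = 1 - exp(-exp(eta))
inv_cloglog <- function(eta) 1 - exp(-exp(eta))

## back-transform the Wald limits to the probability scale
ci <- inv_cloglog(pr$fit + c(-1.96, 0, 1.96) * pr$se.fit)
names(ci) <- c("lower", "estimate", "upper")
ci

emmeans performs essentially the same back-transformation when you ask it for type = "response".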

The second question is more complicated. A discrete-time survival model directly evaluates the probability of an event during each time interval, conditional on having been at risk during that interval, as a function of the covariates in place. That's the discrete-time hazard, $h_{t_i}$. The survival curve is the product of the complements of those discrete-time hazards:

$$S(t) =P(T>t)=\prod_{t_i \le t}(1-h_{t_i})$$

I want to know the time at which the risk of having an event for an individual patient is equal to a certain threshold.

For that, you need to use a predict() function on the model, with new data that represent a complete person-period record for the individual. That is, the new data need to include a separate row for the hypothesized individual at each event time in the model. If you ask for the probability prediction based on those new data, you get a vector of hazards over time. Then apply the above formula (or the equivalent estSurv() function in the discSurv package) to get survival over time. Solve for the time at which the survival function equals the desired quantile.
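A minimal sketch, using hypothetical names: dt_model is a discrete-time binomial model fitted on properly formatted person-period data (see below), and event_times holds the distinct event times (600 and 900 in your example data):

event_times <- c(600, 900)

## person-period record for one hypothetical individual: one row per event time
new_person <- data.frame(sex = "F", screen_time_since_diagnosis = event_times)

## vector of discrete-time hazards over time
haz <- predict(dt_model, newdata = new_person, type = "response")

## survival curve: cumulative product of the complements of the hazards
surv <- cumprod(1 - haz)

## first event time at which survival reaches the desired quantile,
## e.g. the time by which event risk reaches 20% (survival drops to 0.8)
threshold <- 0.8
event_times[which(surv <= threshold)[1]]

In a real discrete-time model the time effect is usually entered as a factor or a smooth function of the interval rather than as the linear term in the question's model, but the cumulative-product step is the same.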

That approach can be extended by similar evaluation of upper and lower confidence limits at each event time and finding when those confidence intervals intersect with the desired quantile. That's how quantile.survfit() calculates CI for a quantile.
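Continuing the sketch above (same hypothetical names, with inv_cloglog() as defined earlier; the pointwise Wald limits at each event time are only approximate):

## Wald limits on the link scale at each event time, back-transformed to hazards
pr <- predict(dt_model, newdata = new_person, type = "link", se.fit = TRUE)
haz_lo <- inv_cloglog(pr$fit - 1.96 * pr$se.fit)
haz_hi <- inv_cloglog(pr$fit + 1.96 * pr$se.fit)

## a lower hazard limit gives the upper survival limit, and vice versa
surv_hi <- cumprod(1 - haz_lo)
surv_lo <- cumprod(1 - haz_hi)

## times at which the survival limits cross the desired quantile
c(lower = event_times[which(surv_lo <= threshold)[1]],
  upper = event_times[which(surv_hi <= threshold)[1]])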

Your data and non-synchronized follow-up times

The above holds for a properly constructed discrete-time survival model. The way you formatted your data isn't correct for that, and it's not clear that your type of data can be handled adequately by a discrete-time survival model.

Discrete-time survival data analyzed with a glm() binomial model need to be in person-period format. You need to start with all of the event times in the data, then make sure that each individual (while at risk) has a separate row in the data for all of those event times, with the time, status, and covariate values included in the row.

Your data format doesn't do that. Consider the event at time = 600 for ID3. Model evaluation at time = 600 needs to take into account the fact that both ID1 and ID2 are also at risk at that time, but neither individual has a data row for that value of time. Thus glm() has no way of knowing to include those individuals in the evaluation. Similarly for the event at time = 900 for ID1, glm() has no way of knowing to include the at-risk ID2 in the evaluation. You need to put your data into the correct person-period format.
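As a sketch of that reformatting with survSplit() from the survival package (this first collapses each individual to one row, which assumes that the event, when present, occurred at the final screening time; that assumption glosses over exactly the interval-censoring ambiguity discussed next):

library(survival)

## one row per individual: follow-up ends at the last screen,
## event indicates whether an event was ever recorded
one_row <- data.frame(ID = c(1, 2, 3),
                      sex = c("F", "F", "M"),
                      time = c(900, 1400, 600),
                      event = c(1, 0, 1))

## split at the event times in the data, so that every individual at risk
## contributes a row for each event time
pp <- survSplit(Surv(time, event) ~ ., data = one_row,
                cut = c(600, 900), episode = "interval")
pp

The discrete-time model is then fit on pp, for example with glm(event ~ sex + factor(interval), family = binomial(link = "cloglog"), data = pp).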

Even then, with unsynchronized follow-up times, you might end up having to make assumptions about event times that make the analysis questionable. Say that ID4 has an event at time = 100 while ID5 has interval-censored survival with no event at time = 90 but an event before time = 110. When you set up the data row for ID5 at time = 100, will you treat that time for ID5 as no-event or as an event? With the unsynchronized observation times you can't make an informed decision.

Although discrete-time binomial models with complementary log-log links make sense for synchronized observation times, they thus can lose precision with unsynchronized observation times. You would be better off with modeling that takes unsynchronized interval censoring into account, as provided for example by the R icenReg package.
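A sketch of that route (the interval bounds below are read off your three individuals: the event lies between the last event-free screen and the first positive screen, with Inf marking right-censoring; three observations are of course far too few for a real fit):

library(icenReg)

## (left, right] brackets each individual's unobserved event time
ic_data <- data.frame(left = c(760, 1400, 400),
                      right = c(900, Inf, 600),
                      sex = factor(c("F", "F", "M")))

## parametric proportional-hazards model for interval-censored data
fit <- ic_par(Surv(left, right, type = "interval2") ~ sex,
              data = ic_data, model = "ph", dist = "weibull")

## estimated time at which an individual's event risk reaches 20%,
## which answers the original quantile question directly
getFitEsts(fit, newdata = data.frame(sex = "F"), p = 0.2)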

EdM
  • Thanks a lot for your answer @EdM, it took me a while to process it! Three follow-ups if you can: 1) Could you point me to some literature that explains why the GLM with cloglog has to have synchronised event times? 2) icenReg takes Surv objects of type "interval2" only; does this mean I'm not able to create a model with time-varying covariates? 3) Surv also has type = "interval", which for the outcome takes 0 = right censored, 1 = event at time, 2 = left censored, 3 = interval censored. Does that mean I cannot create a model with an interval-censored outcome and time-varying covariates? – Wojty Sep 16 '22 at 14:13
  • @WojciechBanaś see this answer for a link to Tutz and Schmid on cloglog, binomial regression and proportional hazards. The problem with unsynchronized observation times is in the next to last paragraph of the current answer: if an observation time for one individual is within the interval-censored event time interval for a second individual, do you count the second individual as censored or having the event? – EdM Sep 16 '22 at 15:22
  • @WojciechBanaś as of a few years ago, the author of the icenReg package said: "I'm not quite sure whether there are time-dependent covariate implementations [for interval censored data] available at this time." You might start with this 2006 paper on the topic, examine related papers in PubMed, and find later papers that refer to it. – EdM Sep 16 '22 at 15:32
  • @WojciechBanaś I suspect that this might be handled nicely by a Bayesian survival model, but that's far outside my expertise. As a work-around, you could consider what's often done in practice with disease progression in cancer: the actual progression occurs between clinical visits (interval-censored), but the event time is recorded as the time of the clinical visit at which progression was first detected. The event is then strictly the "detection of progression." That introduces a bias with respect to the actual progression time, but it might be acceptable for your application. – EdM Sep 16 '22 at 15:46
  • I found a mention of a Poisson model that approximates the piecewise exponential model: https://data.princeton.edu/wws509/notes/c7s4 but I'm struggling to find any other sources that mention this model. Would it make sense to use it in a setting with an asynchronous interval-censored outcome and time-updated covariates (at each screening episode)? – Wojty Oct 13 '22 at 21:47
  • @Wojty The piecewise exponential method can model baseline hazards but it still depends on point values for event times. That might be clearest in Section 7.4.3 of your link on The Equivalent Poisson Model: "If the individual died or was censored in the interval... then the time lived in the interval is... the difference between the total time lived and the lower boundary of the interval." (Emphasis added.) The total time lived for individual $i$, $t_i$, is clearly a point value in the prior exposition. – EdM Oct 14 '22 at 13:58
  • @Wojty for some applications it can be good enough to model the time until the event is observed rather than the time at which it occurred. The time to event observation doesn't require handling of interval censoring, as you observe it at some definite time. That estimate will be biased from the event occurrence time, however. You will often see that done, for example, in studies of cancer recurrence. The date of the clinical visit at which the recurrence was observed is often used as a fixed event time, even though the actual recurrence date is in the interval between clinical visits. – EdM Oct 14 '22 at 14:07