Working with Discrete-Time Survival Analysis with Random Factors in R

Question

I'm working with some data on artificial (fake) bird nests, looking into their 'survival' from animal attacks. I visited the nests once every two days for 2 weeks (so my time points are 2, 4, 6, 8, 10, 12, 14), and I want to compare the hazards associated with two different nest types (Nest_type) and between nests positioned in a risky environment and a non-risky environment (Risk_treatment). I also positioned the nests across ~20 different spatially independent sites (4 nests per site), so I want to include site (ID_site) as a random factor. I am trying to work out how best to analyse the data.

Because I collected my data once every 2 days, I felt that it would be best to do a multilevel discrete-time survival analysis. I found this page a great resource, but I have run into a few questions and issues that I would appreciate advice on.

A sample of my data

Where ID_nest marks the individual nest I have tracked over time, Enter and Exit mark the start and end time (in days), and Event marks whether the nest survived (1) or not (0).

The code I have been using:

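library(lme4)  # provides glmer()

# Discrete-time hazard model: cloglog link, random intercepts for nest and site
Gompertz_Model_Full_all_NP <- glmer(formula = Event ~ Exit + Nest_type + 
                                  Risk_treatment + (1|ID_nest) + (1|ID_site),
                       family = binomial(link = "cloglog"),
                       data = Nest_all_NP)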

The output:

My issue:

When I run this analysis, I get very high estimates for Nest_type and Risk_treatment. This results in very high hazard ratios (exp(estimate); 15432.58 and 5699.47, respectively). It probably has something to do with the random factors I have added: if I run a glm without ID_nest and ID_site, I get hazard ratios around 4 - 5, and if I run the glmm without ID_nest (but including ID_site) I end up with hazard ratios of around 15 - 20. I don't really know what to do here, or why this is happening. Some of my colleagues suggested that I should just not include ID_nest, but I don't want to use that approach unless I have a better reason than "the hazard ratios are too high".

My guess: for ID_nest, some nests are taken quickly (i.e. on the first day) while others are taken much more slowly, so there is a lot of variation. For ID_site, the same thing is happening but to a lesser extent: at sites where one nest is taken, all nests are typically taken at the same time, and most sites have either all or no nests taken. Although these results make ecological sense, they are probably leading to non-proportionality, which violates the proportional hazards assumption. I'm not sure if my guess is correct, as I haven't been able to find a solid way to test this assumption with discrete data. I also don't know what I could do if the assumption is not met (what other tests could I run?). I thought of using the survreg function from the R package survival, but didn't know where to start with the code (while handling the discrete-time issue and including the random factor/s).

Any advice would be well appreciated, thanks!

Have you tried a nested random effect? Each ID_nest belongs to a single ID_site so a nested random effect might be more appropriate than using two separate random effects. — Circus pygargus, Jul 08 '20 at 08:25
Cheers for the advice, I had changed it to (1 | ID_site/ID_nest), but I'm getting the same results. — Emma, Jul 08 '20 at 11:22

Circus pygargus · Answer 1 · 2020-07-08T13:36:17.307

When I run this analysis, I get very high estimates for Nest_Type and Risk_treatment. This results in very high hazard ratios (exp(estimate); 15432.58 and 5699.47, respectively)

I am not familiar with survival analysis nor with the cloglog function but from what you wrote I am pretty sure that you found very high values because:

You used the exponential function, which is not the inverse function of the cloglog function I believe, but the inverse of the logarithm. You should look for the inverse function of the cloglog function and use it instead of the exponential.
You only used the estimate of one variable to compute your hazard ratio. You have to consider every estimate in your model, including the intercept, to compute it.

For instance, your model for a LBQ nest in a risky environment after 4 days is:

$$ \operatorname{cloglog}(\text{Event}) = -14.8593 + 4 \times 0.8609 + 9.6442 + 8.6481 $$

so

$$ \text{Event} = \operatorname{cloglog}^{-1}(-14.8593 + 4 \times 0.8609 + 9.6442 + 8.6481) $$
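As a rough illustration of the arithmetic in the equation above, the inverse of the cloglog link is p = 1 - exp(-exp(eta)). The sketch below evaluates it in base R for the coefficients quoted in this answer. The helper name `inv_cloglog` is made up for this example, and the result is the fitted probability on the scale of Event, not a hazard ratio.

```r
# Inverse of the complementary log-log link: p = 1 - exp(-exp(eta))
# (inv_cloglog is a hypothetical helper name used only in this sketch)
inv_cloglog <- function(eta) 1 - exp(-exp(eta))

# Linear predictor from the equation above (coefficients copied from the answer)
eta <- -14.8593 + 4 * 0.8609 + 9.6442 + 8.6481

# Fitted probability for this covariate combination
p <- inv_cloglog(eta)

# Cross-check against the inverse link stored in base R's binomial family
p_check <- binomial(link = "cloglog")$linkinv(eta)

print(c(eta = eta, p = p, p_check = p_check))
```

With these coefficients the linear predictor is about 6.88, so the fitted probability is numerically indistinguishable from 1, which is another sign that the estimates are implausibly large.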

Thanks for your feedback - from my reading, to calculate the hazards ratio when using the cloglog function, one only needs to find exp(estimate). This other question/answer covers the equation/inverse function of the cloglog function: https://stats.stackexchange.com/questions/132627/interpreting-estimates-of-cloglog-logistic-regression — Emma, Jul 09 '20 at 00:05
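The claim in the comment above can be checked with a short derivation. This is only a sketch: it assumes a discrete-time hazard $h_t(x)$ for interval $t$ with an interval-specific baseline term $\alpha_t$ and a single covariate $x$ with coefficient $\beta$ (notation introduced here, not taken from the question).

$$ \operatorname{cloglog}\{h_t(x)\} = \log[-\log\{1 - h_t(x)\}] = \alpha_t + x\beta $$

$$ -\log\{1 - h_t(x)\} = \exp(\alpha_t)\,\exp(x\beta) $$

$$ \frac{-\log\{1 - h_t(x + 1)\}}{-\log\{1 - h_t(x)\}} = \exp(\beta) $$

The quantity $-\log\{1 - h_t(x)\}$ is the cumulative hazard accumulated over interval $t$ in an underlying continuous-time model, so $\exp(\beta)$ is a hazard ratio in that continuous-time proportional hazards model and does not involve the intercept. With random intercepts in the model, this ratio is conditional on the random effects.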

Sextus Empiricus · Answer 2 · 2023-06-02T06:46:32.177

By adding a random intercept per nest (and possibly also per location), you are adding a lot of flexibility to the model and the potential to overfit.

With logistic regression this may cause full separation (even when the effect is random instead of fixed) and large parameter estimates.

For an explanation of overfitting and high dimensionality, see "Why is logistic regression particularly prone to overfitting in high dimensions?"
For a case of overfitting with random effects, see "Perfect separation, perhaps? In binary outcome and repeated measure (random effect) with multiple independent variables (using R)".

You have measured both nest types on the same location. Then, you can use a proportional odds model which eliminates the variable risk between locations.

It can also be useful to plot the survival curves (e.g. using the empirical distribution) to get a visual inspection of the difference between the two nest types and risk treatments. This helps to see whether the output of your function makes sense or not.
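A minimal sketch of such a plot, assuming one row per nest with the last visit day and an attack indicator. The data below are simulated and the column names `last_day` and `attacked` are hypothetical; they are not the columns from the question. It uses Kaplan-Meier curves from the survival package as the empirical estimate.

```r
library(survival)

set.seed(1)

# Hypothetical data: one row per nest, last visit day and attack indicator
visits <- seq(2, 14, by = 2)
nests <- data.frame(
  Nest_type = rep(c("A", "B"), each = 40),
  last_day  = sample(visits, 80, replace = TRUE),
  attacked  = rbinom(80, 1, 0.6)   # 1 = attacked, 0 = censored (assumed coding)
)

# Kaplan-Meier (empirical) survival curves, one per nest type
km <- survfit(Surv(last_day, attacked) ~ Nest_type, data = nests)

plot(km, lty = 1:2, xlab = "Day", ylab = "Proportion of nests surviving")
legend("bottomleft", legend = names(km$strata), lty = 1:2)
```

Replacing `Nest_type` with the risk treatment, or with a combination of both factors, gives the corresponding curves for the other comparisons.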

In addition, the way that you analyse the event, by a dummy variable, makes the events correlated. Once the nest is attacked on day $x$ it will also be in the state of being attacked on later days. — Sextus Empiricus, Jun 02 '23 at 06:49
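One common way to avoid this in discrete-time models is a person-period layout, where each nest contributes one row per visit up to and including the visit at which it was found attacked, and no rows after that. The sketch below builds such a layout in base R from made-up data. It assumes Event = 1 codes an attack in that interval (the question may code it the other way round), and the columns `last_day` and `attacked` are hypothetical.

```r
# Hypothetical one-row-per-nest data: last visit day and whether it was attacked
nests <- data.frame(
  ID_nest  = c("n1", "n2", "n3"),
  last_day = c(4, 14, 8),
  attacked = c(1, 0, 1)
)
visits <- seq(2, 14, by = 2)  # visits every two days for two weeks

# One row per nest per interval at risk; rows after the attack are not created
person_period <- do.call(rbind, lapply(seq_len(nrow(nests)), function(i) {
  days <- visits[visits <= nests$last_day[i]]
  data.frame(
    ID_nest = nests$ID_nest[i],
    Enter   = days - 2,
    Exit    = days,
    # Event is 1 only in the interval where the attack was recorded (assumed coding)
    Event   = as.integer(days == nests$last_day[i] & nests$attacked[i] == 1)
  )
}))

print(person_period)
```

In this layout each nest has at most one row with Event = 1, so an attacked nest does not keep contributing "attacked" rows on later days.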
