For a current piece of work I'm trying to model the probability of death for beech trees in a woodland in the UK. I have records of whether trees were alive or dead for 3 different census periods, along with data on their diameter and growth rate. Each tree has an ID number so it can be identified at each time interval. However, the census intervals vary, so that the time between one survey and the next is either 4, 12 or 18 years. Obviously, the longer the census interval, the greater the probability that a tree will have died by the time it is next surveyed. I had problems making a realistic reproducible example, so you can find the data here.
The variables in the dataset are:
- ID - unique ID for the tree
- Block - ID of the 20x20 m plot in which the tree is located
- Dead - status of the tree, either dead (1) or alive (0)
- GR - annual growth rate since the previous survey
- DBH - diameter of the tree at breast height
- SL - length of time between censuses, in years
Once a tree is recorded as dead it disappears from subsequent surveys.
Ideally I would like to be able to estimate the annual probability of mortality of a tree using information on its diameter and growth rate. Having searched around for quite a while, I have seen that logistic exposure models appear to be able to account for differences in census intervals by using a modified logit link for binomial models, as detailed by Ben Bolker here. This approach was originally used by Shaffer to estimate the daily probability of nest survival for birds, where the age (and therefore exposure) of each nest differed. I've not seen it used outside the context of nest survival models, but it seems like I should be able to use it to model survival/mortality where the exposure differs.
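If I have understood the link correctly, the idea is that if $\eta$ is the linear predictor, then $\operatorname{logit}^{-1}(\eta)$ is the probability per unit of time (per year in my case), and the probability over an interval of $t$ years is

$$\mu = \left[\operatorname{logit}^{-1}(\eta)\right]^{t},$$

which is what the linkinv line in the code below implements, with $t$ supplied as the exposure.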
require(MASS)
logexp <- function(exposure = 1)
{
    linkfun <- function(mu) qlogis(mu^(1/exposure))
    ## FIXME: is there some trick we can play here to allow
    ## evaluation in the context of the 'data' argument?
    linkinv <- function(eta) plogis(eta)^exposure
    logit_mu_eta <- function(eta) {
        ifelse(abs(eta) > 30, .Machine$double.eps,
               exp(eta)/(1 + exp(eta))^2)
        ## OR .Call(stats:::C_logit_mu_eta, eta, PACKAGE = "stats")
    }
    mu.eta <- function(eta) {
        exposure * plogis(eta)^(exposure - 1) *
            logit_mu_eta(eta)
    }
    valideta <- function(eta) TRUE
    link <- paste("logexp(", deparse(substitute(exposure)), ")",
                  sep = "")
    structure(list(linkfun = linkfun, linkinv = linkinv,
                   mu.eta = mu.eta, valideta = valideta,
                   name = link),
              class = "link-glm")
}
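As a quick sanity check on how I think the link behaves (the 0.98 and the 12-year exposure here are just made-up numbers for illustration):

lnk <- logexp(12)              # a single 12-year exposure
eta <- lnk$linkfun(0.98^12)    # link-scale value for an interval probability of 0.98^12
plogis(eta)                    # 0.98 -> the per-year probability
lnk$linkinv(eta)               # 0.98^12 ~ 0.785 -> back to the interval probability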
At the moment my model looks like this, but I will incorporate more variables as I go along:
require(lme4)
Dead <- read.csv("Stack_dead.csv")
## (1|ID) is included to account for repeated measurements of the same individuals
M1 <- glmer(Dead ~ DBH + (1 | ID), data = Dead,
            family = binomial(logexp(Dead$SL)))
summary(M1)
plot(Dead$DBH, plogis(predict(M1, re.form = NA)))
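For context, this is roughly how I intend to draw a prediction curve over a range of diameters, using fixed effects only (dbh_grid and the 100-point grid are just placeholders, and the plogis step is the part I am unsure about):

dbh_grid <- data.frame(DBH = seq(min(Dead$DBH), max(Dead$DBH), length.out = 100))
eta_hat <- predict(M1, newdata = dbh_grid, re.form = NA)  # linear predictor (link scale)
dbh_grid$p_annual <- plogis(eta_hat)                      # intended as the annual mortality probability
plot(p_annual ~ DBH, data = dbh_grid, type = "l")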
Primarily I want to know:
- Does the statistical technique I am using to control for the difference in time between censuses seem sensible? If it isn't, can you think of a better way to deal with the problem?
- If the answer to the first question is yes, is using the inverse logit (plogis) the correct way to get predictions expressed as probabilities?
Thanks in advance for any help!
offset(log(exposure))? – colin Jul 28 '19 at 06:22
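If that comment means replacing the custom link with a complementary log-log link plus log(SL) as an offset (a discrete-time proportional-hazards style formulation), I think it would look roughly like this sketch (M2 is just an illustrative name):

## cloglog link with a log-exposure offset: the intercept then refers to a
## one-year interval, and longer censuses are handled through the offset
M2 <- glmer(Dead ~ DBH + offset(log(SL)) + (1 | ID),
            data = Dead, family = binomial(link = "cloglog"))
## annual mortality at a given DBH would then be 1 - exp(-exp(eta)),
## with eta the linear predictor evaluated at SL = 1 (i.e. offset = 0)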