Using R to calculate survival probabilities with time-varying covariates using an Andersen-Gill model

Question

I have been using the cph function of the rms package in R to fit an Andersen-Gill (AG) model for recurrent time to events. I include time-varying covariates in this model as per the 1982 paper from Andersen and Gill - for example I use the "dynamic" covariate recurrent outcome history to model the within-subject dependence in the recurrent events.

Now I wish to calculate survival probabilities for each of the unique observed recurrent event times (unique observed times from all subjects) for each subject. Of particular concern is that I am not using the "id" keyword and so the "cph" function "naively" believes that each row of the dataset is a separate subject. My understanding is that this approach is correct for estimation but I am uncertain if this is correct for survival probabilities.

When I say above that "My understanding is that this approach is correct for estimation" I mean with a "full" set of covariates that explain the within subject dependence, then conditional on these covariates the repeated events within a subject are independent. If some covariates are missing then this does not hold and some dependence will go unexplained. In this situation naively using the "cph" method leads to unbiased estimates but requires the "cluster" command to adjust the standard errors.

My R code to fit the model is as follows where cov1-cov3 are my covariates and subjid is my subject index;

AG_mod = cph(Surv(tstart,tstop,censored)~cov1+cov2+cov3+cluster(subjid),x=TRUE,y=TRUE,data=mydata,surv=TRUE,singular.ok=TRUE)

Then (in pseudo code) I do the following to calculate the survival probabilities;

for each subject

    datai = data for subject i

    surv_fit_i = survfit(AG_mod,newdata=dati,id=subjid,conf.type="none",se.fit=FALSE);   

end

Later update to pseudo code (the above code produces errors)

for each subject
       #get the data for ith subject
       datai = data for subject i

       #get the time intervals as a survival object
       intervals = Surv(dati$tstart,dati$tstop,dati$censored)

        #get the covariate data
        covsi =dati[,c("subjid","cov1","cov2","cov3"),with=FALSE]            

        #combine intervals with covariate data
        covsi = data.frame(covsi,intervals)

        #estimate the survival curve for this subject
        surv_fit = survfit(AG_mod,newdata=covsi,conf.type="none",se.fit=FALSE,id=covsi$subjid,censor=TRUE)           

    end

As far as I can tell the "id=covsi$subjid" statement tells R that multiple records belong to the same subject and so I am guessing that equation (2) below is implemented?

Later update to clarify model

The model I am fitting is an "intensity" model as per the 1982 Andersen-Gill definition

$E[dN_{i}(t)|Y_{i}(t),\mathcal{F}_{t^{-}}]=Y_{i}(t)\alpha_{0}(t)\exp(\beta N_{i}(t^{-}))$

where the history $\mathcal{F}_{t^{-}}$ contains all information about the counting process $N_{i}(t)$ up to "just before" time $t$, and this includes information on fixed and time-varying covariates (including the "dynamic" covariates time since last outcome and number of previous outcomes), and previous censoring events. This intensity model assumes the observed "jumps" in the counting process for subject $i$ at times $t_{1},t_{2},...$, $N_{i}(t_{1}),N_{i}(t_{2}),...$ are conditionally independent given this history $\mathcal{F}_{t^{-}}$. Furthermore since $\mathcal{F}_{t^{-}}$ contains the censoring history then unbiased estimators of $\beta$ can be obtained even if the censoring is informative - all that is required is a weaker assumption called Coarsening at Random (CAR) which requires at any time $t$ that censoring depends only on the observed data (and this could include dependence on the observed outcome history).

In an intensity model the dependence within subjects is modelled with the dynamic covariates so that the quantity $N_{i}(s)-\int_{0}^{t}Y_{i}(s)\alpha_{0}(s)\exp(\beta N_{i}(s^{-}))ds$ is a martingale. The way I think of this is there should be no subject-specific random variation in the residuals since this is captured by the dynamic covariates. In contrast omitting some or all of the dynamic covariates means the quantity $N_{i}(s)-\int_{0}^{t}Y_{i}(s)\alpha_{0}(s)\exp(\beta N_{i}(s^{-}))ds$ in general is not a martingale - this is called a "rate" rather than an intensity model and was descibed by Lin et al 2000. Rate models have unbiased estimators of $\beta$ but require an empirical estimator for the estimator variance. The way I think about this is there will be some unexplained subject-specific variation in the residuals from rate models because we "naively" estimate $\beta$ assuming conditional independence.

If an intensity process does not depend in some way on the counting process history then by definition of the intensity of a counting process the recurrent events must also be independent. In particular if the counting process is a Poisson process the recurrent events must also be independent.

Update regarding survival probabilities

An estimator of the survival probability $S(t|x_{i})$ for covariate vector $x_{i}$ is

\begin{align} \hat{S}(t|x_{i})=\prod_{T_{j}\leq t}\{1-\Delta\hat{A}(T_{j}|x_{i})\}, \end{align}

where $\Delta\hat{A}(T_{j}|x_{i})=1/Y(T_{j})$ is an estimator of the "increment" of the cumulative hazard $\hat{A}(t|x_{i})$. For a fixed covariate vector $x_{i}(s')$, for a point $s'\in(0,t]$, it is sometimes not meaningful to calculate the cumulative hazard in the time interval $[0,t]$, but nonetheless this can be accomplished by

\begin{align} \hat{A}(t|x_{i})=\exp(\hat{\beta} x_{i}(s'))\int_{0}^{t}d\hat{A}_{0}(u),\hspace{200pt}(1) \end{align}

where

\begin{align*} \hat{A}_{0}(t)&=\int_{0}^{t}\hat{\alpha}_{0}(u)du\\ &=\sum_{T_{j}\leq t}\frac{1}{\sum\nolimits_{l\in\mathcal{R}_{j}}\exp(\hat{\beta} x_{l}(T_{j}))}, \end{align*}

where $\mathcal{R}_{j}:=\{l:Y_{l}(T_{j})=1\}$ is the risk set of subject indices at observed event time $T_{j}$.

Alternatively the cumulative hazard corresponding to the whole treatment path $x_{i}(s)$ for $0<s\leq t$ can make more sense and can be calculated as

\begin{align} \hat{A}(t|x_{i})=\int_{0}^{t}\exp(\hat{\beta} x_{i}(u))d\hat{A}_{0}(u),\hspace{200pt}(2) \end{align}

It is equation (2) that I believe makes most sense to calculate at any given $t$, but what I am wondering is exactly how is this performed in R?

Is it correct that you have individual survival time? And that for every individuals you have some recurrent event times at some time points, and you use the number of events as a time-dependent covariate (which changes at every recurrent event time)?
If so, what exactly do you want to obtain? Survival curve of the terminal event, given a certain history of recurrent events (i.e. if a subject has recurrent events at t1, t2, t3, then what is the survival curve conditional on these events) ? — Theodor, Mar 11 '16 at 12:46
Thanks for your comment @Theodor, I have no terminal event other than censoring, and yes as you describe I have recurrent events at t1, t2, t3, .. for a subject. What I wish to do is to calculate the survival probability at time $T=t$ conditional on the time varying covariates. So say $t2< t<t3$ then the covariates recorded at t2 will be the most up to date covariates before $t$, call these $x(t)$, and I wish to calculate the survival probability at time $T=t|x(t)$ — dandar, Mar 11 '16 at 14:48
Yes where I wish to calculate $S[t|x(t)]:=P[T>t|x(t)]$ which is the probability of the time of the next event being greater than $t$. Thus as $t$ varies the covariate vector also varies, but nonetheless I we should still end up with a "survival" curve that decreases towards zero as $t$ increases - just like we do when we have covariates that do not vary with time? — dandar, Mar 12 '16 at 22:59
In the line "$E[dN_{i}(t)|Y_{i}(t),\mathcal{F}_{t^{-}}]= ...$" there should be a "dt" on the right-hand side. — VictorZurkowski, Apr 16 '18 at 23:07

score 2 · Answer 1 · answered Mar 14 '16 at 13:13

My line of thought goes like this. As per the AG model, each individual is represented by a counting process $N_i(t)$ with intensity $\lambda_i(t)$ which can be written as $\lambda_i(t) = \lambda_0(t) \exp(\beta' x_i(t))$. By this it is implicit that only the "current" value of $x_i$ matters for the intensity. The counting process grows when an individual has an event. I also assume independent censoring (actually stopping of the process). Note that these assumptions are implicit from the way that you fitted the model.

The probability of no events in $(a,b)$ (improperly, a "survival") is then $$ S_i(t|s) = \exp \left( -\int_s^t \lambda_0(u) \exp(\beta'x_i(u)) du \right) $$ Note that the probability of no events does not directly relate to the probability of one event, as the complementary event to "no events in a period" is "at least one event during a period".

Then say you are interested in $S(t|s_0)$. Then the idea would be to fix $x_i = x_i(s_0)$ (i.e. assume the covariates don't change after $s_0$), and we would get $$ S_i(t|s_0) = S_i(t) / S_i(s_0) $$

Since the part before $s_0$ cancels out, together with my assumption that only the current value of $x_i$ matters for the intensity, means that this is equal to $$ S_i(t|s_i) = \exp \left( -\exp(\beta'x_i(s_0)) \int_{s_0}^t \lambda_0(u)du \right) $$

Then in R you can do it like that (I give an example with a data set):

library(frailtypack)
data(readmission)

mod1 <- coxph(Surv(t.start, t.stop, event) ~ sex + charlson + cluster(id), data = readmission)

# set the covariate values at s_0
mycov <- data.frame(sex = "Female", charlson = "1-2")
sf <- survfit(mod1, newdata = mycov)

# with different s_0
par(mfrow=c(2,2))
time <- c(300, 500, 700, 1000)
for(i in time) {
  pos <- which.max(sf$time[sf$time <= i])
  S_s0 <- sf$surv[pos]

  with(sf, plot(time[pos:length(time)], surv[pos:length(surv)] / S_s0), type = "l")
}

Here you get the plots of the survival curves, which correspond to the values $(t, S(t|s_0))$ for $t\geq s_0$.

The other comments that I would give on this are the following: it is difficult to talk about the distribution of the next event. This is because in the AG formulation the time since previous event does not play any role. In other words, if you would like to take that into account, more complicated stochastic models should be used, where for example you include the "previous number of events" as a time-dependent covariate. This does complicate things a lot and the interpretation of the estimated quantities is most likely very difficult.

The second comment I have is about the nature of the time dependent covariates. Mostly, the AG works nicely with "external" covariates, such as air pollution, or something which is not directly measured on the subject ("external" of the recurrent event process). This is mostly because the first expression I wrote here is the probability of no events during $(a,b)$ relies on the assumption that the number of events in any given interval is Poisson distributed. This is true if the covariates are external. A discussion on this can be found in several textbooks, for example in Cook & Lawless at section 2.5. If your time-dependent covariates do depend on the recurrent event process, then it should be modeled jointly with the recurrent events process.

Thanks @Theodor for your answer - I really appreciate this but will need some time to digest your comments. For now I have updated my original question to clarify the recurrent event model I am using. I will get back to you ASAP. — dandar, Mar 14 '16 at 19:10
Even though @Theodor your equation $S_i(t|s) = \exp \left( -\int_s^t \lambda_0(u) \exp(\beta'x_i(u)) du \right)$ does not use the product form as mine does in the question (not important) is your equation not similar in principle to equation (1) - i.e. path of the covariate vector through time from $T=0$ to $T=t$ is not taken into account? Does this makes sense? Should I be using something like my equation (2)? Furthermore I see your R code call to "survfit" does not use the "id" statment. Is it true that using the "id" statement gives equation (2) and not using it gives (1) or your equation? — dandar, Mar 15 '16 at 09:39
In terms of your general comments @Theodor on the AG model, I am not saying you are wrong since I am no expert, but I believe time-varying covariates can be used which depend on the outcome history since this ensures conditional independence of all the recurrent events within a subject. I believe this is in the spirit of the original model proposed by Andersen and Gill, 1982. Even though there are reasons why we would not wish to include these in the model (i.e. they "steal strength" from non-dynamic covariates, and in designed experiments they are post-randmoisation) they are "allowed". — dandar, Mar 15 '16 at 09:47
To follow on from above: we might fit dynamic covariates to help understand the true nature of the dependence of the outcomes on the outcome history, but for understanding the effect of a non-dynamic covariate (say the effect of treatment in a clinical trial) we would not want the non-dynamic covariate effect underestimated, nor would we want to risk introducing bias into the estimate, and so we would fit an AG model without dynamic covariates - strictly not an AG model but a "rate" model as proposed by Lin et al 2000. — dandar, Mar 15 '16 at 09:52
@dandar my point about the nature of time-dependent covariates is mostly about taking care in using them. For example, if you use the previous number of events as a time-dependent covariate, you implicitly specifiy a different stochastic model for the process, which has a complicated interpretation. Andersen & Gill suppose that z(t) is a vector of covariates which is "predictable". I think that, on the contrary, this is not in the spirit of "internal" covariates. — Theodor, Mar 15 '16 at 11:47
@dandar Now I also see the updates to your post. I think it is indeed a nice problem. I do have my doubts on $N_i(t-)$ in the covariate part. It can be done and the model can be fitted, but I am not sure about the validity of the obtained model. I also do not know how regression coefficients or other estimated quantities can be interpreted (I generally prefer a random effect approach in these kind of situations). I think such models are discussed in the Cook & Lawless book on Chapter 5 ("General intensity models"). Maybe it is worth a read. — Theodor, Mar 15 '16 at 11:53
@dandar My equation with $S_i(t|s)$ is defined as the probability of no events during that period. The $\hat S$ that you mentioned is an estimate for this quantity. I am inclined to think that with $N_i(t-)$ this quantity does not have this interpretation any more. There is also a vignette in the survival package "Adjusted Survival Curves" which might give more insight on how R calculates these kind of things. — Theodor, Mar 15 '16 at 11:58

Using R to calculate survival probabilities with time-varying covariates using an Andersen-Gill model

1 Answers1

Linked