I have been using the cph function of the rms package in R to fit an Andersen-Gill (AG) model for recurrent time to events. I include time-varying covariates in this model as per the 1982 paper from Andersen and Gill - for example I use the "dynamic" covariate recurrent outcome history to model the within-subject dependence in the recurrent events.
Now I wish to calculate survival probabilities for each of the unique observed recurrent event times (unique observed times from all subjects) for each subject. Of particular concern is that I am not using the "id" keyword and so the "cph" function "naively" believes that each row of the dataset is a separate subject. My understanding is that this approach is correct for estimation but I am uncertain if this is correct for survival probabilities.
When I say above that "My understanding is that this approach is correct for estimation" I mean with a "full" set of covariates that explain the within subject dependence, then conditional on these covariates the repeated events within a subject are independent. If some covariates are missing then this does not hold and some dependence will go unexplained. In this situation naively using the "cph" method leads to unbiased estimates but requires the "cluster" command to adjust the standard errors.
My R code to fit the model is as follows where cov1-cov3 are my covariates and subjid is my subject index;
AG_mod = cph(Surv(tstart,tstop,censored)~cov1+cov2+cov3+cluster(subjid),x=TRUE,y=TRUE,data=mydata,surv=TRUE,singular.ok=TRUE)
Then (in pseudo code) I do the following to calculate the survival probabilities;
for each subject
datai = data for subject i
surv_fit_i = survfit(AG_mod,newdata=dati,id=subjid,conf.type="none",se.fit=FALSE);
end
Later update to pseudo code (the above code produces errors)
for each subject
#get the data for ith subject
datai = data for subject i
#get the time intervals as a survival object
intervals = Surv(dati$tstart,dati$tstop,dati$censored)
#get the covariate data
covsi =dati[,c("subjid","cov1","cov2","cov3"),with=FALSE]
#combine intervals with covariate data
covsi = data.frame(covsi,intervals)
#estimate the survival curve for this subject
surv_fit = survfit(AG_mod,newdata=covsi,conf.type="none",se.fit=FALSE,id=covsi$subjid,censor=TRUE)
end
As far as I can tell the "id=covsi$subjid" statement tells R that multiple records belong to the same subject and so I am guessing that equation (2) below is implemented?
Later update to clarify model
The model I am fitting is an "intensity" model as per the 1982 Andersen-Gill definition
$E[dN_{i}(t)|Y_{i}(t),\mathcal{F}_{t^{-}}]=Y_{i}(t)\alpha_{0}(t)\exp(\beta N_{i}(t^{-}))$
where the history $\mathcal{F}_{t^{-}}$ contains all information about the counting process $N_{i}(t)$ up to "just before" time $t$, and this includes information on fixed and time-varying covariates (including the "dynamic" covariates time since last outcome and number of previous outcomes), and previous censoring events. This intensity model assumes the observed "jumps" in the counting process for subject $i$ at times $t_{1},t_{2},...$, $N_{i}(t_{1}),N_{i}(t_{2}),...$ are conditionally independent given this history $\mathcal{F}_{t^{-}}$. Furthermore since $\mathcal{F}_{t^{-}}$ contains the censoring history then unbiased estimators of $\beta$ can be obtained even if the censoring is informative - all that is required is a weaker assumption called Coarsening at Random (CAR) which requires at any time $t$ that censoring depends only on the observed data (and this could include dependence on the observed outcome history).
In an intensity model the dependence within subjects is modelled with the dynamic covariates so that the quantity $N_{i}(s)-\int_{0}^{t}Y_{i}(s)\alpha_{0}(s)\exp(\beta N_{i}(s^{-}))ds$ is a martingale. The way I think of this is there should be no subject-specific random variation in the residuals since this is captured by the dynamic covariates. In contrast omitting some or all of the dynamic covariates means the quantity $N_{i}(s)-\int_{0}^{t}Y_{i}(s)\alpha_{0}(s)\exp(\beta N_{i}(s^{-}))ds$ in general is not a martingale - this is called a "rate" rather than an intensity model and was descibed by Lin et al 2000. Rate models have unbiased estimators of $\beta$ but require an empirical estimator for the estimator variance. The way I think about this is there will be some unexplained subject-specific variation in the residuals from rate models because we "naively" estimate $\beta$ assuming conditional independence.
If an intensity process does not depend in some way on the counting process history then by definition of the intensity of a counting process the recurrent events must also be independent. In particular if the counting process is a Poisson process the recurrent events must also be independent.
Update regarding survival probabilities
An estimator of the survival probability $S(t|x_{i})$ for covariate vector $x_{i}$ is
\begin{align} \hat{S}(t|x_{i})=\prod_{T_{j}\leq t}\{1-\Delta\hat{A}(T_{j}|x_{i})\}, \end{align}
where $\Delta\hat{A}(T_{j}|x_{i})=1/Y(T_{j})$ is an estimator of the "increment" of the cumulative hazard $\hat{A}(t|x_{i})$. For a fixed covariate vector $x_{i}(s')$, for a point $s'\in(0,t]$, it is sometimes not meaningful to calculate the cumulative hazard in the time interval $[0,t]$, but nonetheless this can be accomplished by
\begin{align} \hat{A}(t|x_{i})=\exp(\hat{\beta} x_{i}(s'))\int_{0}^{t}d\hat{A}_{0}(u),\hspace{200pt}(1) \end{align}
where
\begin{align*} \hat{A}_{0}(t)&=\int_{0}^{t}\hat{\alpha}_{0}(u)du\\ &=\sum_{T_{j}\leq t}\frac{1}{\sum\nolimits_{l\in\mathcal{R}_{j}}\exp(\hat{\beta} x_{l}(T_{j}))}, \end{align*}
where $\mathcal{R}_{j}:=\{l:Y_{l}(T_{j})=1\}$ is the risk set of subject indices at observed event time $T_{j}$.
Alternatively the cumulative hazard corresponding to the whole treatment path $x_{i}(s)$ for $0<s\leq t$ can make more sense and can be calculated as
\begin{align} \hat{A}(t|x_{i})=\int_{0}^{t}\exp(\hat{\beta} x_{i}(u))d\hat{A}_{0}(u),\hspace{200pt}(2) \end{align}
It is equation (2) that I believe makes most sense to calculate at any given $t$, but what I am wondering is exactly how is this performed in R?
If so, what exactly do you want to obtain? Survival curve of the terminal event, given a certain history of recurrent events (i.e. if a subject has recurrent events at t1, t2, t3, then what is the survival curve conditional on these events) ?
– Theodor Mar 11 '16 at 12:46