Cross-validation curves in a Cox model with a time-dependent covariate

Question

I need to draw cross-validated calibration curves to check the validity of a Cox model. The data are split into training and test sets. The coefficients are estimated on the training set, and the linear predictors for the test set are computed with those coefficients. But how can I then draw a calibration plot, such as the one produced by calPlot, when the coxph model includes a time-dependent covariate? I am asking about both the method and the code.

Thanks a lot!

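library("survival")  # survSplit(), coxph(), Surv() and the veteran data
library("rms")
# Split follow-up at 90 and 180 days; tgroup indexes the time period
vet2 <- survSplit(Surv(time, status) ~ ., data = veteran,
                  cut = c(90, 180),
                  episode = "tgroup", id = "id")
test <- vet2[vet2$id %in% 1:25, ]
training <- vet2[vet2$id %in% 26:137, ]
# diagtime gets a separate coefficient in each time period
coxtrain <- coxph(Surv(tstart, time, status) ~ age + diagtime:strata(tgroup), data = training)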

But what should I do next?

EdM · Answer 1 · 2022-12-08T21:30:35.853

This type of calibration requires making predictions from a Cox model, based on a set of time-varying covariates in the counting-process data format. Such predictions aren't even possible in several software packages. This page and its links go into more detail. There is a substantial question as to when, if ever, such predictions make sense: if you have a set of covariate values for an individual at a specified time, you already know that the individual has survived that long. That risks circular reasoning and survivorship bias.

To summarize: the Python lifelines package won't allow such predictions at all. The calibrate() function in the rms package won't work with such data (although there is still some post-modeling functionality provided by that package for such models). The predict.coxph() function in the standard survival package can produce a particular type of per-subject prediction, as outlined in that last link, but you have to pay very strict attention to the data format that function, or the survfit.coxph() function, or any related function in another package expects. Read the manuals carefully. (Coding questions per se are off-topic on this site.)

Even if you are willing to overlook those issues, there still is a problem in how you will define the predicted survival probabilities, at some particular point in time, that would be used to construct a survival model's calibration curve. That's central to the calibration curve, for both "observed" and "predicted" values. For example, the manual page for the calibrate() function in the rms package says:

For survival models, "predicted" means predicted survival probability at a single time point, and "observed" refers to the corresponding Kaplan-Meier survival estimate, stratifying on intervals of predicted survival, or, if the polspline package is installed, the predicted survival probability as a function of transformed predicted survival probability using the flexible hazard regression approach...

For covariates that are fixed in time that's straightforward. The baseline covariates hold over the entire time course covered by the events in the original data, so the predicted survival for an individual is well defined, based on the baseline hazard and those fixed covariate values, even beyond that individual's last follow-up or event time.

But what do you use for an individual's corresponding "predicted" survival with time-varying covariates? You can get survival-curve estimates for new individuals having time-varying covariate values via the id argument to the survfit() function. But what covariate values do you use after the last observation time for an individual, if for calibration you need a "predicted" survival at a time later than that individual's last observation? Either you make some arbitrary choice, or you are stuck with predictions (and calibration) only at the shortest follow-up time within your test cohort. That's pretty limiting.

So I'm not sure that there is a reliable way to do calibration curves for a Cox model with time-varying covariates. You can do some model validation via cross-validation or bootstrapping, however. For example, the validate() function in the rms package can work on such Cox models as it doesn't require any survival predictions.

Thank you very much for your detailed answer! I have a covariate for which we have a single value per patient but its effect on survival is not proportional over time. So I added a stratification over time. Calibration curves with a regular Cox model (without stratification) are not very good. I wondered if maybe I could handle this non-proportionality, in order to improve my calibration curves? Maybe my problem is actually simpler than trying to make predictions with time-varying coefficients. Would you have a suggestion for me? — Flora Grappelli, Dec 12 '22 at 14:09
@FloraGrappelli stratification over time, as in Section 4.1 of the R time dependence vignette, still involves the Surv(start, stop, event) data format of time-varying covariates. You can do some types of model validation as noted in the answer, but so far as I know there is no reliable way to construct a true calibration curve with such a model and data. — EdM, Dec 12 '22 at 14:25
Despite stratification over time, individuals have only one covariate value over time. It is only the coefficient associated with the covariate that is not the same depending on the period considered. Does it simplify the problem to consider a time varying coefficient and not a time varying covariate? — Flora Grappelli, Dec 12 '22 at 14:39
@EdM: can I use the hare approach as you suggested here: https://stats.stackexchange.com/questions/206291/how-to-make-calibration-plot-for-survival-data-without-binning-data? — Flora Grappelli, Dec 12 '22 at 15:01
@FloraGrappelli my initial reaction was wrong, at least in part. If you separate out the data for the first time period for prediction, you should be able to do calibration at least for that time period. If there are no other time-varying values you could, with good justification, construct a prediction data set for validation that carries over the time-fixed covariate values into the second (and any subsequent) period and use survfit() with the id argument to get survival predictions for all cases and times. You would probably have to code that yourself. — EdM, Dec 12 '22 at 15:15
@FloraGrappelli you might be able to use the hare method for getting estimates of "observed" probabilities, but I don't have any experience using it directly. I've only used it implicitly in calls to the rms::calibrate() function. Binning by predicted probability of survival is an alternate option, which might work OK if you have a lot of data. — EdM, Dec 12 '22 at 15:20
Thanks for your help. I noticed that when I use only the survSplit data belonging to the first period ([0-5] years), I get exactly the same calibration graphs as with the whole data set when I evaluate my prediction at 5 years. So maybe I don't need to complicate things: I could just take the data belonging to the first period (t1=1 from survSplit) to validate the model and follow the classical procedure. Does that make any sense? — Flora Grappelli, Dec 12 '22 at 23:21
BTW: it seems that the calibrate function does not handle external validation, so I use val.surv from rms instead. — Flora Grappelli, Dec 12 '22 at 23:23
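
The suggestion made in the comments (carry the time-fixed covariate values into every time period and call survfit() with the id argument) can be sketched roughly as follows. This is only a minimal illustration built on the veteran example from the question, not a validated procedure. The names pred, base, test_ids and horizon, and the 365-day horizon itself, are assumptions made for the example; the sketch also assumes that survSplit() numbered subjects by their row in veteran, so that test ids 1:25 are rows 1:25 of the original data.

```r
library(survival)

# Same split as in the question: periods (0,90], (90,180], (180,Inf)
vet2 <- survSplit(Surv(time, status) ~ ., data = veteran,
                  cut = c(90, 180), episode = "tgroup", id = "id")
training <- vet2[vet2$id %in% 26:137, ]
coxtrain <- coxph(Surv(tstart, time, status) ~ age + diagtime:strata(tgroup),
                  data = training)

# Assumption: ids created by survSplit() match row numbers of `veteran`
test_ids <- 1:25
base     <- veteran[test_ids, ]
horizon  <- 365   # hypothetical end of the prediction window, in days
n_test   <- length(test_ids)

# Three rows per test subject, one per period, with the baseline
# covariate values carried forward unchanged into each period
pred <- data.frame(
  id       = rep(test_ids, each = 3),
  tstart   = rep(c(0, 90, 180), times = n_test),
  time     = rep(c(90, 180, horizon), times = n_test),
  status   = 0,                        # required by the formula, not used
  tgroup   = rep(1:3, times = n_test),
  age      = rep(base$age, each = 3),
  diagtime = rep(base$diagtime, each = 3)
)

# One predicted survival curve per test subject
sfit <- survfit(coxtrain, newdata = pred, id = id)

# Predicted survival probability at a single time point (here 180 days)
s180 <- summary(sfit, times = 180)
pred_surv_180 <- s180$surv
```

The values in pred_surv_180 would play the role of the "predicted" probabilities in a calibration plot. The "observed" side still has to be estimated from the test subjects' actual follow-up, for example by Kaplan-Meier estimates within bins of predicted survival when there are enough data. It is worth inspecting s180 to confirm which prediction belongs to which subject before matching predictions to outcomes.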
