Left censoring with time-varying covariates

Question

I have a dataset in which some participants enter the study with the disease of interest already present. For the others, I have several observation points at which different factors were measured (weight, blood pressure, etc.), along with whether the disease had occurred. Event = disease onset. The aim is to check hazard ratios.

I wonder how I can include both left- and right-censored observations with time-varying covariates in R. Ideally the model would also allow time-varying effects and/or random effects, but the priority is to use both prevalent and incident cases in the regression.

The survival package handles interval censoring, including left censoring, though only for parametric baseline hazards (survreg with Weibull, lognormal, etc., and type = "interval2"). The problem I see is that prevalent cases would enter the likelihood as P(T < age_start | x) with x measured at age_start, which for a time-varying x such as weight amounts to looking into the future, as described, for example, in the survival vignette (https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf). Weight at birth was clearly different, and I don't know any of the factors before age_start. In contrast, for right-censored observations the likelihood term is P(T > age_censored | x at age_start); for non-censored observations it is P(T in interval k, where the disease was detected | x at the start of this interval), or P(T = age_disease | x last measured before age_disease). Am I right that with this method, for left-censored data I imply that the risk factors "before disease", or at time 0, were as measured at age_start? If so, I conceptually cannot use prevalent cases, because I don't know the risk factors preceding the event (and the model will think I do).
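
To make the terms above concrete, here is a sketch of the likelihood that an interval-type parametric fit would maximise, assuming one row per person with a single covariate vector x_i (the values measured at age_start). The symbols e_i (event code), t_i (age at HF), c_i (age_censored) and a_i (age_start) are notation introduced here for illustration:

```latex
L(\theta) \;=\;
  \prod_{i:\, e_i = 1} f(t_i \mid x_i;\theta)\;
  \prod_{i:\, e_i = 0} S(c_i \mid x_i;\theta)\;
  \prod_{i:\, e_i = 2} \bigl[\,1 - S(a_i \mid x_i;\theta)\,\bigr]
```

The last product is where the concern arises: 1 - S(a_i | x_i) integrates the hazard from age 0 to a_i while holding x_i fixed at its value at a_i.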

Another point of confusion: if the time-scale is years since study start (and not the individual's age, as above), the likelihood terms carry information on how long it took (or did not take) a person to develop the disease after the risk factors were measured. On the age time-scale this information is lost: the term says that a person with a certain weight did not have the disease at age 60, but not how long ago the weight was measured. This should not be a problem for Cox regression, which compares against the pool of everyone at risk at that time, but would it be an issue for a parametric survival function, given the way the likelihood is evaluated?

# EXAMPLE: 
# event = heart failure (HF), individuals observed for 10 years from age_start
# around 20% already had HF before study started, 
# event =2 for these left-censored individuals (age of HF < age_start); age_censored = age_start
# event =1 if HF was observed (age of HF is known); age_censored = age_hf
# event = 0 if HF not observed (age of HF > age at study end); age_censored = age_start + 10
library(survival)
set.seed(100)
df = simulatedata(N=1000)  # simulatedata() is defined at the end of the question
df[1:25,c("time_0", "time_1","age_start", "age_censored", "hf_age", "event","bmi_0", "hyp_0", "gender")]
# output (first rows shown):
# id time_0 time_1 age_start age_censored hf_age event bmi_0 hyp_0 gender
# 1       0    0.0      68.9         68.9    NaN     2  30.9     0      1
# 2       0    3.2      51.8         55.0   55.0     1  25.7     0      1
# 3       0    4.3      50.4         54.7   54.7     1  24.6     0      0
# 4       0   10.0      40.3         50.3    NaN     0  24.1     0      1
######### ONLY INCIDENT CASES (no left censoring)
#1) timescale re-aligned with study time 
df0 = df[df$event!=2,]
coxph(Surv(time_1, event)~bmi_0+age_start+hyp_0 + gender, data = df0)
#2) timescale re-aligned with study time, parametric survival function.
#2a)
survreg(Surv(time_1, event)~bmi_+age_start+hyp_0 + gender, data = df0, dist = "weibull")
#2b) Using "interval" type of data entry similar to https://academic.oup.com/aje/article/173/9/1078/122651?login=true
df$time_interval  = df$time_1
df[df$event==2, "time_interval"] = df[df$event==2, "time_0"] 
df0 = df[df$event!=2,]
survreg(Surv(time_interval, rep(NA, dim(df0)[1]), event, type = "interval")~bmi_0+hyp_0+age_start+gender, data = df0, dist = "weibull")
#3) timescale re-aligned with individual's age
coxph(Surv(age_start, age_censored, event)~bmi_0+hyp_0 + gender, data = df0)
#4) timescale re-aligned with individual's age, parametric baseline survival
df$age_interval  = df$age_censored
df[df$event==2, "age_interval"] = df[df$event==2, "age_start"] 
df0 = df[df$event!=2,]
survreg(Surv(age_interval, rep(NA, dim(df0)[1]), event, type = "interval")~ bmi_0+hyp_0 + gender, data = df0, dist = "weibull")
# 1), 2a), 2b) and 3) all give similar results,
# while 4) is completely different
# -> this relates to my second question on age vs study time-scale for Cox vs parametric estimates
######## WITH LEFT CENSORED DATA
#5) timescale with study time won't work as time_interval == 0 makes no sense; at "birth" all are event-free
survreg(Surv(time_interval, rep(NA, dim(df)[1]), event, type = "interval")~bmi_0+hyp_0+age_start+gender, data = df, dist = "lognormal")
#6) timescale re-aligned with age - works in parametric, interval models, but gives very different estimates for the model params
lsweib = survreg(Surv(age_interval, rep(NA, dim(df)[1]), event, type = "interval")~bmi_0+hyp_0+gender, data = df, dist = "weibull")
summary(lsweib )
# 6) is similar-ish to 4) and different from 1)-3)
# This is the 1st part of my question: by using
# interval censoring with covariates measured at study start, do I
# essentially feed the model with BMI and hypertension as if those
# were known before the disease happened, and they were the same as
# measured at age_start? This is not a problem for gender or any
# time-constant covariate, but for time-varying ones this will bias results
#*******************************
# code for the simulated data; no particular idea other than to create some dependence of time on the covariates and to have cases before and after the study start/end
simulatedata = function (N=1000){
observe_time = 10
df = data.frame(age = round(rnorm(N, 50,15),1), bmi = round(rnorm(N, 26, 2),1),  hyp = rbinom(N,1,0.10), gender = rbinom(N,1,0.5))
df$bmi_ = (df$bmi-26)/2
df$age_ = (df$age - 50)/15
df$hf_age = df$age + round((rexp(N, 0.15*exp((df$bmi-26)/2*1 + (df$hyp*0.7)+ (df$age-50)/15*0.4))),1)
# describe(df$hf_age)  # optional check; describe() needs e.g. the Hmisc package
df$age_start = df$age+1
df$age_end = df$age_start + observe_time 
df[df$hf_age <= df$age_start, "event"] = 2
df[df$hf_age > df$age_start + 10, "event"] = 0
df[df$hf_age <= df$age_start + 10 & df$hf_age > df$age_start, "event"] = 1
df$time_0 = 0
df[df$event ==0, "time_1"] = 10
df[df$event ==1, "time_1"] = df[df$event ==1, "hf_age"]-df[df$event ==1, "age_start"]
df[df$event ==2, "time_1"] = 0
df$age_censored = df$age_start + df$time_1
df[df$event !=1, "hf_age"] = NaN
df$bmi_0 = df$bmi
df$hyp_0 = df$hyp
return (df)
}

The choice of time = 0 is critical to setting up and interpreting any survival model. As you rightly note, different choices have different implications for modeling. What choice do you want to make? You say "The aim is to check hazard ratios," but hazard ratios and corresponding baseline hazard functions may differ depending on your choice. It's also not completely clear what "event" you are trying to model; is it death or something else? Please provide that information by editing your question, as comments are easy to overlook and can get deleted. — EdM, Jun 27 '21 at 17:32
Event is onset of a disease, say heart failure, and for some people I only know that it had already happened by a certain age, but not exactly when. I would like to include both groups in the analysis to estimate how covariates affect the instantaneous rate of disease. But there are two problems: 1) the study time-scale won't work, as S(0) = 1, so I should use an age-based time-scale; 2) only parametric survreg handles left censoring, and it somehow gives different results on the age scale (while its results were similar to Cox on the usual time-scale). — DianaS, Jun 28 '21 at 00:26

score 3 · Accepted Answer · answered Jun 30 '21 at 18:15

I think that you understand the situation a bit better than you are giving yourself credit for. Here are the critical issues.

Events in survival analysis

Survival modeling with covariates (whether fixed in time or time-varying) typically evaluates the association of current covariate values with the hazard of an event at that time. If the event has already happened to some individuals before your study began and an event can only happen at most once (as seems to be the case here), then you don't have any information available from those individuals about covariate-event associations for any unpredictable time-varying covariates. In your terminology in that situation, "prevalent" cases at your study start time can't be used to evaluate associations of time-varying covariates with the disease incidence evaluated by survival modeling.
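
For incident cases, "current covariate values" translates into counting-process (start, stop] data, with one row per person and measurement period. The following is only a minimal sketch on made-up data: the two measurement times, the variables z1/z2 and the constant baseline hazard are all assumptions chosen to keep the simulation simple.

```r
library(survival)

set.seed(42)
n    <- 2000
beta <- 0.8    # assumed true log hazard ratio
base <- 0.05   # assumed constant baseline hazard per year

# Hypothetical binary covariate measured at entry (z1) and again at year 5 (z2)
z1 <- rbinom(n, 1, 0.5)
z2 <- rbinom(n, 1, 0.5)

# Waiting times under a hazard that depends on the covariate value current at the time
t1 <- rexp(n, base * exp(beta * z1))
t2 <- rexp(n, base * exp(beta * z2))

# First period (0, 5]: everyone contributes a row
first <- data.frame(id = 1:n, tstart = 0, tstop = pmin(t1, 5),
                    event = as.integer(t1 < 5), z = z1)

# Second period (5, 10]: only those still event-free at year 5
still  <- t1 >= 5
second <- data.frame(id = which(still), tstart = 5,
                     tstop = 5 + pmin(t2[still], 5),
                     event = as.integer(t2[still] < 5), z = z2[still])

long <- rbind(first, second)

# Each row is compared only with others at risk at the same time,
# using the covariate value that applied during that row's interval
fit_tv <- coxph(Surv(tstart, tstop, event) ~ z, data = long)
coef(fit_tv)
```

A prevalent case has no such rows at all: there is no interval before the event for which the covariate value is known.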

If you are evaluating covariates fixed in time (like a genotype) then such "prevalent" cases might provide information. If you are examining recurrent events or building a multi-state model (e.g., modeling heart failure, death from heart failure, and death from other causes all together) then they might provide information about subsequent events. But otherwise you run the risk of providing the model with incorrect information about time-varying covariate values as of the event time.

Time origin in survival analysis

The choice of time = 0 depends on what type of survival you want to model based on your interest in the subject matter. That's not a statistical question, although it has implications for statistical analysis.

Obviously, whether you are modeling time since study start or time since birth will tend to give different answers in general. Furthermore, if you are using birth as time = 0, then you have left-truncated values for age, as you have no information about those individuals before their age at study entry. People who never reached those ages aren't in the study at all, so left truncation needs to be incorporated into the modeling.
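
A minimal sketch of how left truncation (delayed entry) is usually expressed for a Cox model on the age scale, using made-up data. The Weibull onset ages, the entry ages and the single covariate x are assumptions for illustration only.

```r
library(survival)

set.seed(1)
n     <- 5000
x     <- rbinom(n, 1, 0.5)      # time-constant covariate
sigma <- 0.25                   # Weibull scale in the log-time parameterisation

# Age at onset: log(T) = log(60) - 0.3 * x + sigma * W
onset <- rweibull(n, shape = 1 / sigma, scale = 60 * exp(-0.3 * x))
entry <- runif(n, 40, 60)       # age at study entry

# Left truncation: people whose onset preceded entry are never enrolled
keep <- onset > entry
d <- data.frame(x     = x,
                entry = entry,
                exit  = pmin(onset, entry + 10),
                event = as.integer(onset <= entry + 10))[keep, ]

# Surv(entry, exit, event): each person joins the risk set only from the entry age
fit_lt <- coxph(Surv(entry, exit, event) ~ x, data = d)
coef(fit_lt)   # true log hazard ratio under these assumptions: 0.3 / sigma = 1.2
```

As far as I know, survreg() does not accept this (start, stop] form of Surv(), which is one reason the parametric fits on the age scale behave differently.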

There is, however, an important difference in how that choice of time origin will affect a parametric versus a Cox model.

A parametric model specifies the overall characteristics of the hazard/survival curves versus time starting from time = 0--however you chose that time. If your participants are on the order of 50 years old and you have a parametric model, there will necessarily be a large difference in how covariates affect the shape of the hazard and associated survival curves depending on your choice of what time = 0 means. In a parametric model with birth as time = 0, you have no information about the shape of the curve or covariate-outcome associations for the first several decades until events start. Nevertheless, your analysis of events in later decades is specifying the shape of the survival curve and the covariate associations with survival dating back to birth.*

A Cox model, in contrast, effectively learns the baseline hazard from the cases themselves. The baseline hazard doesn't enter the initial calculations at all, but it then can be estimated from the model and data. There is no initial assumption about the shape of the hazard. In a Cox model with birth as time = 0, you would just have a flat survival curve (zero hazard) for those initial decades, and the associations of covariates with outcome will only be evaluated after events started happening. That might be more realistic than trying to impose a survival-curve shape that dates back to birth with a parametric model.

*It's possible that properly handling the left truncation of age would improve parametric modeling in your situation with birth at time = 0. Also, you might not need to resort to parametric modeling to handle left censoring; the icenReg package in R evidently can handle semi-parametric survival modeling with left censoring (although I have no experience with it). The package author has been known to visit this site and might be able to provide further help on associated statistical issues.

I should have guessed that having all cases far from birth is restrictive for parametric survival, especially with left truncation. Even if I observed the whole population, as in the simulation, I would exclude people for whom the event had already happened before the study. Fitting a curve to no events in the first n years and many events later on would result in unrealistic hazards (a bad fit) at any age. The study time-scale seems easier, with the flexibility of age * other-factor dependencies, but for health/ageing diseases it is natural to switch to age. I'll check icenReg. — DianaS, Jun 30 '21 at 21:25

score 1 · Answer 2 · answered Jul 02 '21 at 11:32

The first question was kindly and very precisely answered by @EdM, I am posting here the summary and final findings.

Parametric survival models can give very different results from Cox and other semi-parametric models with left-truncated data (when observation runs not from t = 0 but from t_start >> 0 until some t_end). This can also happen in a "normal" study if the time-scale is changed to participants' age (as opposed to study time): if all participants are older than some X (which is always the case) and time = 0 is aligned with birth (age = 0), a parametric model has to estimate a baseline hazard that is 0 for everyone before X and whatever the data say for age > X, so the overall fit may be poor, and so may the estimated hazard ratios.

-> Parametric models should be carefully checked with left-truncated data (i.e. when observation starts from t_start >>0). In the example below I show that if one splits the time and fits parametric baseline survival separately, estimates get close to the Cox ones.
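
The splitting idea can be sketched on its own with made-up data. This is not the exact code from the example further down: it is an illustration that assumes Weibull event times and four arbitrary cut points, and it uses the per-interval status that survSplit() returns.

```r
library(survival)

set.seed(7)
n <- 3000
x <- rbinom(n, 1, 0.5)

# Weibull event times with shape 2: proportional hazards with log HR = 0.7
t_event <- rweibull(n, shape = 2, scale = 8 * exp(-0.7 * x / 2))
d <- data.frame(x = x,
                time   = pmin(t_event, 10),
                status = as.integer(t_event <= 10))

# One row per person and interval; status is 1 only in the interval holding the event
d_split <- survSplit(Surv(time, status) ~ x, data = d,
                     cut = c(2, 4, 6, 8), episode = "interval")

# Piecewise-exponential model: a separate baseline rate for each interval
fit_pwe <- glm(status ~ x + factor(interval) + offset(log(time - tstart)),
               family = poisson, data = d_split)
fit_cox <- coxph(Surv(time, status) ~ x, data = d)

c(piecewise = unname(coef(fit_pwe)["x"]), cox = unname(coef(fit_cox)["x"]))
```

The factor(interval) term is what lets the baseline hazard differ between intervals; without it the split data give back the single-rate exponential model.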

Left censoring (when you want the model to account for cases whose events happened before t_start, i.e. T_event < t_start) is more difficult. In studies with time-varying covariates whose values before the event are unknown, including such observations is wrong and will bias the results. However, if all covariates stay the same throughout participants' lives (gender, genotypes, family history of disease, etc.), these observations can be included. Very few models actually support this, though. In the survival package only the parametric models do, and the issues mentioned above will be present. Cox does not support it, and neither do PAMM or penCoxFrail, so one has to look further.
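
For the time-constant case, a minimal sketch of how left-censored (prevalent) cases can be coded for survreg() with type = "interval2". The data are made up, and the names lower, upper, onset and entry are hypothetical.

```r
library(survival)

set.seed(11)
n     <- 2000
x     <- rbinom(n, 1, 0.5)        # time-constant covariate (e.g. a genotype)
onset <- rweibull(n, shape = 4, scale = 60 * exp(-0.3 * x))   # age at onset
entry <- runif(n, 40, 60)         # age at study entry
end   <- entry + 10               # age at end of follow-up

# interval2 coding:
#   (NA, a)  onset before entry age a   -> left-censored (prevalent case)
#   (t,  t)  onset observed at age t    -> exact
#   (c,  NA) no onset by age c          -> right-censored
lower <- ifelse(onset <= entry, NA, ifelse(onset <= end, onset, end))
upper <- ifelse(onset <= entry, entry, ifelse(onset <= end, onset, NA))
d <- data.frame(x, lower, upper)

fit_lc <- survreg(Surv(lower, upper, type = "interval2") ~ x,
                  data = d, dist = "weibull")
coef(fit_lc)   # coefficients on the log-time scale; simulated value for x is -0.3
```

Here everyone is enrolled, prevalent or not, and x is the same at every age, so the left-censored term uses covariate information that really did apply before the event.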
Example of the code with various combinations of hazard ratios (for sample without left-censoring): I post here for the benefit of other researchers. There is Cox in normal and age-aligned time-scales, then transformation into Poisson regression with and without splitting the time into few intervals in both scales (https://cran.r-project.org/web/packages/survival/vignettes/approximate.pdf). Then, there is lognormal and exponential parametric survival and estimates from PAMM package (https://rdrr.io/cran/pammtools/#vignettes).

To note: Poisson without a time split is essentially an exponential model (so the results are the same up to sign, because survreg reports coefficients on the log-time scale; hence paramExp_age = -poissonAge in the table); Cox gives practically the same estimates on both time-scales; PAM models are piece-wise exponential models, but with many more time splits (created automatically by their function). So the only models with skewed estimates are the purely parametric ones on the age time-scale (paramLogN_age and paramExp_age = poissonAge). Once I allowed younger and older ages to have different baseline hazards, the estimates are fine (pam1_age, Coxage, poissonAgeSplits, pam2_age).
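
The Poisson/exponential equivalence can be checked on its own with a small sketch. The data are made up for illustration and are not the simulated data used for the table.

```r
library(survival)

set.seed(3)
n <- 1000
x <- rbinom(n, 1, 0.5)
t_event <- rexp(n, rate = 0.1 * exp(0.5 * x))
d <- data.frame(x = x,
                time   = pmin(t_event, 10),       # administrative censoring at 10
                status = as.integer(t_event <= 10))

# Poisson regression on the event indicator with log follow-up time as offset
fit_pois <- glm(status ~ x + offset(log(time)), family = poisson, data = d)

# Exponential survival regression (coefficients are on the log-time scale)
fit_exp <- survreg(Surv(time, status) ~ x, data = d, dist = "exponential")

rbind(poisson = coef(fit_pois), minus_survreg = -coef(fit_exp))
```

Both fits maximise the same likelihood, so the two rows agree to numerical precision once the sign is flipped.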

results
            Cox     Coxage    poisson poissonAge poissonSplit poissonAgeSplits  paramLogN
bmi_    0.9733043  1.0131712  0.9914530  0.4511458    0.8359002        0.8993826 -1.0355784
hyp_0   0.9214284  1.0277725  0.9465455  0.3165312    0.6854277        0.8979358 -1.0035910
gender -0.1259785 -0.1367177 -0.1264453 -0.1057727   -0.1275595       -0.1815141  0.2051099
age_    0.3868312         NA  0.3930111  0.0000000    0.3664537        0.0000000 -0.4132584
       paramLogN_age paramLogN_interval paramExp_age       pam1   pam1_age       pam2   pam2_age
bmi_     -0.17210121         -1.0355784   -0.4511458  0.9764120  1.0063108  0.9740870  1.0171761
hyp_0    -0.14796025         -1.0035910   -0.3165312  0.9280788  1.0154508  0.9243951  0.9834993
gender    0.03263244          0.2051099    0.1057727 -0.1271054 -0.1365939 -0.1257701 -0.1308872
age_              NA         -0.4132584           NA  0.3880939         NA  0.3860855         NA
##CODE###
library(survival)
library(dplyr)      # for %>%
library(pammtools)  # as_ped(), pamm()
set.seed(100)
df = simulatedata(N=1000)
df[1:25,c("time_0", "time_1","age_start", "age_censored", "hf_age", "event","bmi_0", "hyp_0", "gender")]
df$age_interval  = df$age_censored
df[df$event==2, "age_interval"] = df[df$event==2, "age_start"]
df$time_interval  = df$time_1
df[df$event==2, "time_interval"] = df[df$event==2, "time_0"]
df$length = df$time_1 - df$time_0
dfright = df[df$event!=2,]
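S = Surv(dfright$time_1, dfright$event)  # study-time response, reused by survfit() and survSplit() below
Coxm <- coxph(S ~ bmi_+ hyp_0 + gender + age_, dfright, ties='breslow')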
Coxm_age <- coxph(Surv(age_start, age_censored, event)~bmi_ +hyp_0 + gender, data = dfright)
Poisson1 = glm(formula = event ~ bmi_ +  hyp_0 + gender + age_ + offset(log(time_1 -time_0)), family = poisson, data = dfright)
Poisson1_age = glm(event ~ bmi_ + hyp_0 + gender + offset(log(age_censored)), family=poisson, data=dfright)
paramLogN_interval = survreg(Surv(time_interval, rep(NA, dim(dfright)[1]), event, type = "interval")~bmi_+hyp_0+gender + age_, data = dfright, dist = "lognormal")
paramLogN = survreg(Surv(time_1, event)~bmi_+hyp_0 + gender + age_, data = dfright, dist = "lognormal")
paramLogN_age = survreg(Surv(age_interval, rep(NA, dim(dfright)[1]), event, type = "interval")~ bmi_+hyp_0+gender, data = dfright, dist = "lognormal")
paramEXP_age = survreg(Surv(age_interval, rep(NA, dim(dfright)[1]), event, type = "interval")~ bmi_+hyp_0+gender, data = dfright, dist = "exponential")
#################### Using Poisson transformation #############################
# as in here https://cran.r-project.org/web/packages/survival/vignettes/approximate.pdf
#looking at where to split
plot(survfit(S~1), fun = "cumhaz")
lines(c(0, 4, 7, 10), c(0, .57, 0.9, 1.08), col=2, lwd=2)
#splitting 
kdata2 <- survSplit(S ~., data=dfright, cut=c(4, 7),episode="interval")
kdata2$length0 = kdata2$S[,2] - kdata2$S[,1]
#same for age-scale
Sage = Surv(dfright$age_start, dfright$age_censored, dfright$event)
plot(survfit(Sage~1), fun = "cumhaz")
lines(c(0, 20, 60, 80, 100), c(0, 0.5, 5, 7.7, 11), col=2, lwd=2)
kdata2age <- survSplit(Sage ~., data=dfright, cut=c(0,20,60,80),episode="interval")
kdata2age$length0 = kdata2age$Sage[,2] - kdata2age$Sage[,1]
#estimating params using gpm/poisson
PoissonSplit = glm(event ~ bmi_ + hyp_0 + gender +  age_ +offset(log(length0)), family=poisson, data=kdata2)
PoissonAgeSplit = glm(event ~ bmi_ + hyp_0 + gender + offset(log(length0)), family=poisson, data=kdata2age)
#################### Using PAMM  https://www.youtube.com/watch?v=ZvQH0lBDwWc #############################
ped = dfright %>% as_ped(Surv(time_1, event) ~.)
ped_age = dfright %>% as_ped(Surv(age_start,  age_censored, event) ~.)
pam1 = mgcv::gam(formula = ped_status ~ s(tend) +bmi_+ hyp_0 + gender + age_, data = ped, family = poisson())
pam2 = pamm(formula = ped_status ~ s(tend)+ bmi_ + hyp_0+ gender + age_, data = ped )
pam1_age = mgcv::gam(formula = ped_status ~ s(tend) +bmi_+ hyp_0 + gender , data = ped_age, family = poisson())
pam2_age = pamm(formula = ped_status ~ s(tend)+ bmi_ + hyp_0+ gender , data = ped_age )
summary(pam1_age)
summary(pam2_age)
#############results ###################
results = cbind(Cox= summary(Coxm)$coefficients[c("bmi_", "hyp_0", "gender","age_"),1],
            Coxage =Coxm_age$coefficients[c("bmi_", "hyp_0", "gender", "age_")],
                poisson = summary(Poisson1)$coefficients[2:5, 1],
            poissonAge = rbind(summary(Poisson1_age)$coefficients[c("bmi_", "hyp_0", "gender"), ], 0)[,1],
                poissonSplit = summary(PoissonSplit)$coefficients[2:5, 1],
            poissonAgeSplits = rbind(summary(PoissonAgeSplit)$coefficients[c("bmi_", "hyp_0", "gender"), ],0)[,1],
            paramLogN = summary(paramLogN)$coefficients[c("bmi_", "hyp_0", "gender","age_")],
        paramLogN_age= summary(paramLogN_age)$coefficients[c("bmi_", "hyp_0", "gender","age_")],
            paramLogN_interval = summary(paramLogN_interval)$coefficients[c("bmi_", "hyp_0", "gender","age_")],
        paramExp_age = summary(paramEXP_age)$coefficients[c("bmi_", "hyp_0", "gender","age_")] )

results = cbind(results,
                pam1 = summary(pam1)$p.coeff[2:5],
            pam1_age = summary(pam1_age)$p.coeff[2:5],
                pam2 = summary(pam2)$p.coeff[2:5],
            pam2_age = summary(pam2_age)$p.coeff[2:5])
results
