5

I have a data with some participants entering the study with a disease of interest already present. For others, I have several points of observation where different factors were measured (such as weight, blood pressure etc.) and if the disease occurred or not. Event = disease onset. The aim is to check hazard ratios.

I wonder how I can include both left and right censored observations with time-varying covariates in R. Ideally, also with time-varying effects and/or random effects, but primarily use both prevalent and incident cases in regression.

Survival package handles interval censoring including left censored, though only for parametric baseline hazards (Survreg, weibull/lognormal etc type=interval2). The problem I see is that prevalent cases would enter maximum likelihood as P(T<age_start|x) with x measured at age_start, which for time-variable x such as weight will be like looking into the future as, for example, described in the survival vignette (https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf). And weight at birth was clearly different and O don’t know any factors before age_start. In contrast, for the right-censored observations maximum likelihood term is P(T>age_censored|X @ age_start), for non-censored P(t is in interval k, where disease was detected| X@ start of this interval), or P(t=age_disease|X@ last measured before age_disease). Am I right that in using this method, for left censored data I imply risk factors were as measured at age_start at the time “before disease”, or 0? In which case I may not conceptually use prevalent cases as I don’t know risks preceding the event (and the model will think I know).

Another confusion is if the time-scale is in years since the study start (and not individual’s age as above), then maximum likelihood terms seem to have information on how long it took (or not) for a person to develop a disease from the time risks were measured, while in age time-scale this information is lost, as it says that a person did not have a disease at 60yo with certain weight, but not how long it was since it was measured. It should not be a problem for Cox regression as it takes the pool of all at-risk at that time, but would it be an issue for a parametric survival function in the way ML is estimated?

# EXAMPLE: 
# event = heart failure (HF), individuals observed for 10 years from age_start
# around 20% already had HF before study started, 
# event =2 for these left-censored individuals (age of HF < age_start); age_censored = age_start
# event =1 if HF was observed (age of HF is known); age_censored = age_hf
# event = 0 if HF not observed (age of HF > age at study end); age_censored = age_start + 10
set.seed(100)
df = simulatedata(N=1000)
df[1:25,c("time_0", "time_1","age_start", "age_censored", "hf_age", "event","bmi_0", "hyp_0", "gender")]
>
id time_0 time_1 age_start age_censored hf_age event bmi_0 hyp_0 gender
1       0    0.0      68.9         68.9    NaN     2  30.9     0      1
2       0    3.2      51.8         55.0   55.0     1  25.7     0      1
3       0    4.3      50.4         54.7   54.7     1  24.6     0      0
4       0   10.0      40.3         50.3    NaN     0  24.1     0      1

######### ONLY INCIDENT CASES (no left censoring) #1) timescale re-aligned with study time df0 = df[df$event!=2,] coxph(Surv(time_1, event)~bmi_0+age_start+hyp_0 + gender, data = df0)

2) timescale re-aligned with study time, parametric survival function.

#2a) survreg(Surv(time_1, event)~bmi_+age_start+hyp_0 + gender, data = df0, dist = "weibull")

2b) Using "interval" type of data entry similar to https://academic.oup.com/aje/article/173/9/1078/122651?login=true

df$time_interval = df$time_1 df[df$event==2, "time_interval"] = df[df$event==2, "time_0"] df0 = df[df$event!=2,] survreg(Surv(time_interval, rep(NA, dim(df0)[1]), event, type = "interval")~bmi_0+hyp_0+age_start+gender, data = df0, dist = "weibull")

3) timescale re-aligned with individual's age

coxph(Surv(age_start, age_censored, event)~bmi_0+hyp_0 + gender, data = df0)

#4) timescale re-aligned with individual's age, parametric baseline survival df$age_interval = df$age_censored df[df$event==2, "age_interval"] = df[df$event==2, "age_start"] df0 = df[df$event!=2,] survreg(Surv(age_interval, rep(NA, dim(df0)[1]), event, type = "interval")~ bmi_0+hyp_0 + gender, data = df0, dist = "weibull")

1), 2a, 2b and 3) all give similar results

while 4) is completely different

-> this relates to my second question on age vs study time-scale for Cox vs parametric estimates

######## WITH LEFT CENSORED DATA

5) timescale with study time won't work as time_interval ==0 makes no sense, at "birth" all are event_free

survreg(Surv(time_interval, rep(NA, dim(df)[1]), event, type = "interval")~bmi_0+hyp_0+age_start+gender, data = df, dist = "lognormal")

6) timescale re-aligned with age - works in parametric, interval models, but give very different estimates for the model params

lsweib = survreg(Surv(age_interval, rep(NA, dim(df)[1]), event, type = "interval")~bmi_0+hyp_0+gender, data = df, dist = "weibull") summary(lsweib )

6) is similar-sh to 4) and different to 1-3)

This is the 1st part of my question: by using

interval censoring with covariates measured at study start, do I # essentially feed the model with BMI and Hypertension as if those # were known before the disease happened, and they were the same as # measured at age_start. This is not a problem for gender or any

time-constant covariate, but for varying this will bias results

#*******************************

code for the simulated data, no particular idea rather than create some dependencies of time with covariates and have cases before and after the study start/end

simulatedata = function (N=1000){ observe_time = 10 df = data.frame(age = round(rnorm(N, 50,15),1), bmi = round(rnorm(N, 26, 2),1), hyp = rbinom(N,1,0.10), gender = rbinom(N,1,0.5)) df$bmi_ = (df$bmi-26)/2 df$age_ = (df$age - 50)/15 df$hf_age = df$age + round((rexp(N, 0.15exp((df$bmi-26)/21 + (df$hyp0.7)+ (df$age-50)/150.4))),1) describe(df$hf_age) df$age_start = df$age+1 df$age_end = df$age_start + observe_time df[df$hf_age <= df$age_start, "event"] = 2 df[df$hf_age > df$age_start + 10, "event"] = 0 df[df$hf_age <= df$age_start + 10 & df$hf_age > df$age_start, "event"] = 1 df$time_0 = 0 df[df$event ==0, "time_1"] = 10 df[df$event ==1, "time_1"] = df[df$event ==1, "hf_age"]-df[df$event ==1, "age_start"] df[df$event ==2, "time_1"] = 0 df$age_censored = df$age_start + df$time_1 df[df$event !=1, "hf_age"] = NaN df$bmi_0 = df$bmi df$hyp_0 = df$hyp return (df) }

DianaS
  • 438
  • 1
    The choice of time = 0 is critical to setting up and interpreting any survival model. As you rightly note, different choices have different implications for modeling. What choice do you want to make? You say "The aim is to check hazard ratios," but hazard ratios and corresponding baseline hazard functions may differ depending on your choice. It's also not completely clear what "event" you are trying to model; is it death or something else? Please provide that information by editing your question, as comments are easy to overlook and can get deleted. – EdM Jun 27 '21 at 17:32
  • Event is onset of a disease, say, heart failure, and for some people I only know that by certain age this already happened, but don't know when exactly. I would like to include both into analysis to estimate how covariates affect the instant rate of disease. But there are two problems 1) using time-scale won't work as S(0) =1, so I should use age-related time-scale. Only parametric survreg handles left censoring and somehow gives different results with the age (while results were similar to Cox in usual the time-scale). – DianaS Jun 28 '21 at 00:26
  • 1
  • Overall seems that including left censored observations with time-varying covariates is flawed, as their values preceding the event are unknown. And also moving to age scale becomes problematic as the model thinks that covariates were measured at birth, which is different for each individual, and not at the same time as it would be in the normal time-scale.
  • – DianaS Jun 28 '21 at 00:27