
I wish to create a toy survival (time-to-event) dataset that is right-censored and follows some distribution with proportional hazards and a constant baseline hazard.

I created the data as follows, but I am unable to obtain estimated hazard ratios that are close to the true values after fitting a Cox proportional hazards model to the simulated data.

What did I do wrong?

R code:

library(survival)

#set parameters
set.seed(1234)

n = 40000 #sample size


#functional relationship

lambda=0.000020 #constant baseline hazard 2 per 100000 per 1 unit time

b_haz <-function(t) #baseline hazard
  {
    lambda #constant hazard wrt time 
  }

x = cbind(hba1c=rnorm(n,2,.5)-2,age=rnorm(n,40,5)-40,duration=rnorm(n,10,2)-10)

B = c(1.1,1.2,1.3) # hazard ratios (model coefficients)

hist(x %*% B) #distribution of scores

haz <-function(t) #hazard function
{
  b_haz(t) * exp(x %*% B)
}

c_hf <-function(t) #cumulative hazards function
{
  exp(x %*% B) * lambda * t 
}

S <- function(t) #survival function
{
  exp(-c_hf(t))
}

S(.005)
S(1)
S(5)

#simulate censoring

time = rnorm(n,10,2)

S_prob = S(time)

#simulate events

event = ifelse(runif(1)>S_prob,1,0)

#model fit

km = survfit(Surv(time,event)~1,data=data.frame(x))

plot(km) #kaplan-meier plot

#Cox PH model

fit = coxph(Surv(time,event)~ hba1c+age+duration, data=data.frame(x))

summary(fit)            

cox.zph(fit)

Results:

Call:
coxph(formula = Surv(time, event) ~ hba1c + age + duration, data = data.frame(x))

  n= 40000, number of events= 3043 

             coef exp(coef) se(coef)     z Pr(>|z|)    
hba1c    0.236479  1.266780 0.035612  6.64 3.13e-11 ***
age      0.351304  1.420919 0.003792 92.63  < 2e-16 ***
duration 0.356629  1.428506 0.008952 39.84  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

         exp(coef) exp(-coef) lower .95 upper .95
hba1c        1.267     0.7894     1.181     1.358
age          1.421     0.7038     1.410     1.432
duration     1.429     0.7000     1.404     1.454

Concordance= 0.964  (se = 0.006 )
Rsquare= 0.239   (max possible= 0.767 )
Likelihood ratio test= 10926  on 3 df,   p=0
Wald test            = 10568  on 3 df,   p=0
Score (logrank) test = 11041  on 3 df,   p=0

but the true values were set as

B = c(1.1,1.2,1.3) # hazard ratios (model coefficients)
  • For your task, a quick start is to use an existing simulation package: https://cran.r-project.org/web/packages/survsim/index.html – zhanxw Jan 18 '17 at 07:48

3 Answers


It is not clear to me how you generate your event times (which, in your case, might be $<0$) and event indicators:

time = rnorm(n,10,2) 
S_prob = S(time)
event = ifelse(runif(1)>S_prob,1,0)

So here is a generic method, followed by some R code.


Generating survival times to simulate Cox proportional hazards models

To generate event times from the proportional hazards model, we can use the inverse probability method (Bender et al., 2005): if $V$ is uniform on $(0, 1)$ and if $S(\cdot \,|\, \mathbf{x})$ is the conditional survival function derived from the proportional hazards model, i.e. $$ S(t \,|\, \mathbf{x}) = \exp \left( -H_0(t) \exp(\mathbf{x}^\prime \mathbf{\beta}) \vphantom{\Big(} \right) $$ then the random variable $$ T = S^{-1}(V \,|\, \mathbf{x}) = H_0^{-1} \left( - \frac{\log(V)}{\exp(\mathbf{x}^\prime \mathbf{\beta})} \right) $$ has survival function $S(\cdot \,|\, \mathbf{x})$. This result is known as the inverse probability integral transformation. Therefore, to generate a survival time $T \sim S(\cdot \,|\, \mathbf{x})$ given the covariate vector $\mathbf{x}$, it suffices to draw $v$ from $V \sim \mathrm{U}(0, 1)$ and to apply the inverse transformation $t = S^{-1}(v \,|\, \mathbf{x})$.
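
For instance, with the constant baseline hazard of the question we have $H_0(t) = \lambda t$, so $H_0^{-1}(u) = u / \lambda$ and the draw is simply an exponential event time. A minimal sketch, reusing the question's n, x, B and lambda:

v <- runif(n)                               # v ~ U(0, 1), one draw per subject
Tlat <- -log(v) / (lambda * exp(x %*% B))   # H0^{-1}( -log(v) / exp(x'B) ) with H0(t) = lambda * t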


Example [Weibull baseline hazard]

Let $h_0(t) = \lambda \rho t^{\rho - 1}$ with shape $\rho > 0$ and scale $\lambda > 0$. Then $H_0(t) = \lambda t^\rho$ and $H^{-1}_0(t) = (\frac{t}{\lambda})^{\frac{1}{\rho}}$. Following the inverse probability method, a realisation of $T \sim S(\cdot \,|\, \mathbf{x})$ is obtained by computing $$ t = \left( - \frac{\log(v)}{\lambda \exp(\mathbf{x}^\prime \mathbf{\beta})} \right)^{\frac{1}{\rho}} $$ with $v$ a uniform variate on $(0, 1)$. Using results on transformations of random variables, one may notice that $T$ has a conditional Weibull distribution (given $\mathbf{x}$) with shape $\rho$ and scale $\lambda \exp(\mathbf{x}^\prime \mathbf{\beta})$.
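
As a side note, R's built-in rweibull() uses a different scale convention: the same distribution has shape $\rho$ and scale $(\lambda \exp(\mathbf{x}^\prime \mathbf{\beta}))^{-1/\rho}$ in rweibull()'s parameterisation. A quick numerical check with illustrative values (not taken from the thread):

set.seed(1)
rho <- 1.5; lambda <- 0.01; xb <- 0.4                                    # illustrative values
v <- runif(1e5)
t1 <- (-log(v) / (lambda * exp(xb)))^(1 / rho)                           # inverse-transform draw
t2 <- rweibull(1e5, shape = rho, scale = (lambda * exp(xb))^(-1 / rho))  # rweibull() draw
quantile(t1, c(.25, .5, .75))                                            # the two sets of quantiles
quantile(t2, c(.25, .5, .75))                                            # should roughly agree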


R code

The following R function generates a data set with a single binary covariate $x$ (e.g. a treatment indicator). The baseline hazard has a Weibull form. Censoring times are randomly drawn from an exponential distribution.

# baseline hazard: Weibull

# N = sample size    
# lambda = scale parameter in h0()
# rho = shape parameter in h0()
# beta = fixed effect parameter
# rateC = rate parameter of the exponential distribution of C

simulWeib <- function(N, lambda, rho, beta, rateC)
{
  # covariate --> N Bernoulli trials
  x <- sample(x=c(0, 1), size=N, replace=TRUE, prob=c(0.5, 0.5))

  # Weibull latent event times
  v <- runif(n=N)
  Tlat <- (- log(v) / (lambda * exp(x * beta)))^(1 / rho)

  # censoring times
  C <- rexp(n=N, rate=rateC)

  # follow-up times and event indicators
  time <- pmin(Tlat, C)
  status <- as.numeric(Tlat <= C)

  # data set
  data.frame(id=1:N,
             time=time,
             status=status,
             x=x)
}

Test

Here is a quick simulation with $\beta = -0.6$:

set.seed(1234)
betaHat <- rep(NA, 1e3)
for(k in 1:1e3)
{
  dat <- simulWeib(N=100, lambda=0.01, rho=1, beta=-0.6, rateC=0.001)
  fit <- coxph(Surv(time, status) ~ x, data=dat)
  betaHat[k] <- fit$coef
}

> mean(betaHat)
[1] -0.6085473
ocram
  • Thank you for your excellent answer. I realized I had messed up the event times by getting events status after I randomized event times, which didn't make sense.. silly me! – stats_newb Jan 27 '15 at 09:12
  • May I ask whether there is any specific reason why you draw censoring times from an exponential distribution? – pthao May 16 '16 at 05:06
  • @pthao: there is no particular reason (this was just an illustration where I used the exponential distribution) – ocram May 16 '16 at 05:43
  • Is there any guideline for choosing the distribution for censoring times? – pthao May 16 '16 at 06:51
  • @ocram Interestingly, when I run flexsurvreg(Surv(time, status) ~ x, data=dat, dist = "weibull") on the same simulated data, the coefficient appears as 0.6212. Why is this? – neither-nor Jun 04 '19 at 23:50
  • The question you have to ask yourself is: What is the parametrisation used in flexsurvreg? – ocram Jun 05 '19 at 05:27
  • @ocram Thanks for answering my novice question. Running flexsurvreg on dat does indeed return very different shape and scale parameter values (5.2 and 2.5, respectively). However, my objective is to parametrize lambda and rho in simulWeib. If flexsurvreg predicts a contradictory effect of x, then how else can I choose the values for lambda and rho? – neither-nor Jun 07 '19 at 06:55
  • @ocram, Is there a way to extract the lambda from the quick simulation as you did with the $\beta$. I am simulating data with different event rates and will want to check if the final data perform well. Thanks – user1916067 Apr 30 '21 at 20:58
  • @user1916067: I fit a semi-parametric Cox model. Thus, lambda is not part of the model. If you want to perform some simulations about lambda, you should go with a parametric model. – ocram May 03 '21 at 06:10
  • Hi. I wonder how you would set different censoring proportions if you want to consider multiple cases where the censoring proportion differs? – user10386405 Oct 23 '21 at 13:11
  • In the example above, you can change the rateC parameter (see the sketch below). – ocram Oct 24 '21 at 16:59
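
Following up on the last comment, a quick way to check the realised censoring proportion for a given rateC, using the simulWeib() function above with illustrative parameter values:

dat <- simulWeib(N = 10000, lambda = 0.01, rho = 1, beta = -0.6, rateC = 0.01)
mean(dat$status == 0)   # proportion of censored observations; increase rateC for more censoring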

Based on @ocram's answer, the toy data can also be created with the built-in R function rweibull(), using a rescaled scale parameter $\lambda^\prime(\mathbf{\beta})$ where $$ \lambda^\prime = \frac{\lambda}{\exp\left(\frac{\mathbf{x}^\prime \mathbf{\beta}}{\rho}\right)}. $$ Here $\lambda$ denotes the Wikipedia/R scale parameter of the Weibull distribution, which equals $\lambda^{-1/\rho}$ in the parameterisation of @ocram's answer (lambda_wiki in the code below). The reason is that $$ S(t \,|\, \mathbf{x}) = \exp \left( -H_0(t) \exp(\mathbf{x}^\prime \mathbf{\beta}) \right) $$ and the survival function (one minus the CDF) of the Weibull distribution is $$ S(t) = \exp \left( -(t/\lambda)^\rho \right), $$ so $$ S(t \,|\, \mathbf{x},\lambda) = \exp\left(-(t/\lambda)^\rho\exp(\mathbf{x}^\prime \mathbf{\beta})\right) = \exp\left(-(t/\lambda)^\rho\exp(\mathbf{x}^\prime \mathbf{\beta}/\rho)^\rho\right) = \exp\left(-\left(\frac{t}{\lambda/\exp(\mathbf{x}^\prime \mathbf{\beta}/\rho)}\right)^\rho\right) = S(t \,|\, \mathbf{x},\lambda^\prime). $$

R code with alternative draw for event times

simulWeib <- function(N, lambda, rho, beta, rateC)
{
  # covariate --> N Bernoulli trials
  x <- sample(x=c(0, 1), size=N, replace=TRUE, prob=c(0.5, 0.5))

  # Weibull latent event times (inverse-transform draw, as in @ocram's answer)
  v <- runif(n=N)
  Tlat <- (- log(v) / (lambda * exp(x * beta)))^(1 / rho)

  # an alternative draw for the event times (this replaces the Tlat drawn above)
  lambda_wiki <- lambda^(-1 / rho)                   # change definition of lambda to Wikipedia's
  lambda_prime <- lambda_wiki / exp(x * beta / rho)  # re-scale according to beta
  Tlat <- rweibull(length(x), shape=rho, scale=lambda_prime)

  # censoring times
  C <- rexp(n=N, rate=rateC)

  # follow-up times and event indicators
  time <- pmin(Tlat, C)
  status <- as.numeric(Tlat <= C)

  # data set
  data.frame(id=1:N, time=time, status=status, x=x)
}

set.seed(1234)
betaHat <- rep(NA, 1e3)
for(k in 1:1e3)
{
  dat <- simulWeib(N=100, lambda=0.01, rho=1, beta=-0.6, rateC=0.001)
  fit <- coxph(Surv(time, status) ~ x, data=dat)
  betaHat[k] <- fit$coef
}

> mean(betaHat)
[1] -0.6085473


For the Weibull distribution parameterised as $$ S(t) = \exp\left( -\left( \lambda\, e^{\mathbf{x}^\prime \mathbf{\beta}}\, t \right)^\rho \right), $$ the exponent $1/\rho$ applies only to $-\log(v)$, not to $\lambda \exp(\mathbf{x}^\prime \mathbf{\beta})$,

so I modified the draw like this:

Tlat <- (- log(v))^(1 / rho) / (lambda * exp(x * beta))

If $\rho = 1$, the result will be the same.
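
A quick sketch with illustrative values (not from the thread) showing that the two draws coincide when $\rho = 1$:

set.seed(1)
rho <- 1; lambda <- 0.01; beta <- -0.6; x <- 1     # illustrative values
v <- runif(5)
(-log(v) / (lambda * exp(x * beta)))^(1 / rho)     # draw from @ocram's answer
(-log(v))^(1 / rho) / (lambda * exp(x * beta))     # modified draw; identical when rho = 1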

unk