Is this an adequate example for simulation and modeling uncertainty in survival and hazard curve estimates?

Question

I'm studying simulation and in particular uncertainty in survival curve estimates, by drawing from a distribution of parameters (for the Weibull distribution), using the variance-covariance matrix from the MASS package. In response to post How to generate simulation paths for Weibull distribution in R?, I've come up with the following example. My question is, is this an adequate example for simulating the Weibull distribution and in deriving survival and hazard curve estimates?

I show the output plots below and the code at the bottom of this post.

Code:

library(survival)
library(MASS)
Simulate survival data
n <- 100
t <- rweibull(n, shape = 2, scale = 3)  # observed survival times for each individual in the study
cens <- rbinom(n, size = 1, prob = 0.2) # 1 = survival time right-censored (still alive/lost; 0 = event-censored
x1 <- rnorm(n)                          # x1 & x2 are hypothetical covariates (predictors) for individuals in the study
x2 <- rnorm(n)                          # see above
data <- data.frame(t, cens, x1, x2)
Fit Cox PH model
model <- coxph(Surv(t, cens) ~ x1 + x2, data = data)
Generate random samples from multivariate normal distribution
beta <- coef(model)
vcov <- vcov(model)
samples <- mvrnorm(1000, mu = beta, Sigma = vcov)
Generate new survival times; new_time = simulated new survival times
data$new_time <- NA
for (i in 1:nrow(data)) {
  shape <- exp(samples[i, "x1"])
  scale <- exp(samples[i, "x2"])
  data$new_time[i] <- qweibull(runif(1), shape = shape, scale = scale)
}
Generate survival curves
fit <- survfit(Surv(new_time, cens) ~ 1, data = data)
sfit <- summary(fit)
tseq <- seq(min(sfit$time), max(sfit$time), length = 100)
surv <- survfit(Surv(new_time, cens) ~ 1, data = data.frame(new_time = tseq))
haz <- -diff(log(surv$surv))/diff(surv$time)
Plot survival and hazard curves, and histogram of simulated survival times
par(mfrow=c(1,3))
plot(surv, xlab = "Time", ylab = "Survival Probability", main = "Survival Function")
plot(surv$time[-1], haz, type = "l",xlab = "Time", ylab = "Hazard Function", main = "Hazard Function")
hist(data$new_time, breaks = 20, xlab = "Survival Time", ylab = "Frequency", main = "Simulated Survival Times")
Calculate average hazard rate per time period
haz_by_time <- split(haz, factor(floor(surv$time[-1])))
mean_haz <- unlist(lapply(haz_by_time, function(x) mean(x, na.rm = TRUE)))

score 1 · Accepted Answer · answered Apr 23 '23 at 16:53

This seems unnecessarily convoluted. As I understand your code, you sample some simulated survival times based on a fixed Weibull model, put the sampled times together for a Cox model, then sample from the joint distribution of the Cox model coefficients to get "new" survival times from another Weibull model. (For that last step, it's not immediately clear that the Cox model coefficients from which you sample are related to the shape and scale of qweibull(),although I might be missing something there.)

If you want to generate simulations to illustrate the vagaries of fitting a parametric Weibull model, cut out the intermediate Cox model. Sample from the joint distribution of shape and scale values from a parametric Weibull fit and show the corresponding variability in Weibull-based survival curves.

If you want to illustrate the vagaries of fitting a Cox model, you would be better off with a resampling approach as the baseline hazard over time is fixed, estimated from the entire data sample. Sampling based on the variance-covariance matrix of Cox-model coefficients thus would not take into account the variability in estimating the baseline hazard. Repeating the Cox modeling on multiple bootstrapped samples of the data and showing the corresponding survival curves would take into account both the variability in coefficient estimates and the variability in the estimates of the baseline hazard.

I reflect the advice in the 2nd paragraph of your answer, stripping out the Cox model from my OP, in the answer I just posted below. — Village.Idyot, Apr 24 '23 at 09:00
Hi EdM, in response to your comment about sampling from the joint distribution of shape and scale values, I attempted to do this with code in another post I just posted, https://stats.stackexchange.com/questions/614381/in-survival-curve-simulation-how-to-correctly-sample-from-the-joint-distributio — Village.Idyot, Apr 28 '23 at 09:46

Village.Idyot · Answer 2 · 2023-04-27T13:02:42.940

In the below code I reflect the advice in the 2nd paragraph of EdM's answer, "to generate simulations to illustrate the vagaries of fitting a parametric Weibull model" by removing the intermediate Cox model shown in the OP.

The code below takes into account the joint distribution of shape and scale values from the parametric Weibull fit by using the mvrnorm() function from the MASS package to simulate from a bivariate normal distribution with mean and covariance matrix equal to the estimates obtained from fitting the Weibull distribution. I use the vcov() function to extract the covariance matrix from the Weibull fit. The code generates new survival times using the simulated shape and scale parameters.

Code:

library(survival)
library(MASS)
Simulate survival data
n <- 100
cens <- rbinom(n, size = 1, prob = 0.2)
x1 <- rnorm(n)
x2 <- rnorm(n)
data <- data.frame(cens, x1, x2)
Estimate parameters of Weibull distribution
t <- rexp(n, rate = 1/5)
fit <- fitdistr(t, "weibull")
shape <- fit$estimate[1]
scale <- fit$estimate[2]
Generate new survival times based on estimated parameters
mean_est <- c(shape, scale)
cov_est <- vcov(fit)
new_params <- mvrnorm(n = n, mu = mean_est, Sigma = cov_est)
new_t <- rweibull(n = n, shape = new_params[,1], scale = new_params[,2])
Fit survival curve
fit <- survfit(Surv(new_t, cens) ~ 1, data = data)
sfit <- summary(fit)
tseq <- seq(min(sfit$time), max(sfit$time), length = 100)
surv <- survfit(Surv(new_t, cens) ~ 1, data = data.frame(new_t = tseq))
haz <- -diff(log(surv$surv))/diff(surv$time)
Plot survival and hazard curves, and histogram of simulated survival times
par(mfrow=c(1,3))
plot(surv, xlab = "Time", ylab = "Survival Probability", main = "Survival Function")
plot(surv$time[-1], haz, type = "l",xlab = "Time", ylab = "Hazard Function", main = "Hazard Function")
hist(new_t, breaks = 20, xlab = "Survival Time", ylab = "Frequency", main = "Simulated Survival Times")

It seems that you might be having some problems with Weibull parameterization. I'll post an answer to this question that should help with the several questions you've posted recently. — EdM, Apr 28 '23 at 13:35
Yes, thank you. I am a student of survival analysis and have quite a ways to go — Village.Idyot, Apr 28 '23 at 13:40
Actually I'll post my answer to this question. After that, to help maintain some efficiency on this site, please review your questions and delete those that aren't really at issue anymore. This site works best if there is just one question per page, but the relatively small differences among your recent questions don't seem so much like truly separate questions. — EdM, Apr 28 '23 at 13:43

Is this an adequate example for simulation and modeling uncertainty in survival and hazard curve estimates?

Simulate survival data

Fit Cox PH model

Generate random samples from multivariate normal distribution

Generate new survival times; new_time = simulated new survival times

Generate survival curves

Plot survival and hazard curves, and histogram of simulated survival times

Calculate average hazard rate per time period

2 Answers2

Simulate survival data

Estimate parameters of Weibull distribution

Generate new survival times based on estimated parameters

Fit survival curve

Plot survival and hazard curves, and histogram of simulated survival times