How to introduce uncertainty in fitting the original data when simulating survival curves?

Question

The code below simulates the uncertainty in the lognormal distribution parameters, using the MASS::mvrnorm() function and the lung dataset from the survival package. Although the parametric distribution providing the best fit is Weibull, for illustrative purposes I'm using the lognormal distribution.

When running the code, per the image at the bottom of this post, the solid green line shows the Kaplan-Meier curve (probabilities) of the lung data, the dashed-green lines the confidence interval surrounding the K-M probabilities, the red line the fitted survival curve for lung data using the lognormal distribution, and the dashed-blue lines show 5 simulation runs.

My question is, how could I introduce the inherent uncertainty in fitting the original data when running the simulation? In addition to the uncertainty currently simulated of the lognormal parameters. Note in the image the width of the 95% confidence intervals around the K-M curve. It seems that the simulation runs (dashed blue lines) should at least be as wide around the fitted lognormal survival curve (red line) as the 95% CI lines around the K-M curve.

Code:

library(MASS)
library(survival)
fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "lognormal")
time <- seq(0, 1000, by = 1)
meanlog <- fit$coef  # mean on the log scale
sdlog <- fit$scale  # standard deviation on the log scale
var_cov <- vcov(fit) #  extract the variance-covariance matrix
Compute the lognormal survival function
survival <- 1 - plnorm(time, meanlog = meanlog, sdlog = sdlog)
num_simulations <- 5
Generate random lognormal parameter estimates for simulations
sim_params <- MASS::mvrnorm(num_simulations, mu = c(meanlog, sdlog), Sigma = var_cov)
Compute the survival curves for each simulation
sim_curves <- sapply(1:num_simulations, function(i) 1 - plnorm(time, meanlog = sim_params[i, 1], sdlog = sim_params[i, 2]))
Compute the Kaplan-Meier survival curve for the lung dataset
lung_surv <- survfit(Surv(time, status) ~ 1, data = lung)
Plot the lognormal survival curve, simulation lines, and Kaplan-Meier plot
plot(time, survival, type = "l", xlab = "Time", ylab = "Survival Probability", 
     main = "Lognormal Survival Curve of Lung Dataset", col = "red", lwd = 2)
lapply(1:num_simulations, function(i) lines(time, sim_curves[, i], col = "blue", lty = "dashed"))
lines(lung_surv, col = "green")
Store the coordinates of the simulation lines
sim_lines <- lapply(1:num_simulations, function(i) {
  curve <- sim_curves[, i]
  lines(time, curve, col = "blue", lty = "dashed")
  return(data.frame(time = time, survival = curve))
})

Output of the above code:

By inherent uncertainty I mean the errors in fitting the lognormal distribution to the survival data, just as there are errors in fitting the K-M curve to the survival data. — Village.Idyot, May 10 '23 at 09:31
"It seems that the simulation runs (dashed blue lines) should at least be as wide around the fitted lognormal survival curve (red line) as the 95% CI lines around the K-M curve." -- I don't see why this should follow; the parametric assumption is a much stronger assumption,it allows a lot of data to contribute to each parameter estimate, since you have two parameters to estimate from a lot of points with the parametric fit, instead of a similar number of parameters as points with the nonparametric model. I'd expect the lognormal to have smaller intervals, typically. — Glen_b, May 10 '23 at 09:36
That said, there are two sources of uncertainty; there's (i) the uncertainty in parameter estimates (which will be asymptotically bivariate normal if they're MLEs, for example, so in fits based on large samples you might use that as an approximation for the difference between the original true parameter and the re-estimate, and so back out 'new' parameters to simulate with from that -- a form of parametric bootstrap approach), and (ii) the process uncertainty in the data generated from the resulting lognormal. — Glen_b, May 10 '23 at 09:40
OK thanks, I'll post an attempted answer (a parametric bootstrap approach) and perhaps the community will let me know whether I am FOS or not. — Village.Idyot, May 10 '23 at 09:50
@Village.Idyot maybe you could explain your problem by explaining what your goal is with these simulations. "the errors on fitting", what do you want to do with them? — Sextus Empiricus, May 10 '23 at 12:07
Goal is to perform a form of "time series forecasting simulation" but using survival analysis instead of traditional time series methods (ANOVA etc.). As an example, suppose we only had 500 periods of lung data - conservatively estimate potential simulation paths for remaining periods 501-1000 including extreme outliers. See https://stats.stackexchange.com/questions/614279/use-survival-analysis-to-predict-survival-probabilities-for-future-periods-not-c and https://stats.stackexchange.com/questions/614198/how-to-generate-multiple-forecast-simulation-paths-for-survival-analysis — Village.Idyot, May 11 '23 at 04:58
@Village.Idyot "estimate potential simulation paths" I don't know what that means and why one should do this. And also it doesn't seem like the ultimate goal; If I look up the information about the lung data, then I only see times, but not paths. Do you want to estimate paths or is it about something else that you want to do with these estimated paths? — Sextus Empiricus, May 11 '23 at 07:40

Sextus Empiricus · Accepted Answer · 2023-05-14T20:31:14.130

If you have some parametric function $F$ for a survival curve that predicts the time $T$ of some event

$$P(T\leq t | \theta_1,\theta_2) = F(t;\theta_1,\theta_2)$$

and the parameters themselves are random as well

$$\boldsymbol{\theta} \sim MVN(\boldsymbol{\mu}, \boldsymbol{\sigma})$$

then you can express the probability by integrating over all cases:

$$P(T\leq t) = \iint_{\forall \theta_1,\theta_2} P(T\leq t | \theta_1,\theta_2) f(\theta_1,\theta_2) d\theta_1 d\theta_2$$

Possibly this might be evaluated analytically or approximated. Your question seems to use the approach of simulations. In that case the result is the average of your simulated curves.

Computational example:

Let the waiting time for the event be exponential distributed

$$T \sim Exp(\lambda)$$

with a variable rate $\lambda$

$$\lambda \sim N(1,0.04)$$

The the survival curve can be computed as the average

$$S(t) = E[exp(-\lambda t)]$$

and this follows a log normal distribution (approximately because the case here is truncated at zero) with mean parameter $-\mu t$ and scale parameter $\sigma t$ thus we have

$$S(t) \approx exp(-\mu t + 0.5 \sigma^2 t^2)$$

(the formula brakes down for large $\sigma$ or $t$ when the approximation of the truncated distribution with a non-truncated distribution fails).

The simulation below shows that this approximation can work

set.seed(1)
generate data from exponential distribution
with variable rate
n = 10^5
lambda = rnorm(n,1,0.2)
t = rexp(n,lambda)
order data for plotting as
emperical survival curve
t = t[order(t)]
p = c(1:n)/n
plotting
plot(t, 1-p, ylab = "P(T<=t)", main = "emperical survival curve \n t ~ exp(lambda)\n lambda ~ N(1,0.04)", type = "l", log = "y")
compare two models
lines(t,exp(-t+0.2^2/2t^2), col = 4, lty = 2)
lines(t,exp(-(1)t), col = 2, lty = 2)

Village.Idyot · Answer 2 · 2023-05-31T05:29:28.070

In the code below is a solution that follows the characterization of lognormal survival $logT ∼ α + σW$ per resource https://grodri.github.io/survival/ParametricSurvival.pdf. Also see the plot beneath which illustrates 1000 simulations. Post How to simulate variability (errors) in fitting a gamma model to survival data by using a generalized minimum extreme value distribution in R? also has a discussion on randomizing values for $W$, $α$, and $σ$.

Code:

library(MASS)
library(survival)
time <- seq(0, 1000, by = 1)
fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "lognormal")
Compute the lognormal survival function using the fitted model
meanlog <- fit$coef  # mean on the log scale
sdlog <- fit$scale  # standard deviation on the log scale
Compute lognormal survival function for the base fitted model
survival <- 1 - plnorm(time, meanlog = meanlog, sdlog = sdlog)
Generate random values for simulations where survival form for lognormal is logT ∼ α + σW
simFX <- function(){
    W <- rnorm(165) # randomize W for model error
    newCoef <- MASS::mvrnorm(1, mu = c(meanlog, sdlog), Sigma = vcov(fit)) # randomize α and σ 
    newTimes <- exp(newCoef[1] + newCoef[2] * W) # apply survival form for lognormal logT ∼ α+ σW
    newFit <- survreg(Surv(newTimes)~1,dist="lognormal")
    params <- c(newFit$coef,newFit$scale)
    return(1 - plnorm(time, meanlog = params[1], sdlog = params[2]))
    }
plot(time,survival,type="n",xlab="Time",ylab="Survival Probability",main="Lung Survival (Lognormal)")
replicate(1000,lines(simFX(), col = "blue", lty = 2)) # run this line to add simulations to plot
lines(survival, type = "l", col = "yellow", lwd = 3) # plot base fitted survival curve

How to introduce uncertainty in fitting the original data when simulating survival curves?

Compute the lognormal survival function

Generate random lognormal parameter estimates for simulations

Compute the survival curves for each simulation

Compute the Kaplan-Meier survival curve for the lung dataset

Plot the lognormal survival curve, simulation lines, and Kaplan-Meier plot

Store the coordinates of the simulation lines

2 Answers2

generate data from exponential distribution

with variable rate

order data for plotting as

emperical survival curve

plotting

compare two models

Compute the lognormal survival function using the fitted model

Compute lognormal survival function for the base fitted model

Generate random values for simulations where survival form for lognormal is logT ∼ α + σW

Linked