2

The code below simulates the uncertainty in the lognormal distribution parameters, using the MASS::mvrnorm() function and the lung dataset from the survival package. Although the parametric distribution providing the best fit is Weibull, for illustrative purposes I'm using the lognormal distribution.

When running the code, per the image at the bottom of this post, the solid green line shows the Kaplan-Meier curve (probabilities) of the lung data, the dashed-green lines the confidence interval surrounding the K-M probabilities, the red line the fitted survival curve for lung data using the lognormal distribution, and the dashed-blue lines show 5 simulation runs.

My question is, how could I introduce the inherent uncertainty in fitting the original data when running the simulation? In addition to the uncertainty currently simulated of the lognormal parameters. Note in the image the width of the 95% confidence intervals around the K-M curve. It seems that the simulation runs (dashed blue lines) should at least be as wide around the fitted lognormal survival curve (red line) as the 95% CI lines around the K-M curve.

Code:

library(MASS)
library(survival)

fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "lognormal")

time <- seq(0, 1000, by = 1) meanlog <- fit$coef # mean on the log scale sdlog <- fit$scale # standard deviation on the log scale var_cov <- vcov(fit) # extract the variance-covariance matrix

Compute the lognormal survival function

survival <- 1 - plnorm(time, meanlog = meanlog, sdlog = sdlog)

num_simulations <- 5

Generate random lognormal parameter estimates for simulations

sim_params <- MASS::mvrnorm(num_simulations, mu = c(meanlog, sdlog), Sigma = var_cov)

Compute the survival curves for each simulation

sim_curves <- sapply(1:num_simulations, function(i) 1 - plnorm(time, meanlog = sim_params[i, 1], sdlog = sim_params[i, 2]))

Compute the Kaplan-Meier survival curve for the lung dataset

lung_surv <- survfit(Surv(time, status) ~ 1, data = lung)

Plot the lognormal survival curve, simulation lines, and Kaplan-Meier plot

plot(time, survival, type = "l", xlab = "Time", ylab = "Survival Probability", main = "Lognormal Survival Curve of Lung Dataset", col = "red", lwd = 2) lapply(1:num_simulations, function(i) lines(time, sim_curves[, i], col = "blue", lty = "dashed")) lines(lung_surv, col = "green")

Store the coordinates of the simulation lines

sim_lines <- lapply(1:num_simulations, function(i) { curve <- sim_curves[, i] lines(time, curve, col = "blue", lty = "dashed") return(data.frame(time = time, survival = curve)) })

Output of the above code: enter image description here

  • "inherent uncertainty" what do you mean by this? – Sextus Empiricus May 10 '23 at 09:17
  • By inherent uncertainty I mean the errors in fitting the lognormal distribution to the survival data, just as there are errors in fitting the K-M curve to the survival data. – Village.Idyot May 10 '23 at 09:31
  • 1
    "It seems that the simulation runs (dashed blue lines) should at least be as wide around the fitted lognormal survival curve (red line) as the 95% CI lines around the K-M curve." -- I don't see why this should follow; the parametric assumption is a much stronger assumption,it allows a lot of data to contribute to each parameter estimate, since you have two parameters to estimate from a lot of points with the parametric fit, instead of a similar number of parameters as points with the nonparametric model. I'd expect the lognormal to have smaller intervals, typically. – Glen_b May 10 '23 at 09:36
  • 1
    That said, there are two sources of uncertainty; there's (i) the uncertainty in parameter estimates (which will be asymptotically bivariate normal if they're MLEs, for example, so in fits based on large samples you might use that as an approximation for the difference between the original true parameter and the re-estimate, and so back out 'new' parameters to simulate with from that -- a form of parametric bootstrap approach), and (ii) the process uncertainty in the data generated from the resulting lognormal. – Glen_b May 10 '23 at 09:40
  • OK thanks, I'll post an attempted answer (a parametric bootstrap approach) and perhaps the community will let me know whether I am FOS or not. – Village.Idyot May 10 '23 at 09:50
  • @Village.Idyot maybe you could explain your problem by explaining what your goal is with these simulations. "the errors on fitting", what do you want to do with them? – Sextus Empiricus May 10 '23 at 12:07
  • Goal is to perform a form of "time series forecasting simulation" but using survival analysis instead of traditional time series methods (ANOVA etc.). As an example, suppose we only had 500 periods of lung data - conservatively estimate potential simulation paths for remaining periods 501-1000 including extreme outliers. See https://stats.stackexchange.com/questions/614279/use-survival-analysis-to-predict-survival-probabilities-for-future-periods-not-c and https://stats.stackexchange.com/questions/614198/how-to-generate-multiple-forecast-simulation-paths-for-survival-analysis – Village.Idyot May 11 '23 at 04:58
  • @Village.Idyot "estimate potential simulation paths" I don't know what that means and why one should do this. And also it doesn't seem like the ultimate goal; If I look up the information about the lung data, then I only see times, but not paths. Do you want to estimate paths or is it about something else that you want to do with these estimated paths? – Sextus Empiricus May 11 '23 at 07:40

2 Answers2

3

If you have some parametric function $F$ for a survival curve that predicts the time $T$ of some event

$$P(T\leq t | \theta_1,\theta_2) = F(t;\theta_1,\theta_2)$$

and the parameters themselves are random as well

$$\boldsymbol{\theta} \sim MVN(\boldsymbol{\mu}, \boldsymbol{\sigma})$$

then you can express the probability by integrating over all cases:

$$P(T\leq t) = \iint_{\forall \theta_1,\theta_2} P(T\leq t | \theta_1,\theta_2) f(\theta_1,\theta_2) d\theta_1 d\theta_2$$

Possibly this might be evaluated analytically or approximated. Your question seems to use the approach of simulations. In that case the result is the average of your simulated curves.


Computational example:

Let the waiting time for the event be exponential distributed

$$T \sim Exp(\lambda)$$

with a variable rate $\lambda$

$$\lambda \sim N(1,0.04)$$

The the survival curve can be computed as the average

$$S(t) = E[exp(-\lambda t)]$$

and this follows a log normal distribution (approximately because the case here is truncated at zero) with mean parameter $-\mu t$ and scale parameter $\sigma t$ thus we have

$$S(t) \approx exp(-\mu t + 0.5 \sigma^2 t^2)$$

(the formula brakes down for large $\sigma$ or $t$ when the approximation of the truncated distribution with a non-truncated distribution fails).

The simulation below shows that this approximation can work

example

set.seed(1)

generate data from exponential distribution

with variable rate

n = 10^5 lambda = rnorm(n,1,0.2) t = rexp(n,lambda)

order data for plotting as

emperical survival curve

t = t[order(t)] p = c(1:n)/n

plotting

plot(t, 1-p, ylab = "P(T<=t)", main = "emperical survival curve \n t ~ exp(lambda)\n lambda ~ N(1,0.04)", type = "l", log = "y")

compare two models

lines(t,exp(-t+0.2^2/2t^2), col = 4, lty = 2) lines(t,exp(-(1)t), col = 2, lty = 2)

1

In the code below is a solution that follows the characterization of lognormal survival $logT ∼ α + σW$ per resource https://grodri.github.io/survival/ParametricSurvival.pdf. Also see the plot beneath which illustrates 1000 simulations. Post How to simulate variability (errors) in fitting a gamma model to survival data by using a generalized minimum extreme value distribution in R? also has a discussion on randomizing values for $W$, $α$, and $σ$.

Code:

library(MASS)
library(survival)

time <- seq(0, 1000, by = 1) fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "lognormal")

Compute the lognormal survival function using the fitted model

meanlog <- fit$coef # mean on the log scale sdlog <- fit$scale # standard deviation on the log scale

Compute lognormal survival function for the base fitted model

survival <- 1 - plnorm(time, meanlog = meanlog, sdlog = sdlog)

Generate random values for simulations where survival form for lognormal is logT ∼ α + σW

simFX <- function(){ W <- rnorm(165) # randomize W for model error newCoef <- MASS::mvrnorm(1, mu = c(meanlog, sdlog), Sigma = vcov(fit)) # randomize α and σ newTimes <- exp(newCoef[1] + newCoef[2] * W) # apply survival form for lognormal logT ∼ α+ σW newFit <- survreg(Surv(newTimes)~1,dist="lognormal") params <- c(newFit$coef,newFit$scale) return(1 - plnorm(time, meanlog = params[1], sdlog = params[2])) } plot(time,survival,type="n",xlab="Time",ylab="Survival Probability",main="Lung Survival (Lognormal)") replicate(1000,lines(simFX(), col = "blue", lty = 2)) # run this line to add simulations to plot lines(survival, type = "l", col = "yellow", lwd = 3) # plot base fitted survival curve

enter image description here