Correctly simulating an extreme value distribution for survival analysis?

Question

In the image and per the code at the bottom of this post, I plot survival curves for the lung dataset from the survival package using a fitted exponential distribution model (plot red line), using the K-M nonparametric model (plot blue lines), and run/show simulations using the exponential model (plot light blue lines) with the mean of the simulations shown as the black line. Exponential doesn't provide the best fit for lung data but I'm trying to better understand modeling with exponential and extreme value distributions.

The fitted model takes the form $\log(T) \sim β_0 + W$, where $β_0$ is fit$coef and $W$ follows a standard minimum extreme value distribution (so that $T$ is exponential in this case). In the example presented here I only model $W$; in related posts I address the modeling of $β_0$, such as in "How to appropriately model the uncertainty of the exponential distribution model when running survival simulations?". The simulations here are, for a change, from the perspective of individuals, modeled by generating random intercepts for the exponential distribution in the line of code simPaths <- sapply(1:simNbr,function(i) 1-pexp(time,rate=1/rexp(1,rate = 1/exp(fit$icoef)))).
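As a rough sketch of what the model form above implies at the level of individual event times (not a claim about how the simulations should be done), one can draw $W$ as the log of a standard exponential variable and back-transform. The names beta0, simTimes and tGrid are illustrative only and do not appear in the code at the bottom of the post.

```r
library(survival)

set.seed(1)
fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "exponential")
beta0 <- unname(coef(fit))      # fitted intercept on the log-time scale

# W = log(E), with E standard exponential, is standard minimum extreme value
W <- log(rexp(1e5))
simTimes <- exp(beta0 + W)      # simulated event times from log(T) = beta0 + W

# Empirical survival of the simulated times vs. the fitted exponential curve
tGrid <- c(100, 300, 500, 1000)
empSurv <- sapply(tGrid, function(t) mean(simTimes > t))
fitSurv <- 1 - pexp(tGrid, rate = 1 / exp(beta0))
round(cbind(time = tGrid, empirical = empSurv, fitted = fitSurv), 3)
```

With this many draws the two columns should agree closely, since exp(beta0 + W) is just an exponential variable with mean exp(beta0).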

My question is: why does the mean of the simulations (black line) differ so widely from the fitted base exponential model (red line)? When I simulated only the $β_0$ uncertainty in related posts, the simulations formed a band around the fitted base distribution, roughly as wide as the 95% CI around the Kaplan-Meier curve (the dashed blue lines). What am I doing wrong?

Code:

library(survival)
simNbr <- 25
time <- seq(0, 1000, by = 1)
fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "exponential")
# Compute exponential survival function for the base fitted model
survival <- 1 - pexp(time, rate = 1/exp(fit$coef))
# Compute the survival curves for each simulation
simPaths <- sapply(1:simNbr,function(i) 1-pexp(time,rate=1/rexp(1,rate = 1/exp(fit$icoef))))
plot(time,survival,type="n",xlab="Time",ylab="Survival Probability",main="Lung Data Survival Plot")
# Plot simulations (one thin line per simulated survival curve)
for (i in seq_len(simNbr)) {
  lines(time, simPaths[, i], col = "lightblue", lty = "solid", lwd = 0.25)
}
# Add average of simulations
avgSurvival <- apply(simPaths, 1, mean)
lines(time, avgSurvival, col = "black", lwd = 3)
# Add Kaplan-Meier survival curve for the lung data
lines(survfit(Surv(time, status) ~ 1, data = lung), col = "blue", lwd = 1)
# Plot the base fitted survival curve using exponential
lines(cbind(time, survival), type = "l", col = "red", lwd = 3)
legend("topright", 
       legend = c("Fitted exponential model","K-M & confidence intervals","Simulations", "Simulation mean"), 
       col = c("red", "blue", "lightblue", "black"),lwd = c(3, 1, 0.25, 3),lty = c(1, 1, 1, 1), 
       bty = "n")

This seems to be a coding problem appropriate for Stack Overflow rather than a statistical question appropriate for this site. That said, I don't see the basis for the random sampling in your simPaths code line; it's not sampling from the normal distribution of fit$coef so far as I can see. — EdM, May 17 '23 at 12:44
I may have misinterpreted your suggestion in https://stats.stackexchange.com/questions/615657/how-to-appropriately-model-the-uncertainty-of-the-exponential-distribution-model, where you simulate the density by drawing random samples for $W$ in your line of code rexp(300, rate=1/exp(6.044474)). Or maybe I did not misinterpret; what do you think? In any case I will try SO in case this is a coding problem (they'll probably kick me right back here to CV). — Village.Idyot, May 17 '23 at 14:50
My simulations there were of event times, not model parameters. Technically, I wasn't sampling directly from a standard minimum extreme value distribution $W$ there; I took advantage of the simplicity of the exponential model (with scale $\sigma$ in the term $\sigma W$ fixed at 1) to sample directly from an exponential survival distribution with a fixed rate parameter. This page (eventually) shows a correct way to sample from a minimum extreme value distribution for this general type of parametric survival model. — EdM, May 17 '23 at 14:57
OK, by "eventually" you must mean at the end of the post the user shows the corrected code after his back and forth with you looking for a solution. I like how his exp(b0+b1*x1+b2*x2+c*error) directly follows the formula notation. — Village.Idyot, May 17 '23 at 15:11
I adopted the method described in the post you referenced, in my answer below. Hopefully this is OK or at least closer to what you recommend, for modeling the variability in $W$. — Village.Idyot, May 18 '23 at 13:05
In your citation https://stats.stackexchange.com/questions/591943/simulate-a-weibull-regression-model about simulating a Weibull regression model, I see that survreg(Surv(time, status==1)...dist="exponential") is used. Why is "exponential" used, and rexp() used in that post, when the question is about simulating a Weibull model? Further, would the same references to "exponential" and rexp() be used when simulating $W$ for not only Weibull but other distributions too such as gamma and lognormal? — Village.Idyot, May 19 '23 at 12:32
A standard exponential survival function can be written $\log T= W$, where $T$ is time and $W$ is distributed as standard minimum extreme value. So if you sample $T$ randomly from a standard exponential model and take the logs, you end up sampling randomly from a standard minimum extreme value distribution. In a general Weibull model, you start with a standard minimum extreme value and scale it by a factor $\sigma$ in the overall form $\log T=\eta + \sigma W$, where $\eta$ is the linear predictor. Sample from $W$ standard normal for log-normal, from $W$ logistic for log-logistic. — EdM, May 19 '23 at 12:47
Standard Gamma models in the above form use $W$ distributed as generalized minimum extreme value, with $\sigma= 1$. The Rodríguez course notes explain how to use the R pgamma() function to sample directly from a gamma-distributed model. Study those notes carefully and you will come to understand this much better. — EdM, May 19 '23 at 12:56
I see, per where you've pointed me and other users before: https://grodri.github.io/survival/ParametricSurvival.pdf — Village.Idyot, May 19 '23 at 13:46
I continue to find those course notes to be one of the most useful, succinct summaries of parametric survival modeling. — EdM, May 19 '23 at 14:05
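The general form $\log T=\eta + \sigma W$ described in the comments above can be sketched as follows. This is a minimal illustration, assuming made-up values for the linear predictor and scale (eta, sigma) and hypothetical variable names; it covers only the Weibull, log-normal and log-logistic cases mentioned, not the gamma case.

```r
set.seed(2)
n     <- 1e5
eta   <- 6      # hypothetical linear predictor (assumption, not from the lung fit)
sigma <- 0.7    # hypothetical scale (assumption)

# Draw the error term W for each model family
W_weibull  <- log(rexp(n))   # standard minimum extreme value
W_lognorm  <- rnorm(n)       # standard normal
W_loglogis <- rlogis(n)      # standard logistic

# Back-transform to event times: log(T) = eta + sigma * W
T_weibull  <- exp(eta + sigma * W_weibull)
T_lognorm  <- exp(eta + sigma * W_lognorm)
T_loglogis <- exp(eta + sigma * W_loglogis)

# Compare simulated survival at one time point with the closed forms.
# In R's parameterization: Weibull shape = 1/sigma, scale = exp(eta).
q <- 400
simSurv <- c(weibull  = mean(T_weibull > q),
             lognorm  = mean(T_lognorm > q),
             loglogis = mean(T_loglogis > q))
thSurv <- c(weibull  = pweibull(q, shape = 1 / sigma, scale = exp(eta), lower.tail = FALSE),
            lognorm  = plnorm(q, meanlog = eta, sdlog = sigma, lower.tail = FALSE),
            loglogis = plogis(log(q), location = eta, scale = sigma, lower.tail = FALSE))
round(rbind(simulated = simSurv, theoretical = thSurv), 3)
```

Setting sigma to 1 in the Weibull line recovers the exponential case used in the question.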

score 0 · Accepted Answer · answered May 18 '23 at 13:03

I studied Simulate a Weibull regression model and my key takeaway is this (as modified slightly by me since I am trying to model $W$ in the model equation described in the OP):

W <- log(rexp(1000))
survreg(Surv(exp(W))~1,dist="exponential")

Running the above results in an intercept for $β_0$ near 0. Applying this method to my OP question and code, I get the output illustrated below with the code at the bottom. The code section that adopts the above is in simParams: W <- log(rexp(100)), fit <- survreg(Surv(exp(W))~1,dist="exponential"), and params <- coefLung + fit$icoef.

Though this is visually pleasing to my novice eye, a doubt I have is in rexp(100), where I set the 100 arbitrarily. A greater number of samples results in less dispersion, and a lesser number of samples the opposite. Is there an accepted method for setting the number of samples? Perhaps I should have used the number of elements (228) in the lung dataset? Maybe this is better addressed in another post.
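To make the effect of the sample size on dispersion concrete, here is a small sketch under stated assumptions. For uncensored exponential data the maximum likelihood intercept is log(mean(T)), so I use that shortcut in place of calling survreg() inside the loop. The helper simIntercepts and the grid nGrid are hypothetical names. The last lines compare the spread with the standard error reported for the lung fit, which for an intercept-only exponential model should be close to one over the square root of the number of observed events. This is meant as a way to explore the question, not as an accepted rule for choosing the number of samples.

```r
library(survival)
set.seed(3)

# Intercept estimate from n uncensored standard exponential draws:
# the MLE of beta0 is log(mean(T)) in that case
simIntercepts <- function(n, reps = 2000) {
  replicate(reps, log(mean(rexp(n))))
}

nGrid <- c(25, 100, 228, 1000)
sdSim <- sapply(nGrid, function(n) sd(simIntercepts(n)))
round(cbind(n = nGrid, sd_simulated = sdSim, one_over_sqrt_n = 1 / sqrt(nGrid)), 3)

# Standard error of the intercept actually fitted to the lung data
fitLung <- survreg(Surv(time, status) ~ 1, data = lung, dist = "exponential")
seLung  <- sqrt(vcov(fitLung)[1, 1])
nEvents <- sum(lung$status == 2)   # status 2 = death in the lung data
c(se_fitted = seLung, one_over_sqrt_events = 1 / sqrt(nEvents))
```

The simulated spread shrinks roughly as 1/sqrt(n), which is why the choice of n in rexp(n) drives the width of the band.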

Code:

library(survival)
simNbr <- 1000
time <- seq(0, 1000, by = 1)
fitLung <- survreg(Surv(time, status) ~ 1, data = lung, dist = "exponential")
coefLung <- fitLung$icoef
# Compute exponential survival function for the base fitted model
survival <- 1 - pexp(time, rate = 1/exp(coefLung))
# Generate random distribution parameter estimates for simulations
simParams <- sapply(
  1:simNbr,
  function(i){
    W <- log(rexp(100)) # note the number of random values which has a large impact on dispersion
    fit <- survreg(Surv(exp(W))~1,dist="exponential")
    params <- coefLung + fit$icoef
    return(as.vector(params))
    }
)
# Compute the survival curve for each simulation
simPaths <- sapply(
  1:simNbr, 
  function(i) 1 - pexp(time, rate = 1 / exp(simParams[i]))
)
# Set up plot shell
plot(time,survival,type="n",xlab="Time",ylab="Survival Probability",main="Lung Data Survival Plot")
# Plot simulations (one thin line per simulated survival curve)
for (i in seq_len(simNbr)) {
  lines(time, simPaths[, i], col = "lightblue", lty = "solid", lwd = 0.25)
}
# Add average of simulations
avgSurvival <- apply(simPaths, 1, mean)
lines(time, avgSurvival, col = "black", lwd = 1)
# Add Kaplan-Meier survival curve for the lung data
lines(survfit(Surv(time, status) ~ 1, data = lung), col = "blue", lwd = 1)
# Plot the base fitted survival curve using exponential
lines(cbind(time, survival), type = "l", col = "red", lwd = 3)
legend("topright", 
       legend = c("Fitted exponential model","K-M & confidence intervals","Simulations", "Simulations mean"), 
       col = c("red", "blue", "lightblue", "black"),lwd = c(3, 1, 0.25, 3),lty = c(1, 1, 1, 1), 
       bty = "n"
       )

Models fit by maximum likelihood have asymptotically normal distributions of coefficient estimates as the number of cases increases to infinity. Small samples necessarily have more variance in estimates and can even have some bias. The choice of the number of cases to simulate depends on your purpose in simulation. For example, in designing a clinical trial, you might simulate a very large population to start, and then vary the number of cases sampled to see the sample size that provides adequate power. — EdM, May 18 '23 at 13:19
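The asymptotic normality mentioned in the comment above suggests another way to simulate parameter uncertainty: draw the intercept from a normal distribution centred on the fitted coefficient, with the standard error taken from vcov(). The sketch below assumes that approximation is adequate for the lung data; the names simBeta0, tGrid and band are illustrative and not from the original code.

```r
library(survival)
set.seed(4)

fit   <- survreg(Surv(time, status) ~ 1, data = lung, dist = "exponential")
beta0 <- unname(coef(fit))          # fitted intercept
se0   <- sqrt(vcov(fit)[1, 1])      # its estimated standard error

# Draw intercepts from the (approximate) sampling distribution
simNbr   <- 1000
simBeta0 <- rnorm(simNbr, mean = beta0, sd = se0)

# One exponential survival curve per simulated intercept
tGrid    <- seq(0, 1000, by = 1)
simPaths <- sapply(simBeta0, function(b) 1 - pexp(tGrid, rate = 1 / exp(b)))

# Pointwise mean and 95% band across simulations
avgSurvival  <- rowMeans(simPaths)
band         <- apply(simPaths, 1, quantile, probs = c(0.025, 0.975))
baseSurvival <- 1 - pexp(tGrid, rate = 1 / exp(beta0))

# Largest gap between the simulation mean and the fitted curve
max(abs(avgSurvival - baseSurvival))
```

The resulting simPaths matrix has the same layout as in the code above (rows are time points, columns are simulations), so it can be passed to the same plotting steps.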
