How to simulate variability (errors) in fitting a gamma model to survival data by using a generalized minimum extreme value distribution in R?

Question

As shown below and per the R code at the bottom, I plot a base survival curve for the lung dataset from the survival package using a fitted gamma model (plot red line) with flexsurvreg() and run simulations (plot blue lines) by drawing on random samples for the $W$ random error term of the standard gamma distribution equation $Y=logT=α+W$ per section 1.5 of the Rodríguez notes at https://grodri.github.io/survival/ParametricSurvival.pdf. In drawing on random samples for $W$, I am trying to illustrate the fundamental variability imposed by fitting a model to the data by using a generalized standard minimum extreme value distribution per the Rodríguez notes, which states: "$W$ has a generalized extreme-value distribution with density...controlled by a parameter $k$ as excerpted below"

In the plot below, I run 5 simulations by repeatedly running the last line of code and as you can see the simulations are far from the fitted gamma curve in red. Those simulations should surround the fitted gamma, loosely following its shape. The best I've been able to do in generating random samples $W$ and loosely following the shape of the fitted gamma curve is by using rgamma() as shown in the below code and plotted below, and for getting closer to the fitted gamma curve although not following its curvature is rexp() (commented-out in the code) which is a bit nonsensical and neither of these two are generalized extreme values either. I also try rgpd() from the evd package (commented-out) for generating random samples from a generalized extreme value distribution, and those results are nonsensical. Similar bad outcomes using evd::rgev().

What am I doing wrong in my interpretation of sampling from the generalized extreme-value distribution? And how do I get the simulations to form a broad band around the gamma curve?

Code:

library(evd)
library(flexsurv)
library(survival)
time <- seq(0, 1000, by = 1)
fit <- flexsurvreg(Surv(time, status) ~ 1, data = lung, dist = "gamma")
shape <- exp(fit$coef["shape"])
rate <- exp(fit$coef["rate"])
survival <- 1-pgamma(time, shape = shape, rate = rate)
Generate random distribution parameter estimates for simulations
simFX <- function(){
    W <- log(rgamma(100,shape = shape, scale = 1/rate))
    # W <- log(rexp(100, rate))
    # w <- log(evd::rgpd(100,shape = shape, scale = 1/rate))
    newTimes <- exp(log(shape) + W)
    newFit <- flexsurvreg(Surv(newTimes) ~ 1, data = lung, dist = "gamma")
    newFitShape <- exp(newFit$coef["shape"])
    newFitRate <- exp(newFit$coef["rate"])
    return(1-pgamma(time, shape=newFitShape, rate=newFitRate))
  }
plot(time,survival,type="n",xlab="Time",ylab="Survival Probability", main="Lung Survival (gamma)")
lines(survival, type = "l", col = "red", lwd = 3) # plot base fitted survival curve
lines(simFX(), col = "blue", lty = 2) # run this line repeatedly for SIMULATION

Edit: [There was previously an Edit 1 reflecting an error and Edit 2 correcting Edit 1. Edit 1 has been deleted and Edit 2 switched to the final "Edit"]. This Edit adds a Weibull example showing the simulation of $W$ when characterizing Weibull by $Y = log T = α + σW$, where $W$ has the extreme value distribution, $α = − logλ$ and $p = 1/σ$. This example generates samples by: (1) randomizing $W$, (2) combining the randomized $W$ with the estimates of $α$ and $σ$ from the model fit to the lung data following the above equation form, (3) exponentiating the last item per the above equation form to get survival-time simulations (note in newTimes <- exp(fit$icoef[1] + exp(fit$icoef[2])*W) of the code below, how $σ$ (fit$icoef[2]) is exponentiated since in its native form it is in the log scale) and the whole thing $α+σW$ is exponentiated again as a unit), (4) running each new survreg() fit. Code:

library(survival)
time <- seq(0, 1000, by = 1)
fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "weibull")
weibCurve <- function(time, survregCoefs){exp(-(time/exp(survregCoefs[1]))^exp(-survregCoefs[2]))}
survival <- weibCurve(time, fit$icoef)
simFX <- function(){
  W <- log(rexp(165))
  newTimes <- exp(fit$icoef[1] + exp(fit$icoef[2])*W)
  newFit <- survreg(Surv(newTimes)~1,dist="weibull")
  params <- c(newFit$icoef[1],newFit$icoef[2])
  return(weibCurve(time, params))
}
plot(time,survival,type="n",xlab="Time",ylab="Survival Probability", 
     main="Lung Survival (Weibull) by sampling from W extreme-value distribution")
invisible(replicate(500,lines(simFX(), col = "blue", lty = 2))) # run this line to add simulations to plot
lines(survival, type = "l", col = "yellow", lwd = 3) # plot base fitted survival curve

Illustration of running above simFX() 500 times (using replicate() function):

You are incorporating $\alpha$ and $\sigma$ from the Weibull fit to lung (in the $\log T = \alpha + \sigma W$ form for Weibull) at the wrong point. The Weibull example shows new fits via survreg() only to samples from a standard minimum extreme value distribution and makes an additive assumption in params <- fitCoef + fitW$icoef. To mimic new samples, you need first to incorporate $W$ along with the estimates of $\alpha$ and $\sigma$ from the model fit to the lung data, and exponentiate to get survival-time simulations. Only then do each new survreg() fit, and plot directly. — EdM, May 29 '23 at 20:09
I attempted to incorporate your recommendation in Edit 2. In the plot beneath the implementation code, it doesn't look quite right although not too far off. Am I doing anything wrong in my implementation for your recommendation? — Village.Idyot, May 30 '23 at 08:12
fit$icoef[2] is $\log \sigma$. Also, the variability is a function of how many events in the data you fit. The lung data has 165 events, while you are only sampling 100 in your code. — EdM, May 30 '23 at 13:07
Does my revised Edit 2 look ok? I can delete Edit 1 if Edit 2 is ok in order to shorten this post. I hope Edit 2 appropriately simulates $W$ while holding $α$ and $σ$ from the original fit constant. Does it make sense to simulate $W$ in this manner, and also use MASS:mvrnorm() to additionally simulate the $α$ and $σ$ parameters? Adding more dispersion and more "extremity" to the analysis. — Village.Idyot, May 30 '23 at 14:17
Few things make my head hurt more than trying to reconcile all of the parameterizations of survival models. All seems OK now in your Edit 2 (although imposing transparency on the individual curves, if that's possible with R on your computer, might help show the distribution among the 500 curves). How much extra dispersion to add depends on what you are trying to illustrate. For example, including all sources of variability might be wise for simulating data to inform a new clinical trial. — EdM, May 30 '23 at 14:22

EdM · Accepted Answer · 2023-05-29T21:16:50.103

Ultimately, if you have the same reasonably large number of events and the underlying parametric model is adequate, the distribution of coefficient estimates from repeated sampling of the underlying distribution should be close to the multivariate normal distribution from the original fit.

There's no need to put this in the $\log T = \alpha + W$ accelerated failure time format. Once you have fixed shape and rate parameters in the form expected by pgamma() and rgamma(), just sample directly from rgamma() for event-time simulations. With your shape and rate values from the fit to the lung data set, to mimic the 165 events in that data set and overlay onto the Kaplan-Meier curve:

plot(survfit(Surv(time, status) ~ 1, data = lung))
newGammaData <- rgamma(165,shape=shape,rate=rate)
newGammaFit <- flexsurvreg(Surv(newGammaData, rep(1,165))~1,dist="gamma") 
lines(1-pgamma(time, exp(newGammaFit$coef["shape"]), exp(newGammaFit$coef["rate"])),col="red")

Repeat the last 3 lines as often as you'd like.

To compare against the variability of coefficient estimates from the original model, sample from the bivariate normal distribution of the coefficients as they are reported by flexsurvreg(), exponentiate (as you do) to get values that are expected by pgamma(), then add lines as you show. Try the following:

newCoef <- MASS::mvrnorm(1,coef(fit),vcov(fit))
lines(1-pgamma(time,exp(newCoef[["shape"]]),exp(newCoef[["rate"]])),col="blue")

and repeat both code lines as often as you would like.

Direct comparison of covariances among fits to simulated data and covariance matrix of coefficient estimates

To illustrate how well the asymptotically bivariate normal distribution of coefficient estimates corresponds to what you find by multiple fits to data drawn from the modeled survival distribution, get the gamma fit to the original data (with 165 events), then combine fit results from 200 samples (of 165 uncensored simulated event times each) from the modeled gamma fit.

## fit lung data
library(flexsurv)
lungGammaFit <- flexsurvreg(Surv(time, status) ~ 1, data = lung, dist = "gamma")
## get coefficients in form for rgamma()
shape <- exp(lungGammaFit$coef["shape"])
rate <- exp(lungGammaFit$coef["rate"])
## initialize for accumulation over reps
bar <- rep(0,2) ## to keep coefficient estimates
cov <- matrix(0,nrow=2,ncol=2) ## to keep covariances
set.seed(104)
## do 200 samples of 165 each and fit
for (fitCount in 1:200) {newData <- rgamma(165,shape=shape,rate=rate);
      newFit <- flexsurvreg(Surv(newData, rep(1,165))~1,dist="gamma");
      cof<-coef(newFit);
      bar <- bar+cof;
      cof<-as.matrix(cof);
      cov<-cov+cof%*%t(cof)}
## take average over 200 reps
(barAve <- bar/200)
#     shape       rate 
# 0.4001778 -5.5644051 
## expected good agreement with original fit
coef(lungGammaFit)
#     shape       rate 
# 0.3908883 -5.5839805
## multivariate coefficient covariance estimate
barAve <- as.matrix(barAve)
newVcov <- (cov-200*barAve %*% t(barAve))/199
newVcov
#            shape       rate
# shape 0.01090990 0.01145029
# rate  0.01145029 0.01592581
## even covariances are very close
vcov(lungGammaFit)
#             shape       rate
# shape 0.009105897 0.01050332
# rate  0.010503320 0.01603337

That illustrates the first paragraph of the answer: the variability among fits to multiple samples from the gamma-distribution fit to the lung data is essentially what you would get from the original variance-covariance matrix of coefficient estimates, if the number of events is the same.

That works for simulating variability in $α$ coefficients. What I am interested in for now however is the variability in $W$ but I may be barking up the wrong tree. I thought I was heading there with W <- log(evd::rgpd(100,shape = shape, scale = 1/rate)) and newTimes <- exp(log(shape) + W), where log(shape) is the fixed $α$ from fitting to gamma. Can you point me to an example of modeling $W$ correctly for any distribution if not gamma? Plotting the elements that survive (or don't) as a histogram for example to illustrate inter-individual variability instead of survival curves. — Village.Idyot, May 29 '23 at 18:02
@Village.Idyot we did this for minimum extreme value here. If you have fixed parameter values, why not just sample event times directly from rgamma() with those parameter values instead of forcing this into the $\log T = \alpha +W$ form? — EdM, May 29 '23 at 18:13
You should get pretty similar results either way, if you have the same number of events. I've edited to show how to do it either way. — EdM, May 29 '23 at 18:53
I edited the OP before seeing your last comment and edited answer. However, without the benefit of reviewing your edits and last comment, I may have been completely "off" in my implementation of modeling the variability of $W$. In the Weibull example I added to the OP, am I modeling $W$ (which is what I thought I was doing) or only the coefficients? — Village.Idyot, May 29 '23 at 19:30

How to simulate variability (errors) in fitting a gamma model to survival data by using a generalized minimum extreme value distribution in R?

Generate random distribution parameter estimates for simulations

1 Answers1

Linked