Simulating survival data to reflect existing complex dataset and censoring proportions

Question

I am trying to compare the impact of different censoring proportions on different survival analysis models. For that, I plan to run a simulation study and then apply the best-performing model to real-life child mortality data to find the determinants of mortality.

The simulated data should reflect the pattern of effects seen in child mortality data. My data set has 32 covariates. I'm having trouble understanding how to simulate the data using the simsurv package in R. I have gone through many research papers, but none of them gave me a clear picture of how the data should be simulated. In particular, I do not understand how to include the covariates in the simulation.

Can someone please explain to me how I should simulate the data to reflect the covariates and the censoring proportions? From my understanding, the 2-component Weibull mixture model is a good fit for complex data like this.

I was told that I should first decide on a survival function that fits the data well by fitting a parametric survival model (Weibull), and then compare it with a 2-component Weibull mixture model to see which of the two fits the data better. But where do I incorporate the covariates and the censoring proportions? Is my understanding wrong?

Do you also want to simulate the covariates? Otherwise, the package vignette, https://cran.r-project.org/web/packages/simsurv/vignettes/simsurv_usage.html, seems to provide examples — Ute, Jul 02 '23 at 10:20
@Ute No, only the survival times need to be simulated. Thank you for the help. — Nipuni Opatha, Jul 02 '23 at 10:48
I think you are right, it is not so clear how to simulate censoring. Did you see this post: https://stats.stackexchange.com/questions/522861/how-to-simulate-survival-data-with-censoring-using-r — Ute, Jul 02 '23 at 11:01
@Ute Yes, they do not specify anything regarding the censoring proportions; that's why I was confused. I will look up this post, thank you. — Nipuni Opatha, Jul 02 '23 at 11:35

score 0 · Accepted Answer · answered Jul 02 '23 at 15:31

Matching a particular censoring fraction has to be done by trial and error.

You do all the other simulations first: covariate values (if they are being modeled), then uncensored event times based on the covariate values. Simulation methods typically use a large survival-time cutoff to avoid integrating out to very long times, so any case whose event time would have been simulated beyond that cutoff is automatically taken as right censored, initially at the cutoff time. Its final censoring time will be earlier still if the censoring time modeled later falls before the cutoff.
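A minimal sketch of this first step, assuming a Weibull proportional-hazards model with hazard lam * gamma * t^(gamma - 1) * exp(x . beta) and inverse-transform sampling. It is written in plain Python rather than with simsurv, and the function name, the two covariates, the coefficients and the cutoff max_t are all hypothetical values chosen for illustration, not estimates from the child mortality data.

```python
import math
import random

def simulate_event_times(covariates, betas, lam, gamma, max_t, rng):
    """Draw one Weibull proportional-hazards event time per covariate row.

    Returns (time, event) pairs; event is 0 when the simulated time
    exceeded the cutoff max_t and was censored at the cutoff.
    """
    out = []
    for x in covariates:
        lin_pred = sum(b * xi for b, xi in zip(betas, x))
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
        # Invert S(t | x) = exp(-lam * t**gamma * exp(lin_pred)) at S = u.
        t = (-math.log(u) / (lam * math.exp(lin_pred))) ** (1.0 / gamma)
        if t > max_t:
            out.append((max_t, 0))  # beyond the cutoff: right censored
        else:
            out.append((t, 1))
    return out

rng = random.Random(2023)
# Hypothetical covariates: one binary, one standard normal.
covariates = [(rng.randint(0, 1), rng.gauss(0.0, 1.0)) for _ in range(1000)]
events = simulate_event_times(
    covariates, betas=(0.5, -0.3), lam=0.1, gamma=1.5, max_t=5.0, rng=rng
)
```

In a real study the covariate rows would come from the observed data set and the coefficients from a model fitted to it; only the event times are drawn at random.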

As censoring should be uninformative, you then choose a model for censoring that is independent of covariate values. That can be based on any probability distribution that's defined only over non-negative time values. You might choose a uniform or exponential or Weibull distribution of censoring times, with some particular starting values for parameters of the censoring model.
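As a rough illustration of that choice, the sketch below draws censoring times from each of the three distributions mentioned, without looking at any covariate. The function name and all parameter values are assumptions made for the example; they are the "starting values" that would later be adjusted.

```python
import random

def draw_censoring_times(n, rng, dist="exponential",
                         rate=0.2, upper=10.0, scale=5.0, shape=1.5):
    """Draw n censoring times that do not depend on any covariate."""
    if dist == "uniform":
        return [rng.uniform(0.0, upper) for _ in range(n)]
    if dist == "exponential":
        return [rng.expovariate(rate) for _ in range(n)]
    if dist == "weibull":
        # random.weibullvariate takes (scale, shape) in that order.
        return [rng.weibullvariate(scale, shape) for _ in range(n)]
    raise ValueError(f"unknown censoring distribution: {dist}")

rng = random.Random(1)
cens_unif = draw_censoring_times(500, rng, dist="uniform", upper=8.0)
cens_exp = draw_censoring_times(500, rng, dist="exponential", rate=0.2)
cens_wei = draw_censoring_times(500, rng, dist="weibull", scale=5.0, shape=1.5)
```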

Then line up the uncensored event times and the censoring times. If you have an appropriate fraction of right-censored event times (cases whose event times weren't even estimated as they exceeded the simulation cutoff, plus cases whose simulated censoring time is earlier than the simulated event time), then you are all set. If not, adjust the parameters of the censoring distribution and try, try again until you match the desired censoring fraction. You probably won't get the exact censoring fraction that you want, but you should be able to get close.
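The adjust-and-retry loop can be automated. The sketch below is one possible way to do it, not the only one: it assumes exponential censoring, reuses the same uniform draws on every attempt so that the censored fraction can only go up as the rate goes up, and then bisects on the rate instead of guessing by hand. The latent event times, the target of 30% and every parameter value are hypothetical.

```python
import math
import random

def censored_fraction(latent_times, cens_uniforms, rate, max_t):
    """Fraction of cases whose observed time is a censoring time."""
    n_censored = 0
    for t, u in zip(latent_times, cens_uniforms):
        # Exponential censoring time, capped at the simulation cutoff.
        c = min(-math.log(u) / rate, max_t)
        if c < t:
            n_censored += 1
    return n_censored / len(latent_times)

def tune_censoring_rate(latent_times, target, max_t, rng,
                        lo=1e-6, hi=100.0, iters=60):
    """Bisect (on the log scale) for a rate giving about the target fraction."""
    us = [1.0 - rng.random() for _ in latent_times]  # fixed across attempts
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if censored_fraction(latent_times, us, mid, max_t) < target:
            lo = mid  # too little censoring: censor earlier
        else:
            hi = mid  # enough censoring: try censoring later
    return hi, us

rng = random.Random(42)
lam, gamma, max_t = 0.1, 1.5, 20.0
# Hypothetical latent (uncensored) Weibull event times, no covariates.
latent = [(-math.log(1.0 - rng.random()) / lam) ** (1.0 / gamma)
          for _ in range(2000)]
rate, us = tune_censoring_rate(latent, target=0.30, max_t=max_t, rng=rng)
achieved = censored_fraction(latent, us, rate, max_t)
print(f"rate={rate:.4f}, censored fraction={achieved:.3f}")
```

Two caveats apply. If the cutoff alone already censors more cases than the target, no censoring distribution can bring the fraction down, and the cutoff has to be raised instead. And the tuned rate matches the target only for this one simulated data set; a fresh replicate drawn with the same rate will land near the target, not exactly on it.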

This page and its links might also be helpful.
