Guessing What Data Was Observed Based On The Model?

Question

Suppose I fit a Survival Cox-PH Regression Model in R and get the following results:

Call:
coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung)
             coef exp(coef)  se(coef)      z        p
age      0.011067  1.011128  0.009267  1.194 0.232416
sex     -0.552612  0.575445  0.167739 -3.294 0.000986
ph.ecog  0.463728  1.589991  0.113577  4.083 4.45e-05
Likelihood ratio test=30.5  on 3 df, p=1.083e-06
n= 227, number of events= 164 
   (1 observation deleted due to missingness)

Based on these results, I can infer information such as:

The number of observations
The number of events
The number of variables
The estimate for the effect of each variable

My Question: Given this information, is it possible to simulate the covariate and response information for n = 227 such observations - such that if a similar Cox-PH model was fit to these newly simulated 227 observations, the resulting regression coefficients would approximately be equal to the original regression coefficients? Can I try to guess (and recreate) the observations that might have been observed based on the regression model coefficients?

For example, I know that if I were to "fix" the covariate information for a group of n = 227 "arbitrary created" patients, I could then simulate their survival times (e.g. https://cran.r-project.org/web/packages/simsurv/index.html) - however, if I were to then fit a Cox-PH model to these observations, the model coefficients would not necessarily be close to the original model coefficients.

In general, is this possible to do? Only given the above model summary, could I try and somehow generate the original dataset that this model was trained on?

Thanks!

Note: I realize there are probably an infinite number of n = 227 samples that can be randomly simulated such that a Cox-PH Model produces the same regression coefficient estimates as above.

I want to comment here that if someone goes full Bayesian, they could sample from the posterior distributions to create very plausible datasets (given that the model was correctly specified, estimated, etc.). – usεr11852 Feb 12 '23 at 01:23

score 2 · Answer 1 · answered Feb 08 '23 at 13:10


Given this information, is it possible to simulate the covariate and response information for n = 227 such observations - such that if a similar Cox-PH model was fit to these newly simulated 227 observations, the resulting regression coefficients would approximately be equal to the original regression coefficients?

Yes. Earlier, you asked a question about how to simulate survival times. Simply use the method described there, but set the covariate effects to the coefficients you see in the output of your summary call.
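A minimal sketch of that idea in R, under loudly stated assumptions: the helper `simulate_cox_data`, the covariate distributions, the exponential baseline hazard `lambda0`, and the uniform censoring are all invented here for illustration. None of them can be read off the model summary; only the three coefficients come from the question.

```r
library(survival)

# Reported coefficients from the model summary in the question.
beta <- c(age = 0.011067, sex = -0.552612, ph.ecog = 0.463728)

# Hypothetical helper: covariates are drawn from ASSUMED distributions and
# event times from an ASSUMED exponential baseline hazard (rate lambda0).
simulate_cox_data <- function(n, beta, lambda0 = 0.003, max_follow_up = 1000) {
  x <- data.frame(
    age     = round(rnorm(n, mean = 62, sd = 9)),      # assumed age distribution
    sex     = sample(1:2, n, replace = TRUE),          # assumed even split, coded 1/2
    ph.ecog = sample(0:3, n, replace = TRUE,
                     prob = c(0.30, 0.50, 0.18, 0.02)) # assumed ECOG mix
  )
  lp <- as.vector(as.matrix(x) %*% beta[names(x)])     # linear predictor x'beta
  event_time  <- -log(runif(n)) / (lambda0 * exp(lp))  # invert the cumulative hazard
  censor_time <- runif(n, 0, max_follow_up)            # assumed uniform censoring
  x$time   <- pmin(event_time, censor_time)            # observed follow-up time
  x$status <- as.integer(event_time <= censor_time)    # 1 = event, 0 = censored
  x
}

set.seed(1)
sim <- simulate_cox_data(227, beta)
fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = sim)

# Side-by-side comparison of the reported and re-estimated coefficients.
round(cbind(reported = beta, refit = coef(fit)), 3)
```

With only 227 rows the refitted coefficients will scatter around the reported ones by roughly a standard error, so "approximately equal" is the most this can deliver; a much larger `n` brings them closer.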


Demetri Pananos


@ Demetri Pananos: Thank you for your answer! What I meant by this question, I was interested in knowing if it is possible to "guess" the covariate information belonging to the patients that this model was built on. – stats_noob Feb 08 '23 at 17:52
@stats_noob do you mean by 'guess' something like estimating whether the ages had been between 0-10 years or between 50-60 years? That's information which you cannot obtain from that output, since the output is the same regardless of how the input is shifted. – Sextus Empiricus Feb 09 '23 at 09:17

In addition, the input space has 3x227 dimensions and the output has only 3 dimensions (or 6 if you include the standard errors of the coefficients). It is like estimating a sample based on its mean. You cannot get to know the values of the individuals. But you can generate a sample that corresponds to the output. Is that what you mean? – Sextus Empiricus Feb 09 '23 at 09:22
@ Sextus Empiricus: thank you for your reply! I think that's what I had in mind: can I try to reverse engineer the medical information (e.g. age, sex, ph.ecog) and survival times of each patient in this dataset? – stats_noob Feb 09 '23 at 09:26

score 1 · Answer 2 · answered Feb 13 '23 at 16:40

In general, is this possible to do? Only given the above model summary, could I try and somehow generate the original dataset that this model was trained on?

You'll need additional information about the data to get the distribution of the covariates correct. Even then, you won't match the data exactly, but you'll get closer.

There are answers on Stack Exchange that explain how to simulate survival times from a Cox-PH model. See here, here, and here. Each of these answers uses the method in Bender et al. (2005) to simulate survival times: invert the cumulative hazard function and plug in uniform random variables.
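As a sketch of that inversion step (the symbols $h_0$, $H_0$, $\lambda$, $x$, $\beta$ and $U$ are notation introduced here, not taken from the answers linked above): under a Cox model the hazard is $h(t \mid x) = h_0(t)\exp(x^\top\beta)$, so the survival function is $S(t \mid x) = \exp\{-H_0(t)\exp(x^\top\beta)\}$, where $H_0$ is the cumulative baseline hazard. Since $S(T \mid x)$ is uniform on $(0, 1)$, setting it equal to $U \sim \mathrm{Uniform}(0,1)$ and solving for $T$ gives

$$
T = H_0^{-1}\!\left(-\log(U)\,\exp(-x^\top\beta)\right).
$$

If one assumes a constant baseline hazard $h_0(t) = \lambda$, then $H_0(t) = \lambda t$ and this reduces to

$$
T = \frac{-\log(U)}{\lambda\,\exp(x^\top\beta)}.
$$

The Cox output does not report $h_0$, so the baseline hazard has to be chosen or estimated separately; the coefficients only fix $\beta$.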

Your simulation will include the correct effect of each covariate, but you won't be able to know the distribution of the covariates. You know the log hazard ratio for sex is $-0.55$, but you don't know whether this was estimated from 10, 50, 100, or 200 men. You won't know the distribution of age either. Was the study restricted to 50-70 year olds? What was the age of each person? You can usually find a demographics table that lists the distributions of important variables. If you include this in your simulations, you'll be able to reproduce survival times more accurately.
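One way to see why the age distribution cannot be recovered is to shift every age by a constant and refit. The sketch below assumes the `lung` data shipped with the `survival` package (the dataset named in the question's call); the 30-year shift and the object names are arbitrary choices for illustration.

```r
library(survival)

# Fit the model from the question on the original data.
fit_orig <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

# Make every patient 30 years younger: a very different cohort on paper.
lung_shift <- transform(lung, age = age - 30)
fit_shift  <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung_shift)

# The partial likelihood only depends on covariate differences within each
# risk set, so a constant shift cancels and the coefficients should agree.
same_coefs <- all.equal(coef(fit_orig), coef(fit_shift), tolerance = 1e-6)
same_coefs
```

Both fits should report the same coefficients up to numerical tolerance, so the summary alone cannot distinguish a cohort of 60-year-olds from one of 30-year-olds.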

This additional information will only get you so far. You'll never be able to reproduce the original data exactly. Like you said, there are many datasets that will produce identical or nearly identical parameters.
