Bootstrapping method survival model interpretation

Question

This was one the [question][1] which was helped and suggested by [EdM][2] how to do it in R way.

So far what I have done is this

The dataset6 is my complete dataset which include all my genes and the Scores of the two modules which I have already calculated based on the regression coefficient already obtained.

 dataset6 = inner_join(surv_obj1,dataset5) %>% cbind(M1[c(32)]) %>% cbind(M2[c(42)])
 dim(dataset6)

##Modules : all the genes and the module scores

M1 <- read.csv("Module_turq.txt",sep = "\t")
names(M1)[32] = "turquoise"
M2 <- read.csv("module_red.txt",sep = "\t")
names(M2)[42] = "red"

As an example the M1 looks something like this

head(M1)
  ENSG00000006659 ENSG00000008853 ENSG00000070808 ENSG00000084734 ENSG00000086570 ENSG00000091490 ENSG00000100228 ENSG00000105672 ENSG00000106772 ENSG00000108179
1      -0.6790895        9.838845       3.8723699       2.0334803        7.408578        11.63426        5.388844        3.403330        8.402316        12.10071
2      -0.7223128        9.216762       1.9638908       0.2942262        6.407777        11.65133        4.751099        2.862639        8.371253        10.56136
3      -0.6791654        8.901460       1.9595495       1.1313903        8.457811        11.66191        4.814653        2.939381        9.843061        11.04008
4      -0.7388880        9.249982       2.8520751       0.9272291        7.241343        11.37203        4.344087        3.680633        9.009311        11.64196
5      -0.6837360       10.312524       1.3708872       1.3626295        7.800494        11.00893        5.046930        3.630544        7.975663        11.56495
6       0.2083037        8.045902       0.9177488       0.7799186        9.234063        11.16630        5.218017        4.294078        6.153577        11.39511
  ENSG00000116985 ENSG00000124196 ENSG00000125740 ENSG00000127412 ENSG00000127946 ENSG00000128578 ENSG00000138622 ENSG00000140836 ENSG00000149635 ENSG00000153208
1        8.337326       0.6028046        14.95574        3.676078       12.195097        6.266345       0.7590498       10.134101      1.07506924        9.153512
2        7.050741       0.1714888        13.19622        3.044690       11.475854        7.258213       0.9300694       11.619774      0.38973280       10.189657
3        7.675785       0.8536641        15.04202        3.275292       11.331764        6.495812       3.3460528       10.565562     -0.06284153        9.095599
4        8.114534       0.4120677        13.72461        2.954629       11.965255        6.803049       0.6115331        9.842173      0.67297629        9.193858
5        8.852637       0.2637794        14.80603        3.040592       11.733995        6.085176       1.3058149       11.195194     -0.07398686        8.680911
6        8.150247       1.5651071        13.70370        3.678386        9.989683        6.363244       1.2811180       10.964695      0.88423334       10.820407
  ENSG00000159228 ENSG00000160999 ENSG00000164086 ENSG00000166866 ENSG00000166922 ENSG00000171885 ENSG00000177138 ENSG00000185338 ENSG00000186510 ENSG00000205856
1        9.890364        8.853479       10.935791        3.737294        4.531792       0.9634707       0.2378286        6.563547        2.408185       1.1550716
2        7.085259        7.590380        9.670493        4.879658        4.407784       0.4286926       2.2786092        4.555640        2.336665       0.6929619
3        8.294819        8.290893        9.657836        5.134109        3.646726       2.3294132       2.1727927        5.381531        3.271168       0.8168658
4        9.153176        7.964677       10.564849        3.630441        4.645231       1.3016413       0.5430753        5.313173        2.586458       1.1699673
5        7.814855        8.173153       11.026781        4.127289        4.826615       0.4942343       0.6896831        6.481789        1.758393       3.5270260
6        6.933262        8.273962        9.209629        4.554088        4.788554       2.2729740       0.9930626        6.183071        2.182287       0.3210672
  ENSG00000213214 turquoise
1       0.9801103  8.135249
2       0.8521615  6.033414
3       1.5692256  6.050110
4       0.4066649  7.211567
5       2.3361480  6.996665
6       0.4935823  5.172616

Extract colnames to create formula

# Generate formula string
predictor_names1 <- colnames(M1) # exclude the first two columns (time and status)
formula_str1 <- paste("Surv(OS_MONTHS, Status) ~", paste(predictor_names1, collapse = " + "))
# Fit the Cox proportional hazards regression model
my_formula1 <- as.formula(formula_str1)
predictor_names2 <- colnames(M2) # exclude the first two columns (time and status)
formula_str2 <- paste("Surv(OS_MONTHS, Status) ~", paste(predictor_names2, collapse = " + "))
Fit the Cox proportional hazards regression model
my_formula2 <- as.formula(formula_str2)

Running bootstrap

for (i in 1:1000) {
Generate a bootstrap sample
bootSample <- dataset6[sample.int(nrow(dataset6),replace=TRUE),]
Calculate the IQR for each module
IQR1 <- IQR(M1$turquoise)
  IQR2 <- IQR(M2$red)
Fit the Cox proportional hazards regression model for each module
bootFit1 <- coxph(my_formula1,data = bootSample)
  bootFit2 <- coxph(my_formula2,data = bootSample)
Calculate the logHR difference between the two modules
logHR1 <- coef(bootFit1)[[2]] * IQR1
  logHR2 <- coef(bootFit2)[[2]] * IQR2
  logHRdiff <- logHR2 - logHR1
Store the result
if (i == 1) {
    resultStore <- logHRdiff
  } else {
    resultStore <- c(resultStore, logHRdiff)
  }
}
Sort the results
resultStore <- resultStore[order(resultStore)]
Calculate confidence limits for logHR difference
conf_limits <- quantile(resultStore, probs = c(0.025, 0.975))
Print the confidence limits
conf_limits

In the above code to find the IQR for the both the modules I used the module scores which is basically turquoise for first module and red for the second module

Before that I also tried something like this where I tried to find IQR of each column I ran this

IQR1 <-  M1 %>%
       select_if(is.numeric) %>%
    summarise(across(everything(), IQR))
IQR2 <-  M2 %>%
       select_if(is.numeric) %>%
       summarise(across(everything(), IQR))

it didn't work

As suggested in the my previous question

if you did 999 bootstrap samples then the 25th and 975th in order give you 95% confidence limits for the difference between the modules. If a difference of 0 is outside those confidence limits, then you have evidence that one of the modules is superior.

To do the above I ran this which in the above code

# Calculate confidence limits for logHR difference
conf_limits <- quantile(resultStore, probs = c(0.025, 0.975))
Print the confidence limits
conf_limits

Output

2.5%     97.5% 
-1.814380  2.446325

So what would be the output interpretation since it contains 0 there both the model are basically same or similar ?

Coxph model output

Call:
coxph(formula = my_formula1, data = bootSample)
                coef exp(coef) se(coef)      z        p

ENSG00000006659 -0.08480   0.91869  0.18335 -0.463 0.643702
ENSG00000008853 -0.71237   0.49048  0.45977 -1.549 0.121284
ENSG00000070808  0.22058   1.24680  0.18767  1.175 0.239853
ENSG00000084734  0.15579   1.16858  0.25058  0.622 0.534113
ENSG00000086570 -0.78152   0.45771  0.33290 -2.348 0.018894
ENSG00000091490  0.30301   1.35392  0.26720  1.134 0.256792
ENSG00000100228  0.82649   2.28528  0.32949  2.508 0.012128
ENSG00000105672 -0.57979   0.56001  0.22165 -2.616 0.008901
ENSG00000106772  0.21472   1.23952  0.21201  1.013 0.311153
ENSG00000108179  0.59817   1.81879  0.42130  1.420 0.155654
ENSG00000116985 -0.16206   0.85039  0.28520 -0.568 0.569867
ENSG00000124196  0.69766   2.00905  0.33448  2.086 0.036994
ENSG00000125740 -0.26467   0.76746  0.21396 -1.237 0.216094
ENSG00000127412 -0.09850   0.90620  0.22435 -0.439 0.660640
ENSG00000127946  0.33344   1.39575  0.26910  1.239 0.215323
ENSG00000128578 -0.33829   0.71299  0.21157 -1.599 0.109822
ENSG00000138622  0.02457   1.02488  0.37422  0.066 0.947643
ENSG00000140836  0.03179   1.03230  0.28102  0.113 0.909943
ENSG00000149635  0.71572   2.04565  0.23854  3.000 0.002696
ENSG00000153208  0.22391   1.25095  0.20699  1.082 0.279368
ENSG00000159228  0.18013   1.19737  0.22749  0.792 0.428466
ENSG00000160999 -0.56201   0.57006  0.28997 -1.938 0.052600
ENSG00000164086  1.17011   3.22234  0.44175  2.649 0.008078
ENSG00000166866 -0.34015   0.71166  0.28358 -1.200 0.230328
ENSG00000166922  0.48040   1.61673  0.30426  1.579 0.114354
ENSG00000171885 -0.37376   0.68814  0.27954 -1.337 0.181199
ENSG00000177138  0.11922   1.12661  0.30421  0.392 0.695140
ENSG00000185338  0.76505   2.14910  0.22498  3.400 0.000673
ENSG00000186510  0.47312   1.60500  0.23003  2.057 0.039711
ENSG00000205856  0.22150   1.24794  0.32990  0.671 0.501960
ENSG00000213214  0.06063   1.06250  0.22885  0.265 0.791064
turquoise             NA        NA  0.00000     NA       NA
Likelihood ratio test=152.9  on 31 df, p=< 2.2e-16
n= 146, number of events= 92

Why the turquoise is NA?
[1]: How to compare two signature for prediction models? [2]: https://stats.stackexchange.com/users/28500/edm

score 1 · Accepted Answer · answered Apr 28 '23 at 13:29

The failure to get coefficients for the "turquoise" module itself shows the inherent problem in your Cox model. You are double-counting the gene-expression values that go into calculating each case's module score.

A "module" of the type you describe is a linear combination of a set of gene-expression values. That linear combination might come, for example, from a LASSO model based on expression of all genes. For each case you then get a "module" score by taking the sum of the expression values for the genes retained by the LASSO model, with each gene-expression value weighted by its corresponding (penalized) LASSO regression coefficient.

What have done for your survival model is to use not only the module score but also the individual gene-expression values that go into the module. The module score is a linear combination of the individual gene-expression values, so including both the individual values and their combination score leads to a linearly dependent set of predictors. R will let you try to set up such a regression, but it will omit coefficients for those that are linearly dependent on predictors that appear earlier in the formula.

That's why you aren't getting a coefficient for the module.

What you are doing with the Cox model is double-counting the set of selected genes in a way that undercuts the whole point of constructing the module in the first place. If you want to proceed with this type of analysis, then for the Cox model use only the module score and omit all of the individual gene-expression values that go into creating it.

In terms of your problems with IQR(), I'm not fluent in the "tidyverse" and coding-specific matters are off-topic on this site.

Cox model use only the module score and omit all of the individual gene-expression values that go into creating it. thank you for clarifying it I was kind of stuck and in doubt what I implemented here thinking if i understood your point or not — PesKchan, Apr 28 '23 at 19:30

Bootstrapping method survival model interpretation

Extract colnames to create formula

Fit the Cox proportional hazards regression model

Running bootstrap

Generate a bootstrap sample

Calculate the IQR for each module

Fit the Cox proportional hazards regression model for each module

Calculate the logHR difference between the two modules

Store the result

Sort the results

Calculate confidence limits for logHR difference

Print the confidence limits

Print the confidence limits

Output

Coxph model output

1 Answers1

Linked