0

This was one the [question][1] which was helped and suggested by [EdM][2] how to do it in R way.

So far what I have done is this

The dataset6 is my complete dataset which include all my genes and the Scores of the two modules which I have already calculated based on the regression coefficient already obtained.

 dataset6 = inner_join(surv_obj1,dataset5) %>% cbind(M1[c(32)]) %>% cbind(M2[c(42)])
 dim(dataset6)

##Modules : all the genes and the module scores

M1 <- read.csv("Module_turq.txt",sep = "\t")
names(M1)[32] = "turquoise"
M2 <- read.csv("module_red.txt",sep = "\t")
names(M2)[42] = "red"

As an example the M1 looks something like this

head(M1)
  ENSG00000006659 ENSG00000008853 ENSG00000070808 ENSG00000084734 ENSG00000086570 ENSG00000091490 ENSG00000100228 ENSG00000105672 ENSG00000106772 ENSG00000108179
1      -0.6790895        9.838845       3.8723699       2.0334803        7.408578        11.63426        5.388844        3.403330        8.402316        12.10071
2      -0.7223128        9.216762       1.9638908       0.2942262        6.407777        11.65133        4.751099        2.862639        8.371253        10.56136
3      -0.6791654        8.901460       1.9595495       1.1313903        8.457811        11.66191        4.814653        2.939381        9.843061        11.04008
4      -0.7388880        9.249982       2.8520751       0.9272291        7.241343        11.37203        4.344087        3.680633        9.009311        11.64196
5      -0.6837360       10.312524       1.3708872       1.3626295        7.800494        11.00893        5.046930        3.630544        7.975663        11.56495
6       0.2083037        8.045902       0.9177488       0.7799186        9.234063        11.16630        5.218017        4.294078        6.153577        11.39511
  ENSG00000116985 ENSG00000124196 ENSG00000125740 ENSG00000127412 ENSG00000127946 ENSG00000128578 ENSG00000138622 ENSG00000140836 ENSG00000149635 ENSG00000153208
1        8.337326       0.6028046        14.95574        3.676078       12.195097        6.266345       0.7590498       10.134101      1.07506924        9.153512
2        7.050741       0.1714888        13.19622        3.044690       11.475854        7.258213       0.9300694       11.619774      0.38973280       10.189657
3        7.675785       0.8536641        15.04202        3.275292       11.331764        6.495812       3.3460528       10.565562     -0.06284153        9.095599
4        8.114534       0.4120677        13.72461        2.954629       11.965255        6.803049       0.6115331        9.842173      0.67297629        9.193858
5        8.852637       0.2637794        14.80603        3.040592       11.733995        6.085176       1.3058149       11.195194     -0.07398686        8.680911
6        8.150247       1.5651071        13.70370        3.678386        9.989683        6.363244       1.2811180       10.964695      0.88423334       10.820407
  ENSG00000159228 ENSG00000160999 ENSG00000164086 ENSG00000166866 ENSG00000166922 ENSG00000171885 ENSG00000177138 ENSG00000185338 ENSG00000186510 ENSG00000205856
1        9.890364        8.853479       10.935791        3.737294        4.531792       0.9634707       0.2378286        6.563547        2.408185       1.1550716
2        7.085259        7.590380        9.670493        4.879658        4.407784       0.4286926       2.2786092        4.555640        2.336665       0.6929619
3        8.294819        8.290893        9.657836        5.134109        3.646726       2.3294132       2.1727927        5.381531        3.271168       0.8168658
4        9.153176        7.964677       10.564849        3.630441        4.645231       1.3016413       0.5430753        5.313173        2.586458       1.1699673
5        7.814855        8.173153       11.026781        4.127289        4.826615       0.4942343       0.6896831        6.481789        1.758393       3.5270260
6        6.933262        8.273962        9.209629        4.554088        4.788554       2.2729740       0.9930626        6.183071        2.182287       0.3210672
  ENSG00000213214 turquoise
1       0.9801103  8.135249
2       0.8521615  6.033414
3       1.5692256  6.050110
4       0.4066649  7.211567
5       2.3361480  6.996665
6       0.4935823  5.172616

Extract colnames to create formula

# Generate formula string
predictor_names1 <- colnames(M1) # exclude the first two columns (time and status)
formula_str1 <- paste("Surv(OS_MONTHS, Status) ~", paste(predictor_names1, collapse = " + "))
# Fit the Cox proportional hazards regression model
my_formula1 <- as.formula(formula_str1)

predictor_names2 <- colnames(M2) # exclude the first two columns (time and status) formula_str2 <- paste("Surv(OS_MONTHS, Status) ~", paste(predictor_names2, collapse = " + "))

Fit the Cox proportional hazards regression model

my_formula2 <- as.formula(formula_str2)

Running bootstrap

for (i in 1:1000) {

Generate a bootstrap sample

bootSample <- dataset6[sample.int(nrow(dataset6),replace=TRUE),]

Calculate the IQR for each module

IQR1 <- IQR(M1$turquoise) IQR2 <- IQR(M2$red)

Fit the Cox proportional hazards regression model for each module

bootFit1 <- coxph(my_formula1,data = bootSample) bootFit2 <- coxph(my_formula2,data = bootSample)

Calculate the logHR difference between the two modules

logHR1 <- coef(bootFit1)[[2]] * IQR1 logHR2 <- coef(bootFit2)[[2]] * IQR2 logHRdiff <- logHR2 - logHR1

Store the result

if (i == 1) { resultStore <- logHRdiff } else { resultStore <- c(resultStore, logHRdiff) } }

Sort the results

resultStore <- resultStore[order(resultStore)]

Calculate confidence limits for logHR difference

conf_limits <- quantile(resultStore, probs = c(0.025, 0.975))

Print the confidence limits

conf_limits

In the above code to find the IQR for the both the modules I used the module scores which is basically turquoise for first module and red for the second module

Before that I also tried something like this where I tried to find IQR of each column I ran this

IQR1 <-  M1 %>%
       select_if(is.numeric) %>%
    summarise(across(everything(), IQR))
IQR2 <-  M2 %>%
       select_if(is.numeric) %>%
       summarise(across(everything(), IQR))

it didn't work

As suggested in the my previous question

if you did 999 bootstrap samples then the 25th and 975th in order give you 95% confidence limits for the difference between the modules. If a difference of 0 is outside those confidence limits, then you have evidence that one of the modules is superior.

To do the above I ran this which in the above code

# Calculate confidence limits for logHR difference
conf_limits <- quantile(resultStore, probs = c(0.025, 0.975))

Print the confidence limits

conf_limits

Output

2.5%     97.5% 
-1.814380  2.446325 

So what would be the output interpretation since it contains 0 there both the model are basically same or similar ?

Coxph model output

Call:
coxph(formula = my_formula1, data = bootSample)
                coef exp(coef) se(coef)      z        p

ENSG00000006659 -0.08480 0.91869 0.18335 -0.463 0.643702 ENSG00000008853 -0.71237 0.49048 0.45977 -1.549 0.121284 ENSG00000070808 0.22058 1.24680 0.18767 1.175 0.239853 ENSG00000084734 0.15579 1.16858 0.25058 0.622 0.534113 ENSG00000086570 -0.78152 0.45771 0.33290 -2.348 0.018894 ENSG00000091490 0.30301 1.35392 0.26720 1.134 0.256792 ENSG00000100228 0.82649 2.28528 0.32949 2.508 0.012128 ENSG00000105672 -0.57979 0.56001 0.22165 -2.616 0.008901 ENSG00000106772 0.21472 1.23952 0.21201 1.013 0.311153 ENSG00000108179 0.59817 1.81879 0.42130 1.420 0.155654 ENSG00000116985 -0.16206 0.85039 0.28520 -0.568 0.569867 ENSG00000124196 0.69766 2.00905 0.33448 2.086 0.036994 ENSG00000125740 -0.26467 0.76746 0.21396 -1.237 0.216094 ENSG00000127412 -0.09850 0.90620 0.22435 -0.439 0.660640 ENSG00000127946 0.33344 1.39575 0.26910 1.239 0.215323 ENSG00000128578 -0.33829 0.71299 0.21157 -1.599 0.109822 ENSG00000138622 0.02457 1.02488 0.37422 0.066 0.947643 ENSG00000140836 0.03179 1.03230 0.28102 0.113 0.909943 ENSG00000149635 0.71572 2.04565 0.23854 3.000 0.002696 ENSG00000153208 0.22391 1.25095 0.20699 1.082 0.279368 ENSG00000159228 0.18013 1.19737 0.22749 0.792 0.428466 ENSG00000160999 -0.56201 0.57006 0.28997 -1.938 0.052600 ENSG00000164086 1.17011 3.22234 0.44175 2.649 0.008078 ENSG00000166866 -0.34015 0.71166 0.28358 -1.200 0.230328 ENSG00000166922 0.48040 1.61673 0.30426 1.579 0.114354 ENSG00000171885 -0.37376 0.68814 0.27954 -1.337 0.181199 ENSG00000177138 0.11922 1.12661 0.30421 0.392 0.695140 ENSG00000185338 0.76505 2.14910 0.22498 3.400 0.000673 ENSG00000186510 0.47312 1.60500 0.23003 2.057 0.039711 ENSG00000205856 0.22150 1.24794 0.32990 0.671 0.501960 ENSG00000213214 0.06063 1.06250 0.22885 0.265 0.791064 turquoise NA NA 0.00000 NA NA

Likelihood ratio test=152.9 on 31 df, p=< 2.2e-16 n= 146, number of events= 92

Why the turquoise is NA?
[1]: How to compare two signature for prediction models? [2]: https://stats.stackexchange.com/users/28500/edm

PesKchan
  • 155
  • 5

1 Answers1

1

The failure to get coefficients for the "turquoise" module itself shows the inherent problem in your Cox model. You are double-counting the gene-expression values that go into calculating each case's module score.

A "module" of the type you describe is a linear combination of a set of gene-expression values. That linear combination might come, for example, from a LASSO model based on expression of all genes. For each case you then get a "module" score by taking the sum of the expression values for the genes retained by the LASSO model, with each gene-expression value weighted by its corresponding (penalized) LASSO regression coefficient.

What have done for your survival model is to use not only the module score but also the individual gene-expression values that go into the module. The module score is a linear combination of the individual gene-expression values, so including both the individual values and their combination score leads to a linearly dependent set of predictors. R will let you try to set up such a regression, but it will omit coefficients for those that are linearly dependent on predictors that appear earlier in the formula.

That's why you aren't getting a coefficient for the module.

What you are doing with the Cox model is double-counting the set of selected genes in a way that undercuts the whole point of constructing the module in the first place. If you want to proceed with this type of analysis, then for the Cox model use only the module score and omit all of the individual gene-expression values that go into creating it.

In terms of your problems with IQR(), I'm not fluent in the "tidyverse" and coding-specific matters are off-topic on this site.

EdM
  • 92,183
  • 10
  • 92
  • 267
  • Cox model use only the module score and omit all of the individual gene-expression values that go into creating it. thank you for clarifying it I was kind of stuck and in doubt what I implemented here thinking if i understood your point or not – PesKchan Apr 28 '23 at 19:30