Using multiple genes building gene signature and survival analysis

Question

This is the paper which is kind of I'm trying to implement based on my set of genes which i have narrowed down from WGCNA analysis.

Its a both conceptual and R related question[How to do it in R] my lack of conpetual clarity.

For single gene Im clear what to do where I can divide or split my patinet group based on low and high expression from a particular gene and then I can do survival analysis as show this post

Now based on the above paper I cited I would like to highlight upon this figure, in the figure they have mentioned LS17(low standard and High). LSC17 is the combination of 17 genes which they are using to classify the Leukemia Cohort.

So if i have to do it in R on how to go about it since I'm unable to figure out how to go about it? As here in case instead of 1 gene I have 17 genes so How should I put a filter to categorize my patient samples as Low/Standard/High

As shown in the biostar post

survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS',
  'X203666_at', 'X205680_at')]
colnames(survplotdata) <- c('Time.RFS', 'Distant.RFS',
  'CXCL12', 'MMP10')
set Z-scale cut-offs for high and low expression
highExpr <- 1.0
lowExpr <- -1.0
survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
  ifelse(survplotdata$CXCL12 <= lowExpr, 'Low', 'Mid'))
survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
  ifelse(survplotdata$MMP10 <= lowExpr, 'Low', 'Mid'))
relevel the factors to have mid as the ref level
survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
  levels = c('Mid', 'Low', 'High'))
survplotdata$MMP10 <- factor(survplotdata$MMP10,
  levels = c('Mid', 'Low', 'High'))

Here they have listed two genes which is CXCL12 and MMP10 gene but here they are calculating them differently.

So my question is how to do it for group of genes together?

Giving a dummy dataframe here

set.seed(123)
nr1 = 4; nr2 = 8; nr3 = 6; nr = nr1 + nr2 + nr3
nc1 = 6; nc2 = 8; nc3 = 10; nc = nc1 + nc2 + nc3
mat = cbind(rbind(matrix(rnorm(nr1*nc1, mean = 1,   sd = 0.5), nr = nr1),
          matrix(rnorm(nr2*nc1, mean = 0,   sd = 0.5), nr = nr2),
          matrix(rnorm(nr3*nc1, mean = 0,   sd = 0.5), nr = nr3)),
    rbind(matrix(rnorm(nr1*nc2, mean = 0,   sd = 0.5), nr = nr1),
          matrix(rnorm(nr2*nc2, mean = 1,   sd = 0.5), nr = nr2),
          matrix(rnorm(nr3*nc2, mean = 0,   sd = 0.5), nr = nr3)),
    rbind(matrix(rnorm(nr1*nc3, mean = 0.5, sd = 0.5), nr = nr1),
          matrix(rnorm(nr2*nc3, mean = 0.5, sd = 0.5), nr = nr2),
          matrix(rnorm(nr3*nc3, mean = 1,   sd = 0.5), nr = nr3))
   )
mat = mat[sample(nr, nr), sample(nc, nc)] # random shuffle rows and columns
rownames(mat) = paste0("gene", seq_len(nr))
colnames(mat) = paste0("LSC", seq_len(nc))

Transposing the dataframe

 mat2 <- mat %>% as.data.frame() %>% t() %>% as.data.frame() %>% rownames_to_column("Patient")

In my dummy dataframe I have a total of 18 rows or genes so i would like to categorize all my columns[patient] based on the combination of 5 rows(genes) such as row1,row2,row3,row4,row5 as Five Low/ Five standard / Five high How to do it in R?

Any suggestion or help would be really appreciated

It sounds like you don't know how to calculate the LSC17 score? If that's the case, I'd suggest reading the reference in the paper you linked to: https://pubmed.ncbi.nlm.nih.gov/27926740/. — David B, Mar 21 '23 at 03:15
no that part I'm clear i want to know how to categorize data into group to test the survival differences — PesKchan, Mar 21 '23 at 06:28
"High and low LSC17 groups were defined as above or below the median LSC17 score for the cohort, respectively." So probably a 30% quantile split. — David B, Mar 21 '23 at 12:55
"So probably a 30% quantile split" how to do this? mean in terms of R translation given the dataframe I have created instead of 17 genes from a combination of 5 genes ? how to do it? — PesKchan, Mar 21 '23 at 16:12
The 17 genes are combined into a single variable when you make LSC17. After that, it's just quantile(LSC17,probs=c(0.3333,0.6666)) — David B, Mar 21 '23 at 17:05
Although it's frequently done, splitting continuous predictor variables into groups is not a good idea. If you are going to combine expression levels of 17 genes in some way, try working instead with a continuous combination rather than categorizing it. Also, how many patients and "events" (deaths, recurrences, or whatever) do you have for the survival analysis? — EdM, Mar 21 '23 at 18:54
@EdM i would have gone to one of your answer and asked you suggestion ... how many patients and "events" (deaths, recurrences, or whatever) do you have for the survival analysis in my TCGA cohort I have 147 patient — PesKchan, Mar 21 '23 at 19:05
@EdM "try working instead with a continuous combination rather than categorizing it" pardon my ignorance I would like to know how this one I understand the word you said but in terms of implementation I'm not able to translate ? if possible any post or if dummy solution based on my data-frame that would be really helpful — PesKchan, Mar 21 '23 at 19:07

score 3 · Accepted Answer · 2023-05-28T22:28:00.193

Not an answer, but too long for a comment.

First of all, please specfiy the dependency on {tibble} by library(tibble) or tibble::rownames_to_column("Patient").

Just to make things clear: according to the linked paper, "LSC17" is not a patient but a score calculated as follows (cp. Methods):

For each gene, the probeset with the highest average GE in the training data was selected to represent that gene. To extract a core subset of genes from among the 43 that were more highly expressed in LSC + cell fractions that best explained patient outcomes in the training cohort, we used a linear regression technique based on the LASSO algorithm as implemented in the glmnet 1.9-8 R package16,17 , while enabling leave-one-out cross-validation to fit a Cox regression model. A minimal subset of 17 genes was selected whose weighted combined GE (LSC17 score) was highly correlated to survival outcomes in the training cohort.

Assuming you have done this, and gene1, ... gene18 (gene17in the paper) is your result, the subsequent step is describred as (cp. Methods)

The LSC17 score is calculated for each patient as a linear combination of GE of these 17 genes weighted by regression coefficients that were estimated from the training data as follows: LSC17 score = (DNMT3B × 0.0874) + (ZBTB46 × −0.0347) + (NYNRIN × 0.00865) + (ARHGAP22 × −0.0138) + (LAPTM4B × 0.00582) + (MMRN1 × 0.0258) + (DPYSL3 × 0.0284) + (KIAA0125 × 0.0196) + (CDK6 × −0.0704) + (CPXM1 × −0.0258) + (SOCS2 × 0.0271) + (SMIM24 × −0.0226) + (EMP1 × 0.0146) + (NGFRAP1 × 0.0465) + (CD34 × 0.0338) + (AKR1C3 × −0.0402) + (GPR56 × 0.0501).

Since I am not sure what your training data is or should be, I demonstrate the next step based on the provided data. Note, you (probably) need to perform a regression on the training data as mentioned.

If gene1, ... gene5 is an appropritate choice/shortcut/..., and further assuming that every row correspondents to a patient, you can compute LSC17score as follows

# data handling 
df <- as.data.frame(mat2[, -1])
lets create some toy regression coefficients
set.seed(032223)
beta <- round(runif(n = 5, min = -.25, max = .25), 2) / 10
beta <- c(0.009, -0.004, -0.004, -0.007, 0.009)
linear combination
df$LSC17score <- rowSums(
  sweep(x = df[, c(paste0("gene", 1:5))], 
        MARGIN = 2,

        STATS = beta, 
        FUN = *)
  )

Transformation of continous data to categorical means an information loss:

# c := categorical  
df$LSC17c <- cut(x = df$LSC17score,
                  breaks = 3,
                  labels = c("FiveLow", "FiveStandard", "FiveHigh")
                  )

A glimpse of the result:

> head(df[, c("LSC17score", "LSC17c")], n = 7)
   LSC17score       LSC17c
1  0.07762944     FiveHigh
2 -0.53205310 FiveStandard
3 -0.39538182 FiveStandard
4  0.19058862     FiveHigh
5  0.10449223     FiveHigh
6 -0.34134344 FiveStandard
7 -0.97711100      FiveLow

Maybe renaming to LSC5score and LSC5c is good practice, since we only use gene1, ..., gene5.

Well I'm not sure how to thank you after spending more than 4 days on this I was stuck and not able to process what exactly I have to do to reach there thanks a ton again . — PesKchan, Mar 22 '23 at 09:25
well honestly sweep(x = df[, c(paste0("gene", 1:5))], MARGIN = 2, STATS = beta, FUN =*) ) I would have never thought something like this due to my restricted R skills — PesKchan, Mar 22 '23 at 09:37

Using multiple genes building gene signature and survival analysis

set Z-scale cut-offs for high and low expression

relevel the factors to have mid as the ref level

1 Answers1

lets create some toy regression coefficients

set.seed(032223)

beta <- round(runif(n = 5, min = -.25, max = .25), 2) / 10

linear combination

Linked