How to adapt the fgseaL function to perform rapidGSEA computation of gene ranks across 9 different phenotype labels?

Question

I wish to adapt the r language function fgseaL, https://github.com/ctlab/fgsea , to perform rapidGSEA, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1244-x , computation of inter-class deviation per gene and the subsequent gene rank sorting operation on 9 different phenotype labels as illustrated in the diagram immediately below:

I thought of applying the R-language rank() function on the Expression Data Matrix D. If that is not correct, what sequence of R language commands should we apply to the Expression Data Matrix D to calculate a key value sorted deviation measure across 8 labeled human leukemia groups and a healthy labeled normal control group prior to running fgseaL?

I show below how fgseaL finds the correlation matrix between the R language variable , mat, which corresponds to the Expression Data Matrix D and the R language variable , labels , which is a vector of gene phenotype labels

    tmatSc <- scale(t(mat))
    labelsSc <- scale(labels)[, 1]

    minSize <- max(minSize, 1)

    pathwaysFiltered <- lapply(pathways, function(p) { as.vector(na.omit(fmatch(p, rownames(mat)))) })
    pathwaysSizes <- sapply(pathwaysFiltered, length)

    toKeep <- which(minSize <= pathwaysSizes & pathwaysSizes <= maxSize)
    m <- length(toKeep)

    if (m == 0) {
        return(data.table(pathway=character(),
                          pval=numeric(),
                          padj=numeric(),
                          ES=numeric(),
                          NES=numeric(),
                          nMoreExtreme=numeric(),
                          size=integer(),
                          leadingEdge=list()))
    }

    pathwaysFiltered <- pathwaysFiltered[toKeep]
    pathwaysSizes <- pathwaysSizes[toKeep]

    corRanks <- var(tmatSc, labelsSc)[,1]
    ranksOrder <- order(corRanks, decreasing=T)
    ranksOrderInv <- invPerm(ranksOrder)
    stats <- corRanks[ranksOrder]

    pathwaysReordered <- lapply(pathwaysFiltered, function(x) ranksOrderInv[x])

    gseaStatRes <- do.call(rbind,
                           lapply(pathwaysReordered, calcGseaStat,
                                  stats=stats,
                                  returnLeadingEdge=TRUE))

I found a problem with the algorithm shown immediately below.
correcttest <- data.frame(names = row.names(normal))
correcttest <- cbind(correcttest3, normal)
correcttest <- cbind(correcttest3, ALL3m)
rownames(correcttest) <- correcttest$names
correcttest$names <- NULL
correctlabelnormal <- rep(0:0, 73)
correctlabelALL3m <- rep(1:1, 122)
correctlabel <- as.vector(c(correctlabelnormal,correctlabelALL3m))
s <- apply(correcttest, 1, function(x) coef(lm(x~correctlabel))[2])
o <- rank(s)
o <- max(o) - o + 1
res <- fgseaL(df,o,correctlabel,nperm = 2000,minSize = 1, maxSize=50000)
empty data table (0 rows) of 8 columns:   pathway,pval,padj,ES,NES,nMoreExtreme,size

I found the binary phenotype labeled group fgseaL test results below looked satisfactory.    
correcttest <- data.frame(names = row.names(normal))
correcttest <- cbind(correcttest3, normal)
correcttest <- cbind(correcttest3, ALL3m)
rownames(correcttest) <- correcttest$names
correcttest$names <- NULL
correctlabelnormal <- rep(0:0, 73)
correctlabelALL3m <- rep(1:1, 122)
correctlabel <- as.vector(c(correctlabelnormal,correctlabelALL3m))
fgseaL(df,correcttest,correctlabel,nperm = 2000,minSize = 1, maxSize=50000)
       pathway        pval        padj         ES       NES nMoreExtreme  size
1: Gene.Symbol 0.003940887 0.003940887 -0.2460126 -1.180009            3 45714
                                 leadingEdge
1: AKIRIN2,LRRC20,HSPA5,HSPA5,DTWD2,ZFYVE28,

Thank you for your consideration.

It might be worth to use the package provided by the authors in github. This way you don't need to implement it again — llrs, Aug 08 '17 at 10:14
@Llopis, Thank you for the good suggestion. I will try installing it later today.. — Frank, Aug 08 '17 at 10:54
@Llopis, Can we run rapidGSEA without the CUDA enhancemnets since I am not using a computer with a NVIDIA cpu? Thank you. — Frank, Aug 12 '17 at 14:49
I think you could be able to do so, but I am not the maintainer neither the author of the code. You could also check the source code. But that probably is better to check and ask as another question — llrs, Aug 12 '17 at 18:11
@Llopis, How do I install the NVIDIA c++ compiler to compile cudaGSEA? I already installed the NVIDIA CUDA Toolkit. Thank you. — Frank, Aug 12 '17 at 19:09
@Llopis, There is a Makefile in github gravitino/cudaGSEA/src. What should I do with it if I am running Windows? Thanks — Frank, Aug 12 '17 at 20:25
@Llopis, When I perform the base install of the NVIDIA CUDA Toolkit on Windows 8.1, I get a diagnostic message , "NVIDIA Installer cannot continue. The NVIDIA graphics driver is not compatible with this version of Windows". What should one do here to make nvcc compile okay? Thank you. — Frank, Aug 12 '17 at 22:29
if you have a problem don't hesitate to ask a new question on a new thread. — llrs, Aug 13 '17 at 08:43
@Llopis, Could you please comment on my answer which I just wrote below? Thank you for your help. — Frank, Aug 13 '17 at 21:45
It is not a question where I need clarification to answer it, so I won't comment it. Please if you have a new question post it as such. I will stop paying attention to this question an answers. — llrs, Aug 14 '17 at 09:27

Frank · Answer 1 · 2017-08-23T16:16:10.837

I finally found another answer to my question. Please read this great article in May 12 2017 BioMed Central (BMC) Bioinformatics article titled Ranking metrics in gene set enrichment analysis: do they matter?.

Also, please read this blog , Diving into Genetics and Genomics: Gene Set Enrichment Analysis (GSEA) explained.

After reading these two articles, my choice for the best rapidGSEA local ranking measure is Minimum Significant Difference (i.e., MSD), because it has the best overall false positive rate (i.e., FPR) for larger sample sizes.

Finally, it is important to realize that fgseaL's phenotype labeling can hypothetically be emulated by GSEA with either one of it's sixteen possible ranking metrics or a custom ranking metric.

score 2 · Accepted Answer · answered Aug 07 '17 at 12:14

2

You rank the fit coefficient rather than the original score matrix. So, given a score matrix, D:

D = matrix(c(22,20,9,8,46,22,18,10,3,18,3,29,2,1,5,45,43,47,17,5,14,44,21,36), byrow=T, ncol=6)
cl = c(0,0,0,1,1,1)
s = apply(m, 1, function(x) coef(lm(x~cl))[2]) # [1]
o = rank(s)
o = max(o) - o + 1 # [2]

o is then the rank of each row.

[1] This fits each row as a linear model of cl and extracts the cl coefficient.

[2] This converts the ranking to be the same as shown in the figure. I don't know if this is important, but I would assume so given how GSEA methods tend to work.

answered Aug 07 '17 at 12:14

Devon Ryan

19,602
2
29
60

Thank you for your excellent answer. I just accepted it. Could I pass a R language nine(9) valued phenotype label vector to where you specify cl = c(0,0,0,1,1,1) or does cl have to be a binary phenotype label? The reason I ask this question is that we have 8 human leukemia groups and 1 healthy normal control group. – Frank Aug 07 '17 at 13:32
The method itself is only designed for two-group comparisons, so if you have more phenotypes you'll either have to do one comparison at a time or come up with a different method (I'm not sure what at multi-factor GSEA output would even mean in such cases). – Devon Ryan Aug 07 '17 at 13:34
Thank you for your reply to my comment. In the original Bioinformatics Stack Exchange question above , I have added some fgseaL source code which calculates the correlation matrix between the Expression Data Matrix D and the R language variable , labels , which is a vector of gene phenotype labels. Could I ask you to explain the purpose of this fgseaL source code extract? – Frank Aug 07 '17 at 13:59
@Frank Please post that as a new question, though please read the preprint linked to from github first, since I assume that goes over the method. – Devon Ryan Aug 07 '17 at 14:03
I will post the new question later today. Also, I will try out your rank the linear regression fit coefficient method on our data and report the results back to you later today. Thanks. – Frank Aug 07 '17 at 14:52
I tested your rank the linear regression fit coefficient R language algorithm on our data with fgseaL and encountered the diagnostic message Empty data.table (0 rows) of 8 cols: pathway,pval,padj,ES,NES,nMoreExtreme. Could I ask what I did wrong? My R language code is shown at the bottom of my original question. Nonetheless , you are correct about doing two-group comparisons one at a time as I demonstrate at the bottom of my original question. – Frank Aug 07 '17 at 23:43
That's not my algorithm, it's just how to do that step from the paper. You'll get an empty output from the R code you posted whenever everything is filtered out. Whether that makes sense or not is a completely different question. – Devon Ryan Aug 08 '17 at 06:41
Thank you for your comment. What is the name of the paper and page which describes that step? – Frank Aug 08 '17 at 07:22
@Frank It's literally the image in your post and the paper you linked to. – Devon Ryan Aug 08 '17 at 07:23
I don't see any mention of linear regression coefficient in the paper I linked to. Where did you get that idea? Also, fgseaL expects the mat input to be an matrix of doubles with the raw key value to be sorted while R language rank outputs a vector of integer rank scores. Thank you. – Frank Aug 08 '17 at 09:03
The "inter-class deviation" == "difference between groups" == "coefficient from a linear model". You're mixing and matching methods, you'll have to figure out how to get one to work with the other. – Devon Ryan Aug 08 '17 at 09:05
Is it permissible to perform the two phenotype group comparison using fgseaL without scaling on the input mat or D systematically one group at a time between each of the 8 diseased groups and the normal healthy group rather than using the linear regression coefficient R language source code as illustrated in the original question? Thank you. – Frank Aug 08 '17 at 10:53
@Frank Please post things like that as their own question. – Devon Ryan Aug 08 '17 at 10:54
Why is the fgseaL pval equal to the padj of 0.003940887 when I do the two group comparison between the normal control group and the ALL3m human leukemia disease group of subjects? Thank you. – Frank Aug 08 '17 at 14:52
When I do res<-fgsea(df,correcttest,nperm = 1000,minSize = 1, maxSize=50000) without the phenotype label feature why do I get the error in [.data.frame(x, order(x, na.last = na.last, decreasing = decreasing)) , undefined columns selected? Thank you. – Frank Aug 08 '17 at 15:03
Let us continue this discussion in chat. – Devon Ryan Aug 08 '17 at 18:21
Could you please comment on my answer which I just wrote below? Thank you for your help. – Frank Aug 13 '17 at 21:46

Frank · Answer 3 · 2017-08-13T21:52:33.080

This weekend, I installed the R language packages rapidGSEA and the Broad Institute GSEA , the Nvidia CUDA Toolkit 6.5-- Custom installation option and MINGW makefiles on my son's Windows 8.1 notebook computer. This step was very tricky to accomplish because I had to maneuver around the "NVIDIA Installer cannot continue. The NVIDIA graphics driver is not compatible with this version of Windows" installation error message which prevents the NVIDIA compilers from being installed correctly. My objective is to run rapidGSEA without the CUDA enhancements since I am not using a computer with a NVIDIA GPU. I did this step because rapidGSEA and Broad Institute GSEA have a significantly richer set of gene set enrichment analytics than fgseaL as shown below:

The GSEA method takes five default arguments and three optional arguments GSEA <- function(exprsData, labelList, geneSets, numPermutations, metricString, dumpFileName="", checkInput=TRUE, doublePrecision=FALSE) {...}

exprsData, labelList and geneSets refer to the data obtained in the previous section. numPermutations denotes the number of permutations in the resampling test, metricString denotes the local ranking measure (one of the following):

naive_diff_of_classes
naive_ratio_of_classes
naive_log2_ratio_of_classes
stable_diff_of_classes
stable_ratio_of_classes
stable_log2_ratio_of_classes
onepass_signal2noise
onepass_t_test
twopass_signal2noise
twopass_t_test
stable_signal2noise
stable_t_test
overkill_signal2noise
overkill_t_test

Glad you got it working. Not sure what else I'm supposed to comment on here. — Devon Ryan, Aug 14 '17 at 06:47
@Devon Ryan, If metricString denotes the local ranking measure we are requesting that rapidGSEA use when it is invoked, which metricString value should I choose so that rapidGSEA operates in a manner most closely resembling fgseaL? Thank you. — Frank, Aug 14 '17 at 17:11
Don't try to make one work like the other. If you start making your own method then you have to get it through peer review. It doesn't appear you have the background necessary to do that. Use rapidGSEA or fgseaL, stop trying to merge them. — Devon Ryan, Aug 15 '17 at 07:11

How to adapt the fgseaL function to perform rapidGSEA computation of gene ranks across 9 different phenotype labels?

3 Answers3