How to run enrichment analysis of protein functional annotation?

Question

I have a lot of protein clusters. I want to perform an enrichment analysis of their functional annotations against reference datasets or lists of genes that I select.

More precisely: a method yields clusters of proteins. I want to decide, for each cluster (which is a set of protein identifiers), whether it holds meaning regarding the proteins it contains, by comparing the annotations found in the cluster with the annotations found in common or specific datasets.

Initially, I used DAVID, which computes the GO annotations from the protein list, then performs the enrichment analysis against common datasets.

However, DAVID suffers from the following drawbacks:

Since I did not find any other way to use it, I am restricted to the web interface and its limits (number of genes, URL size, number of requests per day).
Automating the web interface requires a hacky script.

The team behind it seems to offer a way to run DAVID in-house (hopefully relaxing these limitations), but I can't find any way to download the database from their website.

What are the alternative ways to run enrichment analysis on proteins in a reliable and automated way?

This paper on the obsolescence of some enrichment analysis tools might be a good starting point to pick one (note that DAVID has been updated since): https://www.nature.com/nmeth/journal/v13/n9/full/nmeth.3963.html – mgalardini, Aug 21 '17 at 11:09
What is the biological question? Why do you want to compare the functional annotations of your proteins against reference datasets or your list of genes? (I have usually found the opposite: I have a list of genes and I need to know what functions/pathways/processes they are involved in.) – llrs, Aug 22 '17 at 09:39
The question is whether a cluster shows an enrichment in functional annotations. The method studied yields many clusters, and I want to decide whether each cluster holds meaning regarding the proteins it contains, by comparing the annotations found in the cluster with the annotations found in common or specific datasets. – aluriak, Aug 22 '17 at 09:47
To make it even clearer: by clusters, do you mean a list of interacting proteins (in a complex or in a pathway), or that some tool grouped your proteins into a cluster? Could you edit the question to include this information in its body? – llrs, Aug 22 '17 at 09:49
A cluster is a set of proteins. There is no assumption about their interaction. I updated the question accordingly. – aluriak, Aug 22 '17 at 09:53
And the annotations you refer to, what are they about? Sorry for all these questions, but the more you explain, the easier it is for me to think of an answer that fits your case. – llrs, Aug 22 '17 at 10:00
The proteins -> annotations mapping was initially performed by DAVID. I think each database would have its own way to perform that mapping. I do not want to perform that step myself; as for the enrichment analysis itself, I will do it myself with some scripting only if it's necessary. AFAIK, the annotations considered were GO annotations. I updated the question. No problem; I understand the value of your questions. – aluriak, Aug 22 '17 at 10:31

benn · Answer 1 · 2017-08-22T10:43:13.063

3

I would recommend R instead of sticking to websites.

There are many tools for enrichment analysis. I like to use GOseq, which was initially made for RNA-seq data but can also be used for proteins (I have used it for proteomics). You can use the length of the protein instead of the transcript length (to correct for length bias), or you can leave out the length information and just use the hypergeometric method.

See the GOseq manual for code.
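To illustrate the hypergeometric method mentioned above, here is a minimal sketch in plain Python. It is not GOseq and applies no length correction. The function name `hypergeom_enrichment_pvalue` and the toy numbers are assumptions made for this example. It computes the probability of seeing at least the observed number of annotated proteins in a cluster when the cluster is drawn at random from the background.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_background, n_annotated, n_cluster, n_overlap):
    """One-sided hypergeometric p-value for over-representation.

    n_background: number of proteins in the background (reference) set
    n_annotated:  number of background proteins carrying the term
    n_cluster:    number of proteins in the cluster
    n_overlap:    number of cluster proteins carrying the term
    Returns P(X >= n_overlap) for a draw without replacement.
    """
    max_overlap = min(n_annotated, n_cluster)
    total = comb(n_background, n_cluster)
    # Sum the probability mass of every overlap at least as large as observed.
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_cluster - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 20 background proteins, 5 carry the term,
# the cluster holds 4 proteins and 3 of them carry the term.
p_value = hypergeom_enrichment_pvalue(20, 5, 4, 3)
print(f"p = {p_value:.4f}")
```

This is the same one-sided test as a Fisher exact test on the corresponding 2x2 table. With many terms and clusters, the resulting p-values still need a multiple-testing correction.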

Subsequently you can use gogadget, a tool I made for visualization of GOseq results with a heatmap or cytoscape network.

Additionally, you might wanna take a look at clusterProfiler. With this package you can do enrichment analysis for many clusters in parallel.
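The idea of testing many clusters in one go can be sketched without any package. The following plain-Python example is not clusterProfiler's implementation. The function names, the data layout (dictionaries of sets) and the toy identifiers are all assumptions. It runs a hypergeometric test for every cluster/term pair with a non-empty overlap, then applies a Benjamini-Hochberg adjustment across all tests.

```python
from math import comb


def hypergeom_tail(n_background, n_annotated, n_cluster, n_overlap):
    """P(X >= n_overlap) under the hypergeometric distribution."""
    total = comb(n_background, n_cluster)
    upper = min(n_annotated, n_cluster)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_cluster - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrich_clusters(clusters, annotations, background):
    """Test every cluster against every term found in the background.

    clusters:    dict, cluster name -> set of protein IDs
    annotations: dict, protein ID -> set of terms
    background:  set of protein IDs used as the reference
    Returns a list of (cluster, term, overlap, p-value, adjusted p-value).
    """
    # Invert the protein -> terms mapping, restricted to the background.
    term_to_proteins = {}
    for protein in sorted(background):
        for term in annotations.get(protein, ()):
            term_to_proteins.setdefault(term, set()).add(protein)

    rows = []
    for name in sorted(clusters):
        members = clusters[name] & background
        for term in sorted(term_to_proteins):
            annotated = term_to_proteins[term]
            overlap = len(members & annotated)
            if overlap == 0:
                continue  # design choice: only test terms seen in the cluster
            p = hypergeom_tail(len(background), len(annotated), len(members), overlap)
            rows.append((name, term, overlap, p))

    adjusted = benjamini_hochberg([row[3] for row in rows])
    return [row + (q,) for row, q in zip(rows, adjusted)]


# Toy data (hypothetical identifiers and terms).
background = {"P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"}
annotations = {
    "P1": {"term_A"}, "P2": {"term_A"}, "P3": {"term_A"},
    "P4": {"term_B"}, "P5": {"term_B"}, "P6": {"term_B"},
}
clusters = {"c1": {"P1", "P2", "P3"}, "c2": {"P5", "P6"}}

for row in enrich_clusters(clusters, annotations, background):
    print(row)
```

Skipping pairs with no overlap reduces the number of tests that enter the correction. Whether that is appropriate depends on the analysis, so treat it as one option rather than a rule.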

edited Aug 22 '17 at 10:43

answered Aug 21 '17 at 11:26

benn


Under what circumstances is it appropriate to correct for bias using the length of the protein? – Chris Mungall Aug 21 '17 at 22:36
As I understand the bias correction in GOseq, if there is no bias, a correction does not harm the analysis (because there is nothing to correct for). See the previous discussion on Bioconductor. However, I often use the hypergeometric test, because the researchers (biologists) I work with prefer a test that they understand. The hypergeometric (Fisher exact) test is something many biologists understand. – benn Aug 22 '17 at 07:50
@cmungall This would be a good question on its own. Please consider asking it – llrs Aug 22 '17 at 09:37

score 3 · Accepted Answer · answered Aug 23 '17 at 10:49

DAVID depends on several databases from the Gene Ontology Consortium, Reactome, KEGG, and others; most of them are accessible via Bioconductor. To perform an enrichment analysis, you can have a look at the tutorials of the several Bioconductor packages that do this.
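If you would rather build the protein -> GO term mapping yourself from such a database, one possible route (an assumption here, not something DAVID or Bioconductor requires) is to read a GO annotation file in GAF format, as distributed by the Gene Ontology Consortium. In that tab-separated format, lines starting with `!` are comments, column 2 is the object identifier, column 4 the qualifier and column 5 the GO ID. The function name `read_gaf` and the sample lines below are made up for illustration.

```python
import io


def read_gaf(handle):
    """Build a protein ID -> set of GO IDs mapping from GAF-formatted lines."""
    mapping = {}
    for line in handle:
        # Skip header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed line, ignore it
        protein_id, qualifier, go_id = fields[1], fields[3], fields[4]
        # A "NOT" qualifier negates the annotation, so leave it out.
        if "NOT" in qualifier.split("|"):
            continue
        mapping.setdefault(protein_id, set()).add(go_id)
    return mapping


# Made-up sample lines (hypothetical identifiers), truncated to the first
# five columns for brevity; real GAF files have more columns.
sample = io.StringIO(
    "!gaf-version: 2.2\n"
    "DB\tPROT1\tsym1\tinvolved_in\tGO:0000001\n"
    "DB\tPROT1\tsym1\tenables\tGO:0000002\n"
    "DB\tPROT2\tsym2\tNOT|involved_in\tGO:0000001\n"
    "DB\tPROT3\tsym3\tlocated_in\tGO:0000003\n"
)
protein_to_go = read_gaf(sample)
print(protein_to_go)
```

The resulting dictionary can serve as the annotation input for whatever enrichment test you run afterwards.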

Some of the most important packages for analyzing enrichment in GO terms are topGO, goseq, and GOstats. I would also recommend GOSemSim if you want to compare GO terms with one another in order to focus on specific GO terms.

Other important packages are fgsea, to test any kind of gene set (it is similar to the GSEA tool hosted by the Broad Institute); GSVA, for enrichment analysis by sample; and limma, which has some functions for functional enrichment too. piano, GSCA, and SPIA are also worth mentioning.

Bioconductor has "standard" data sets of the expression of some cells, such as airway and ALL, which are frequently used in vignettes. They are not reference data sets, because there is no reference expression for a cell of an organism: it depends on the type of cell, the experiment, the conditions, and so on.
