Long-form gene summary

Question

The following code (thanks to these answers) retrieves gene information, in particular a long summary (such as "The membrane-associated protein encoded by this gene is a member of the superfamily of ATP-binding cassette... etc."), for a few mouse genes.

library(dplyr)
org.Mm.eg.db::org.Mm.egSYMBOL2EG %>% 
  as.data.frame() %>% 
  pull(gene_id) %>% 
  head() %>%  # !!!
  rentrez::entrez_summary(id = ., db = "gene") %>%
  { do.call("rbind", .) } %>%
  as.data.frame() %>%
  select(name, description, summary)

Note the head() call: the protocol/server does not allow large queries. How can I get the same information (including the long summary) for all mouse genes at once?

Chunk your query, then use purrr::insistently. No need for a loop, this can be vectorised. I also recommend against a fixed-length sleep; using an exponential backoff is more polite. — Konrad Rudolph, Jun 01 '21 at 13:51
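To make the suggestion in this comment concrete, here is a minimal sketch of chunking plus exponential backoff. It is written in Python rather than R and does not use purrr::insistently; it only illustrates the same idea. The names `chunked`, `with_backoff`, `fetch_all`, and the `fetch_chunk` callback are hypothetical and stand in for whatever function actually queries Entrez for one batch of IDs. The default chunk size and delays are arbitrary assumptions, not recommended values.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_backoff(call, *, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Run `call()`; after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_tries):
        try:
            return call()
        except Exception:
            if attempt == max_tries - 1:
                raise  # give up after the last allowed attempt
            sleep(base_delay * 2 ** attempt)


def fetch_all(ids, fetch_chunk, chunk_size=200, **backoff):
    """Query `fetch_chunk` (hypothetical) chunk by chunk and concatenate results."""
    results = []
    for chunk in chunked(ids, chunk_size):
        results.extend(with_backoff(lambda: fetch_chunk(chunk), **backoff))
    return results
```

The waiting time doubles after every failed attempt, so a temporarily overloaded server gets progressively more breathing room, while successful chunks are not delayed at all.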

score 1 · Answer 1 · answered Jun 01 '21 at 13:32

You could wrap the call to entrez_summary in a loop.
For example, to cut the data into chunks of five genes:

# gene.vector holds the IDs of the genes you are interested in
n.chunks <- ceiling(length(gene.vector) / 5)
# initializing an empty list with one slot per chunk
summary.list <- vector(mode = "list", length = n.chunks)
for (i in seq_len(n.chunks)) {
  # select the ith set of five genes from your gene vector
  goi <- gene.vector[(i - 1) * 5 + 1:5]
  # if there are fewer than five elements there will be NAs that need to be removed
  goi <- goi[!is.na(goi)]
  # adding your code
  summary.list[[i]] <- org.Mm.eg.db::org.Mm.egSYMBOL2EG %>%
    as.data.frame() %>%
    filter(gene_id %in% goi) %>% # here the current five genes are selected
    pull(gene_id) %>%
    rentrez::entrez_summary(id = ., db = "gene") %>%
    { do.call("rbind", .) } %>%
    as.data.frame() %>%
    select(name, description, summary)
  Sys.sleep(10) # wait for 10 seconds before the next request
}

This could probably also be done with dplyr grouping, but the loop is a little easier to debug.

I added Sys.sleep at the end to reduce the frequency of your requests to Entrez. Posting too many requests in a short time looks like a DDoS attack, and you (and potentially your whole IP range, including other people at your institution) could get banned from the service for a couple of hours. Maybe try it with a subset first.

If you really want to get the information for all mouse genes this probably will be rather slow. Alternatively, you can go to the entrez webpage and try to download the information as a list. Unfortunately, I don't know exactly where you will find this information.

score 1 · Answer 2 · answered Jun 02 '21 at 14:00

The following bash script downloads the summaries into an XML file.

#!/usr/bin/env bash
# This requires the NCBI tool gene2xml:
#   sudo apt install ncbi-tools-bin
DIR="download_cache"
mkdir -p "$DIR"
URL="ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN_BINARY/Mammalia/Mus_musculus.ags.gz"
wget -S -O - "$URL" | gzip -d | gene2xml -b T -l T | gzip -9 > "$DIR/mm.xml.gz"
URL="ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN_BINARY/Mammalia/Homo_sapiens.ags.gz"
wget -S -O - "$URL" | gzip -d | gene2xml -b T -l T | gzip -9 > "$DIR/hs.xml.gz"

These XML files can be converted to tables along these lines:


import xml.etree.ElementTree as ET
import gzip
from pathlib import Path
from tcga.utils import mkdir, unlist1, relpath


def parse(src):
    def find_text(e, path):
        # Return the text of the single matching element, or "" if there is none
        e = e.findall(path)
        if e:
            return unlist1(e).text
        else:
            return ""

    # Header row
    yield ('gene_id', 'description', 'summary')

    with gzip.open(src, mode='r') as fd:
        e: ET.Element
        # Stream the file instead of loading the whole XML tree into memory
        for (x, e) in ET.iterparse(fd, events=['end']):
            if e.tag == "Entrezgene":
                gene_id = find_text(e, "**/Gene-track_geneid")
                descptn = find_text(e, "**/Gene-ref_desc")
                summary = find_text(e, "Entrezgene_summary")

                yield (gene_id, descptn, summary)

                # Free the memory held by the processed record
                e.clear()


def main():
    for src in Path(__file__).parent.glob("download_cache/*.xml.gz"):
        trg = mkdir(Path(__file__).with_suffix('')) / Path(src.stem).with_suffix('.tsv.gz')
        if trg.is_file():
            print("Skipping existing file:", relpath(trg))
        else:
            print("Writing:", relpath(trg))
            with gzip.open(trg, mode='wt') as fd:
                for t in parse(src):
                    # Unpack the tuple so its fields are written tab-separated
                    print(*t, sep='\t', file=fd)


if __name__ == '__main__':
    main()
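
As a possible follow-up, the resulting tables can be loaded back with the standard library alone. This is a minimal sketch, assuming each file is a gzipped, tab-separated table whose first row is the header gene_id, description, summary; the function name `read_summaries` is made up for illustration.

```python
import csv
import gzip


def read_summaries(path):
    """Map gene_id -> row dict, read from a gzipped TSV with a header row."""
    with gzip.open(path, mode="rt", newline="") as fd:
        # QUOTE_NONE: treat quote characters in the summaries as plain text
        reader = csv.DictReader(fd, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row["gene_id"]: row for row in reader}
```

Given such a file, `read_summaries(path)[gene_id]["summary"]` then yields the long summary of one gene, or an empty string if the record has none.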

Btw., a table of vertebrate gene homology is available here (it's just a tab-separated table), and can be used to impute summaries by homology.
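
Imputing by homology could look roughly like the sketch below. The column layout of the homology table is not given here, so the sketch assumes it has already been reduced to a plain dict from mouse gene ID to human gene ID; that dict, the function name `impute_by_homology`, and its parameters are all hypothetical.

```python
def impute_by_homology(summaries, homologs, donor_summaries):
    """Fill empty summaries with the summary of the homologous gene, if any.

    summaries:       gene ID -> summary text ("" if missing), e.g. mouse
    homologs:        gene ID -> homologous gene ID (assumed precomputed)
    donor_summaries: homologous gene ID -> summary text, e.g. human
    """
    imputed = dict(summaries)  # leave the input untouched
    for gene_id, text in summaries.items():
        if text:
            continue  # keep existing summaries as they are
        donor = homologs.get(gene_id)
        if donor and donor_summaries.get(donor):
            imputed[gene_id] = donor_summaries[donor]
    return imputed
```

Genes without a homolog, or whose homolog also lacks a summary, simply keep their empty summary.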
