Retrieving NCBI Taxa IDs from refseq or GenBank assembly accession

Question

I have about 10,000 genome files all named by either refseq or genbank accession number, do you know if it's possible to convert these numbers to the corresponding NCBI taxon ID or species?

for example:

GCA_000005845.2 to 79781

In the case of E.coli

Edit:

The file names look like this:

GCF_000337855.1_ASM33785v1_protein.faa.gz
GCF_000091125.1_ASM9112v1_protein.faa.gz
GCF_000184535.1_ASM18453v1_protein.faa.gz

My operating system is Ubuntu 16.04

Hi and welcome to the site. Please [edit] your question and show us a few examples of the actual file names. If you have both refseq and genbank accessions, show examples of both. Also, please mention your operating system since some solutions might depend on it. — terdon, Jun 28 '18 at 12:27
Thanks for the edit. So, are your filenames always accession.versionassemblyprotein.faa.gz? — terdon, Jun 28 '18 at 12:35
Yes, although they are all in different folders, for example: /refseq/archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa.gz — Biomage, Jun 28 '18 at 12:38

score 2 · Answer 1 · answered Oct 27 '18 at 02:07

2

You can use eutils to get this information as shown below:

esearch -db assembly -q 'GCF_000337855.1' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,Taxid
GCF_000337855.1 469382

answered Oct 27 '18 at 02:07

vkkodali

1,266
6
9

score 1 · Answer 2 · answered Jul 03 '23 at 23:35

As the question mentions python, here is a python example using biopython.

import json
from Bio import Entrez
def accession2taxid(acc: str, db="nucleotide") -> str:
    handle = Entrez.esearch(db=db, term=acc)
    record = Entrez.read(handle)
    gi = record["IdList"][0]
    handle = Entrez.esummary(db=db, id=gi, retmode="json")
    result = json.load(handle)["result"]
    taxid = result[gi]["taxid"]
    return str(taxid)

So if you want to use an assembly accession, as in your example

>>> accession2taxid("GCA_000005845.2", db="assembly")
'511145'

In your example, you said that GCA_000005845.2 should return 79781, but that Taxon ID is for "uncultured archaeon AOC672". The taxid in the example I just showed returns "Escherichia coli str. K-12 substr. MG1655", which is what you want.

Your question also mentions GenBank accessions, so here is another example

>>> accession2taxid("CU458896.1", db="nucleotide")
'561007'

score 1 · Accepted Answer · answered Jul 02 '18 at 10:44

Turns out I'd already written some code that did it, human memory is a funny thing.

This uses biopython to split the field description to where the species is. May not work for all NCBI files, but seems to work on most.

import Bio
from Bio import SeqIO
from Bio import AlignIO    

for record in SeqIO.parse (FILE, "fasta"):
    Speciesname = record.description.split('[', 1)[1].split(']', 1)[0]

Retrieving NCBI Taxa IDs from refseq or GenBank assembly accession

3 Answers3