3

I have about 10,000 genome files all named by either refseq or genbank accession number, do you know if it's possible to convert these numbers to the corresponding NCBI taxon ID or species?

for example:

GCA_000005845.2 to 79781

In the case of E.coli

Edit:

The file names look like this:

GCF_000337855.1_ASM33785v1_protein.faa.gz
GCF_000091125.1_ASM9112v1_protein.faa.gz
GCF_000184535.1_ASM18453v1_protein.faa.gz

My operating system is Ubuntu 16.04

Biomage
  • 173
  • 7
  • Hi and welcome to the site. Please [edit] your question and show us a few examples of the actual file names. If you have both refseq and genbank accessions, show examples of both. Also, please mention your operating system since some solutions might depend on it. – terdon Jun 28 '18 at 12:27
  • Thanks for the edit. So, are your filenames always accession.versionassemblyprotein.faa.gz? – terdon Jun 28 '18 at 12:35
  • Yes, although they are all in different folders, for example: /refseq/archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa.gz – Biomage Jun 28 '18 at 12:38

3 Answers3

2

You can use eutils to get this information as shown below:

esearch -db assembly -q 'GCF_000337855.1' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,Taxid
GCF_000337855.1 469382
vkkodali
  • 1,266
  • 6
  • 9
1

As the question mentions python, here is a python example using biopython.

import json
from Bio import Entrez

def accession2taxid(acc: str, db="nucleotide") -> str: handle = Entrez.esearch(db=db, term=acc) record = Entrez.read(handle) gi = record["IdList"][0] handle = Entrez.esummary(db=db, id=gi, retmode="json") result = json.load(handle)["result"] taxid = result[gi]["taxid"] return str(taxid)

So if you want to use an assembly accession, as in your example

>>> accession2taxid("GCA_000005845.2", db="assembly")
'511145'

In your example, you said that GCA_000005845.2 should return 79781, but that Taxon ID is for "uncultured archaeon AOC672". The taxid in the example I just showed returns "Escherichia coli str. K-12 substr. MG1655", which is what you want.

Your question also mentions GenBank accessions, so here is another example

>>> accession2taxid("CU458896.1", db="nucleotide")
'561007'
Michael Hall
  • 663
  • 4
  • 11
1

Turns out I'd already written some code that did it, human memory is a funny thing.

This uses biopython to split the field description to where the species is. May not work for all NCBI files, but seems to work on most.

import Bio
from Bio import SeqIO
from Bio import AlignIO    

for record in SeqIO.parse (FILE, "fasta"):
    Speciesname = record.description.split('[', 1)[1].split(']', 1)[0]
Biomage
  • 173
  • 7