Downloading genomic protein files from accessions in Python

Question

I am trying to download the _protein.faa.gz files for genomes given their accession numbers through Python. Ideally, I would like to do this without third-party libraries. Essentially, all I have is a list of GCA or GCF accessions. The issue with the FTP site is that each directory name includes the project name along with the accession, which I do not have ahead of time. It would be perfect if I could run downloads from within Python that matched a pattern something like this:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/145/295/GCF_000145295.1_*/*_protein.faa.gz

Is it at all possible to do this? Or is it possible to run something similar via efetch in Biopython?

Thanks in advance!

score 3 · Answer 1 · answered Sep 20 '22 at 15:18

The above answer is right - here are some more links that describe what you want to do in more detail:

You're going to want to use the Biopython package, as it has functions that do exactly what you're looking for.

M__ · Accepted Answer · 2022-07-04T21:31:14.373

Concept code only

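from Bio import Entrez, SeqIO
Entrez.email = "m@M__"  # replace with your own email address
# Concept only: GCF_/GCA_ are assembly accessions, so the protein database
# may need protein accessions (or a lookup step) instead of these ids.
# efetch takes multiple ids as a single comma-separated string.
handle = Entrez.efetch(db="protein", id="GCF_000145295,GCF_000145294,GCF_000145293", retmode="text", rettype="gb")
# retmode="text" with rettype="gb" returns GenBank flat files, which SeqIO
# can parse; Entrez.parse expects XML. Read the records before closing the handle.
for record in SeqIO.parse(handle, "gb"):
    print(record)
handle.close()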

If your accessions are in a list, join them into a single comma-separated string and pass that string to the id= argument.

mylist = ['GCF_000145295', 'GCF_000145294', 'GCF_000145293']
flattened = ','.join(mylist)
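
For the standard-library route the question asks about, here is a minimal sketch rather than a tested solution. It assumes the directory layout shown in the question's URL (the nine accession digits split into three path components), that the HTTPS site returns an HTML index with one href per assembly directory, and that each file name starts with its directory name. The function names accession_dir_url, assembly_names and download_protein_faa are hypothetical helpers made up for this example.

```python
import os
import re
import urllib.request

BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def accession_dir_url(accession):
    """Build the directory URL that holds all versions of an assembly.

    Accepts 'GCF_000145295' or 'GCF_000145295.1'.
    """
    match = re.fullmatch(r"(GC[AF])_(\d{9})(?:\.\d+)?", accession)
    if match is None:
        raise ValueError(f"not a GCA/GCF accession: {accession!r}")
    prefix, digits = match.groups()
    # Split the nine digits into three groups of three: 000/145/295
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([BASE, prefix, *parts]) + "/"


def assembly_names(listing_html, accession):
    """Return sub-directory names in an HTML index that belong to the accession.

    Assumes entries look like href="GCF_000145295.1_<name>/".
    """
    pattern = r'href="(' + re.escape(accession) + r'[._][^"/]*)/"'
    return sorted(set(re.findall(pattern, listing_html)))


def download_protein_faa(accession, dest_dir="."):
    """Download the _protein.faa.gz file of every matching assembly.

    Needs network access; returns the list of local paths written.
    """
    dir_url = accession_dir_url(accession)
    with urllib.request.urlopen(dir_url) as response:
        listing = response.read().decode("utf-8", errors="replace")
    saved = []
    for name in assembly_names(listing, accession):
        # Assumption: files are named <directory name>_protein.faa.gz
        filename = f"{name}_protein.faa.gz"
        target = os.path.join(dest_dir, filename)
        urllib.request.urlretrieve(f"{dir_url}{name}/{filename}", target)
        saved.append(target)
    return saved
```

Calling download_protein_faa("GCF_000145295") would then stand in for the wildcard in the question's URL: the directory listing supplies the part of the name you do not know ahead of time. Not every assembly has a protein file, so in practice you would probably want to catch urllib.error.HTTPError around the download.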
