
I am trying to download the _protein.faa.gz files for genomes, given their accession numbers, from within Python. Ideally, I would like to do this without third-party libraries. All I have is a list of GCA or GCF accessions. The problem with the FTP site is that each directory name includes the assembly name along with the accession, and I do not know that name ahead of time. It would be perfect if I could run downloads from within Python that matched a pattern something like this:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/145/295/GCF_000145295.1_*/*_protein.faa.gz

Is it at all possible to do this? Or is it possible to run something similar via efetch in Biopython?

Thanks in advance!

Grimey

2 Answers


The above answer is right - here are some more links that describe what you want to do in more detail:

You're going to want to use the Biopython package, as it has functions that do exactly what you're looking for.

rimo

Concept code only

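from Bio import Entrez, SeqIO

Entrez.email = "m@M__"  # replace with your own e-mail address

# efetch takes all IDs in a single id= argument (comma-separated string or list).
# The IDs must be valid for the chosen db: GCF_/GCA_ are assembly accessions,
# so treat the ones below as placeholders for protein IDs.
handle = Entrez.efetch(db="protein", id="GCF_000145295,GCF_000145294,GCF_000145293",
                       rettype="gb", retmode="text")

# GenBank flat-file text is read with SeqIO.parse; Entrez.parse is for XML only
for record in SeqIO.parse(handle, "genbank"):
    print(record)
handle.close()  # close only after the records have been read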

If the accessions are in a list, join them into one comma-separated string and pass that string as the id= argument.

mylist = ['GCF_000145295', 'GCF_000145294', 'GCF_000145293']
flattened = ','.join(mylist)
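
If Biopython is not an option, the URL pattern from the question can probably be resolved with the standard library alone: build the parent directory from the accession digits, read the HTML index of that directory, and pick out the entry that starts with the accession. This is only a sketch. The function names are made up for illustration, and it assumes the index page lists each assembly directory as a relative href such as href="GCF_000145295.1_<name>/", which is worth checking against the live page before relying on it.

```python
import re
import urllib.request

BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def accession_dir(accession):
    """Map e.g. 'GCF_000145295.1' to the URL of its parent directory."""
    prefix, digits = accession.split(".")[0].split("_")
    # The digits are split into groups of three: 000/145/295
    parts = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    return "/".join([BASE, prefix, *parts]) + "/"


def assembly_names(listing_html, accession):
    """Extract directory names starting with the accession from an index page.

    Assumes relative links of the form href="<accession>..._<name>/".
    """
    pattern = r'href="(' + re.escape(accession) + r'[^"/]*)/?"'
    return sorted(set(re.findall(pattern, listing_html)))


def protein_urls(accession):
    """Return candidate *_protein.faa.gz URLs for one accession (network I/O)."""
    parent = accession_dir(accession)
    with urllib.request.urlopen(parent) as response:
        html = response.read().decode("utf-8", errors="replace")
    return [f"{parent}{name}/{name}_protein.faa.gz"
            for name in assembly_names(html, accession)]


def download_proteins(accessions):
    """Download every matching file into the current directory."""
    for accession in accessions:
        for url in protein_urls(accession):
            urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])
```

Calling download_proteins(mylist) would then fetch one file per matching directory. An accession given without a version suffix matches every version present in the directory, and not every assembly necessarily has a _protein.faa.gz file, so a failed download should be expected and handled for some entries.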
M__