
I would like to gather protein FASTA sequences from Entrez with Python 2.7. I am looking for any proteins that have the keywords "terminase" and "large" in their name. So far I have this code:

from Bio import Entrez
Entrez.email = "example@example.org"


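# Search the protein database for records matching both keywords
searchResultHandle = Entrez.esearch(db="protein", term="terminase large", retmax=1000)
searchResult = Entrez.read(searchResultHandle)
searchResultHandle.close()
ids = searchResult["IdList"]

# Download the matching records in FASTA format
handle = Entrez.efetch(db="protein", id=ids, rettype="fasta", retmode="text")
record = handle.read()
handle.close()

# Write the sequences to disk; the with-block closes the file
with open('myfasta.fasta', 'w') as out_handle:
    out_handle.write(record.rstrip('\n'))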

However, this returns terminases from many different organisms, while I need only terminases from bacteriophages (specifically Viruses [taxid 10239], host: bacteria). I've managed to get the nuccore accession IDs from NCBI for the viruses I am interested in, but I don't know how to combine those two pieces of information. The ID file looks like this:

NC_001341
NC_001447
NC_028834
NC_023556
...

Do I need to access every gb file of every ID and search for my desired protein in it?

tahunami
  • Welcome to Bioinformatics! Did you try adding "[taxid 10239], host: bacteria" to the Entrez search? From the retrieved FASTA records you could also check whether each ID is of interest to you and omit those that aren't – llrs Jul 25 '17 at 09:58
  • I tried, but I don't know if my syntax is correct. Should it be: term="terminase large bacteria[organism]"? – tahunami Jul 25 '17 at 10:56
  • The search term would be the same query you would enter on the NCBI page, that is '("Viruses"[Organism]) AND (host[All Fields] AND ("Bacteria"[Organism] OR "Bacteria Latreille et al. 1825"[Organism] OR bacteria[All Fields]))' – llrs Jul 25 '17 at 11:01

1 Answer


Found what I was looking for. I changed this call:

searchResultHandle = Entrez.esearch(db="protein", term="terminase large", retmax=1000)

to use a more specific search term and a higher retmax:

searchterm = "(terminase large subunit AND viruses[Organism]) AND Caudovirales AND refseq[Filter]"
searchResultHandle = Entrez.esearch(db="protein", term=searchterm, retmax=6000)

which narrowed my search down to the desired viruses. Admittedly it filters by taxonomic group rather than by host, but that is enough for my work.
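For completeness, here is a minimal sketch of how the refined search could be combined with the download step. The helper names (build_term, batches, fetch_fasta), the batch size of 200, and the output file name are my own illustrative choices, not part of Biopython or NCBI; only Entrez.esearch, Entrez.read and Entrez.efetch are real Biopython calls. Fetching in batches is an assumption on my part to keep each request small when thousands of IDs come back.

```python
def build_term(name, organism, taxon, refseq_only=True):
    """Assemble an Entrez query string (hypothetical helper)."""
    term = "({0} AND {1}[Organism]) AND {2}".format(name, organism, taxon)
    if refseq_only:
        term += " AND refseq[Filter]"
    return term


def batches(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_fasta(term, out_path, email, retmax=6000, batch_size=200):
    """Search the protein database and write all hits to one FASTA file.

    Needs network access. Returns the number of IDs found.
    """
    # Imported here so the helpers above work without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    search_handle = Entrez.esearch(db="protein", term=term, retmax=retmax)
    ids = Entrez.read(search_handle)["IdList"]
    search_handle.close()

    with open(out_path, "w") as out_handle:
        # batch_size=200 is an assumed value, chosen to keep requests small
        for batch in batches(ids, batch_size):
            handle = Entrez.efetch(db="protein", id=",".join(batch),
                                   rettype="fasta", retmode="text")
            out_handle.write(handle.read())
            handle.close()
    return len(ids)


# Example call (performs real requests against NCBI):
# term = build_term("terminase large subunit", "viruses", "Caudovirales")
# fetch_fasta(term, "myfasta.fasta", "example@example.org")
```

With the arguments shown in the commented example, build_term produces the same search term as above, so the sketch should return the same set of records as the one-off call.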

Thank you @Llopis for additional help

tahunami
  • Please remember to accept this answer. The system won't let you accept your own answer just after posting it, but you should be able to now. – terdon Jul 31 '17 at 10:40