Get results of keyword search on Pfam via python script

Question

I'm interested in all proteins that are in any way associated with Danio rerio. I looked them up in the Pfam database, and a simple keyword search gives me a nice list: http://pfam.xfam.org/search/keyword?query=danio+rerio&submit=Submit. An important thing to note is that most of the associations for these protein families come from the seq_info column.

My question is: how can I extract that list from a Python script? Ultimately, I want a script that, when run, fetches the latest release, filters out the proteins that cannot be found in D. rerio, and does some additional processing.

I've been looking for a way to download files directly from the Pfam FTP server and mine them for proteins associated with D. rerio, but I haven't found one. I simply cannot see where they store the information about the species in which a protein can be found.

I would like to retrieve everything about the protein families of interest, because my next step is to find the DNA-binding proteins, and to filter out the ones of no interest I would like to have as much data as possible.

voiDnyx · Accepted Answer · 2018-08-24T11:53:48.357


If I understand your question correctly, you can use the Pfam proteome search to get the data you are looking for.

Go to the Pfam website -> Browse -> select the letter D from the proteomes category -> D. rerio

There you can find a download link under the section Domain composition:

You can download a list of all regions for this proteome from our FTP site.

The download link seems to be tied to the species ID, so if you want to automate your data retrieval you may be able to simply download the file from that link in your Python script.
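A minimal sketch of that download step, using only the standard library. The function name `download_file` and the output file name are made up for illustration, and the URL in the usage comment is a placeholder, not a real Pfam address: paste in the link you copy from the proteome page.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: str, timeout: float = 60.0) -> Path:
    """Stream the resource at `url` into the local file `dest` and return its path."""
    dest_path = Path(dest)
    # urlopen handles http(s)://, ftp:// and file:// URLs; the response is
    # copied in chunks, so a large file is never held in memory at once.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(dest_path, "wb") as out:
            shutil.copyfileobj(response, out)
    return dest_path


# Usage (placeholder URL, replace with the link from the proteome page):
# download_file("ftp://example.org/path/to/proteome_file.gz", "d_rerio_regions.gz")
```

Because the link appears to depend on the species ID, the same function should work for other proteomes once you substitute their link.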

edited Aug 24 '18 at 11:53

answered Aug 24 '18 at 11:48

voiDnyx


Exactly what I needed. The species ID is not that informative, but I think I can manage that more easily in the future. Thank you! – Đorđe Relić Aug 24 '18 at 13:00
One thing to note, though, is that seq_info at Pfam is retrieved from the UniProt database, so to retrieve that data one needs to build a mapping between the list of proteomes and the uniprot.gz file on the Pfam FTP server. The tricky part comes if the two databases are not in sync. But that's not an issue that should be discussed here. – Đorđe Relić Aug 24 '18 at 13:06
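The mapping mentioned in the last comment could be sketched roughly as below. This assumes both files are gzip-compressed, tab-separated text with the sequence accession in a known column; the real layouts are not described in this thread, so the function names, the `column` indices, and the `#` comment convention are all assumptions to adjust after inspecting the files.

```python
import gzip
from typing import Iterable, Iterator, List, Set


def read_accessions(path: str, column: int = 0, sep: str = "\t") -> Set[str]:
    """Collect the values found in `column` of a gzipped, delimited text file."""
    accessions: Set[str] = set()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and (assumed) '#' comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split(sep)
            if len(fields) > column:
                accessions.add(fields[column])
    return accessions


def filter_by_accession(
    lines: Iterable[str], wanted: Set[str], column: int = 0, sep: str = "\t"
) -> Iterator[List[str]]:
    """Yield the split fields of every line whose accession is in `wanted`."""
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) > column and fields[column] in wanted:
            yield fields
```

Accessions from the proteome file that never show up while filtering the second file would be a sign that the two databases are out of sync.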
