I'm interested in all proteins that are in any way associated with Danio rerio. I decided to look them up in the Pfam database, and when I do a simple keyword search I get a nice list that looks like this: http://pfam.xfam.org/search/keyword?query=danio+rerio&submit=Submit. The important thing to note is that most of the associations for these protein families are found in the seq_info column.
My question is: how can I extract that list from a Python script? Moreover, I want to write a script that, when run, fetches the latest release, filters out the proteins that cannot be found in D. rerio, and does some additional processing.
I've been searching for ways to download files directly from the Pfam FTP server and mine them for the proteins that are associated with D. rerio, but I haven't found a single way to do it. I simply cannot see where they store the information about the species in which a protein can be found.
I would like to retrieve everything about the protein families of interest, because my next step is to find the DNA-binding proteins, and to filter out the ones of no interest I would like to have as much data as possible.
The seq_info at Pfam is retrieved from the UniProt database, so in order to retrieve that data, one needs to map the list of proteomes to the uniprot.gz file on the Pfam FTP server. The tricky part comes if the two databases are not synced, but that is not an issue that should be discussed here. – Đorđe Relić Aug 24 '18 at 13:06
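
The mapping described in the comment above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the way Pfam itself builds seq_info: it assumes that uniprot.gz is a gzipped FASTA file, that the first whitespace-delimited token of each header is the UniProt accession (optionally followed by a ".version" suffix), and that the D. rerio accessions have already been collected into a Python set from the proteome list. The function names are hypothetical and the header layout should be checked against the actual file before relying on this.

```python
import gzip
from typing import Iterable, Iterator, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def accession_of(header: str) -> str:
    """Extract the accession from a FASTA header.

    ASSUMPTION: the accession is the first whitespace-delimited token,
    possibly with a '.version' suffix (e.g. 'Q9XYZ1.3'). Adjust this if
    the real headers are laid out differently.
    """
    tokens = header.split()
    return tokens[0].split(".")[0] if tokens else ""


def filter_by_accessions(
    lines: Iterable[str], wanted: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Keep only the records whose accession is in `wanted`."""
    for header, seq in read_fasta(lines):
        if accession_of(header) in wanted:
            yield header, seq


def filter_gzipped_fasta(
    path: str, wanted: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Stream a gzipped FASTA file without loading it all into memory."""
    with gzip.open(path, "rt") as handle:
        yield from filter_by_accessions(handle, wanted)
```

With a locally downloaded copy, something like `filter_gzipped_fasta("uniprot.gz", danio_accessions)` would then yield only the matching records, where `danio_accessions` is the hypothetical set built from the proteome list. Streaming the file line by line is a deliberate choice here, since the full sequence file is large.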