I'm participating in a bioinformatics machine-learning seminar at my university. The main task is binary classification of protein-protein interactions, using sequence data as input.
One of the subtasks is to get familiar with the dataset and present it. I'm now wondering what additional information I can get out of the sequence data.
I want to start with binary classification of protein interactions, with the labels “Interact” and “Non-Interact”. The dataset I was given consists of two FASTA files. One contains the sequence headers with the species and the corresponding amino-acid sequences (~300 entries). The other contains exactly the same headers and, for each, the label “Interact” or “Non-Interact”. The species are completely mixed.
Is there something interesting that I can additionally include in my analysis? I'm used to exploratory data analysis though this is different due to the nature of the protein sequences.
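To make the setup concrete, here is a minimal sketch of a first exploratory pass over data shaped like this: class balance, sequence length, and amino-acid composition per label. It assumes the label file is FASTA-like, with the label in place of the sequence. The toy headers (`human_1`, `yeast_1`), the toy sequences, and the helper names `read_fasta` and `summarize` are made up for illustration and are not taken from the real files.

```python
from collections import Counter
from statistics import mean


def read_fasta(lines):
    """Parse FASTA-formatted lines into {header: body}; multi-line bodies are joined."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def summarize(sequences, labels):
    """Per label: number of entries, mean sequence length, residue frequencies."""
    grouped = {}
    for header, seq in sequences.items():
        label = labels[header]  # assumes both files share identical headers
        entry = grouped.setdefault(label, {"lengths": [], "residues": Counter()})
        entry["lengths"].append(len(seq))
        entry["residues"].update(seq)
    result = {}
    for label, entry in grouped.items():
        total = sum(entry["residues"].values())
        result[label] = {
            "n": len(entry["lengths"]),
            "mean_length": mean(entry["lengths"]),
            "composition": {aa: c / total for aa, c in entry["residues"].items()},
        }
    return result


# Toy stand-ins for the two files (hypothetical content).
seq_lines = [">human_1", "MKTAY", "IAK", ">yeast_1", "MSSG"]
label_lines = [">human_1", "Interact", ">yeast_1", "Non-Interact"]

sequences = read_fasta(seq_lines)
labels = read_fasta(label_lines)
stats = summarize(sequences, labels)

for label, s in stats.items():
    print(label, s["n"], s["mean_length"])
```

With real files, the line lists would be replaced by open file handles, for example `read_fasta(open(path))`, where `path` is wherever the data lives. The same grouping could be keyed on species instead of label if the species can be parsed out of the header.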
These might be completely irrelevant questions since I am not a proteomic person. However, I don't really understand the question as it is right now.
– Kamil S Jaron Oct 14 '18 at 12:33
Yup, you get the problem right! I wanna start out with binary classification for the protein interaction with the labels “Interact” and “Non-Interact”. The dataset that I got provided consists of two fasta files. One containing the sequence header with the species and the corresponding amino-acid sequences ~300 entries. The other containing the exact same species header and the label “Interact” or “Non-Interact”.
Species are completely mixed.
– Olli B. Oct 16 '18 at 20:36