Machine learning using protein-sequences

Question

I'm participating in a bioinformatics machine-learning seminar at my university. The main task is binary classification of protein-protein interactions, using sequence data as input.

One of the subtasks is to get familiar with the dataset and present it. Now I'm wondering what additional information I can get out of the sequence data.

I want to start out with binary classification of the protein interaction with the labels “Interact” and “Non-Interact”. The dataset I was given consists of two FASTA files. One contains the sequence headers with the species and the corresponding amino-acid sequences (~300 entries). The other contains the exact same species headers and the label “Interact” or “Non-interact”. Species are completely mixed.

Is there anything interesting that I can additionally include in my analysis? I'm used to exploratory data analysis, but this is different due to the nature of protein sequences.

Hi there, it would perhaps be good to describe the problem a bit more. Do I get it right that you have a dataset of amino acid chains and now you want to predict which proteins interact? Do you have the positions of the corresponding genes in the genome? Expression data? Do you have species? What is your training dataset?
These might be completely irrelevant questions since I am not a proteomic person. However, I don't really understand the question as it is right now. — Kamil S Jaron, Oct 14 '18 at 12:33
Hi Kamil, thanks for the follow up question!
Yup, you get the problem right! I wanna start out with binary classification of the protein interaction with the labels “Interact” and “Non-Interact”. The dataset I was given consists of two FASTA files. One contains the sequence headers with the species and the corresponding amino-acid sequences (~300 entries). The other contains the exact same species headers and the label “Interact” or “Non-interact”.

Species are completely mixed. — Olli B., Oct 16 '18 at 20:36
Great, we like juicy details. Could you perhaps edit your question and add them there? — Kamil S Jaron, Oct 17 '18 at 07:55
Any specific kind of machine learning algorithm you intend or have to use? — Florian, Oct 29 '18 at 21:41

score 8 · Accepted Answer · answered Oct 15 '18 at 19:30

There are two types of features which can be extracted from protein sequences: sequence-based features and (predicted) structure-based features. It is not possible to know in advance which, if any, of these features will be useful in a classification task. It may not be possible to build a classifier at all, or it may be very straightforward. To find out, a feature selection technique must be used, e.g. forward feature selection.
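As a rough illustration of forward feature selection, here is a minimal sketch in plain Python. It is not a recommendation of a particular classifier: the nearest-centroid classifier, the leave-one-out scoring, the function names and the toy data are all assumptions made for the example. In practice you would plug in whatever model and cross-validation scheme the seminar asks for.

```python
from statistics import mean


def loo_accuracy(rows, labels, cols):
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen columns."""
    correct = 0
    for i in range(len(rows)):
        # Build one centroid per class from every sample except sample i.
        centroids = {}
        for label in set(labels):
            members = [rows[j] for j in range(len(rows)) if j != i and labels[j] == label]
            if members:
                centroids[label] = [mean(m[c] for m in members) for c in cols]
        point = [rows[i][c] for c in cols]
        # Predict the class whose centroid is closest (squared Euclidean distance).
        predicted = min(
            centroids,
            key=lambda lab: sum((p - q) ** 2 for p, q in zip(point, centroids[lab])),
        )
        correct += predicted == labels[i]
    return correct / len(rows)


def forward_select(rows, labels, max_features=None):
    """Greedily add the feature that most improves the score; stop when nothing helps."""
    remaining = list(range(len(rows[0])))
    selected, best_score = [], 0.0
    while remaining and (max_features is None or len(selected) < max_features):
        score, col = max((loo_accuracy(rows, labels, selected + [c]), c) for c in remaining)
        if score <= best_score:
            break
        selected.append(col)
        remaining.remove(col)
        best_score = score
    return selected, best_score


# Toy data (made up): column 1 separates the classes, columns 0 and 2 are noise.
rows = [
    [0.3, 0.10, 5.0], [0.7, 0.20, 1.0], [0.5, 0.15, 3.0], [0.1, 0.05, 4.0],
    [0.4, 0.90, 4.5], [0.6, 0.80, 1.5], [0.2, 0.95, 3.5], [0.8, 0.85, 2.0],
]
labels = ["Non-interact"] * 4 + ["Interact"] * 4
selected, score = forward_select(rows, labels)
print(selected, score)  # [1] 1.0
```

With only ~300 entries, keep in mind that selecting features and estimating accuracy on the same samples gives optimistic scores, so a held-out set or nested cross-validation is worth considering.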

Sequence based features

Sequence based features are derived from sequence-only information.

The most commonly used features are the PseAAC (pseudo amino acid composition) style features, including:

Average Position of each amino acid - 20 features
Average Distance between each amino acid - 20 features
Motifs - 20 features, 20*20 features, and/or 20*20*20 features (depending on motif length)
OAAC (Overall Amino Acid Composition) - 20 features
SAAC (Split Amino Acid Composition) - 60+ features
AAIndex Composition - 1 to 567+ features
PCC (Physicochemical Composition) - ~13 features
PSSM (position-specific scoring matrix) encoded as 1, 20, 200, 400 or 600 features (i.e. PSI-BLAST output using NR or Swiss-Prot)
Average predicted residue burial/exposure level
Average predicted sequence entropy
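
To make a few of the simpler entries in this list concrete, here is a minimal sketch of the overall composition (20 features), a split composition (60 features) and the average position of each amino acid (20 features). The function names, the terminal length of 25 residues and the example sequence are my own assumptions for illustration; published SAAC variants differ in how they split the sequence.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Fraction of each standard amino acid in seq (20 values, in AMINO_ACIDS order).

    Non-standard letters (e.g. X, B, U) count towards the length but get no feature.
    """
    seq = seq.upper()
    n = len(seq) or 1  # avoid division by zero on an empty sequence
    return [seq.count(aa) / n for aa in AMINO_ACIDS]


def split_composition(seq, terminal=25):
    """Composition of the N-terminal, middle and C-terminal parts (3 * 20 = 60 values).

    The terminal length is an assumed parameter; for sequences shorter than
    2 * terminal the middle part is empty and the terminal parts overlap.
    """
    n_term, middle, c_term = seq[:terminal], seq[terminal:-terminal], seq[-terminal:]
    return composition(n_term) + composition(middle) + composition(c_term)


def average_positions(seq):
    """Mean relative position (0 = first residue, 1 = last) per amino acid; 0.0 if absent."""
    seq = seq.upper()
    span = max(len(seq) - 1, 1)
    result = []
    for aa in AMINO_ACIDS:
        positions = [i / span for i, res in enumerate(seq) if res == aa]
        result.append(sum(positions) / len(positions) if positions else 0.0)
    return result


# Made-up example sequence, not taken from the dataset in the question.
seq = "MKVLAAGA"
oaac = composition(seq)
saac = split_composition(seq, terminal=2)
avg_pos = average_positions(seq)
print(len(oaac), len(saac), len(avg_pos))  # 20 60 20
```

For a pair-level interaction label you still need to decide how to combine the two proteins' feature vectors (e.g. concatenation); that choice is not covered here.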

Structure based features (predicted)

From the sequence, you can either predict the 3D structure (by homology modelling or ab initio) or directly use a homologous structure. You can then extract predicted 3D features, such as:

3D size (i.e. max of x,y,z)
Calculated ASA/SASA (solvent-accessible surface area)
3D contacts (type, frequency, etc.)
any other local/global environmental information
CATH/SCOP classification
DSSP/STRIDE composition
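
Once you have a (predicted or homologous) structure, some of these features are simple to compute. The sketch below assumes you have already extracted one coordinate per residue (e.g. the C-alpha atom) and a per-residue secondary-structure string in DSSP-style one-letter codes; the function names, the distance cutoff and the toy inputs are assumptions for illustration, and obtaining the structure itself is not shown.

```python
from collections import Counter
from math import dist  # Python 3.8+


def bounding_box_size(coords):
    """Extent of the structure along x, y and z, plus the largest of the three."""
    extents = [max(axis) - min(axis) for axis in zip(*coords)]
    return extents, max(extents)


def contact_count(coords, cutoff=8.0):
    """Number of residue pairs closer than cutoff, skipping direct sequence neighbours.

    The cutoff (in the same unit as the coordinates, typically angstroms) is an
    assumed parameter; different studies use different values.
    """
    n = len(coords)
    return sum(
        1
        for i in range(n)
        for j in range(i + 2, n)
        if dist(coords[i], coords[j]) < cutoff
    )


def secondary_structure_composition(ss):
    """Fraction of each code in a per-residue secondary-structure string."""
    counts = Counter(ss)
    return {code: count / len(ss) for code, count in counts.items()}


# Made-up inputs: five residue positions and an eight-residue annotation string.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (7.6, 3.8, 0.0), (3.8, 3.8, 0.0)]
extents, largest = bounding_box_size(coords)
contacts = contact_count(coords, cutoff=6.0)
ss_fractions = secondary_structure_composition("CHHHHEEC")
print(largest, contacts, ss_fractions)
```

Features from a predicted structure inherit the errors of the prediction, so it is worth checking whether they actually help compared with the sequence-only features.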
