1

I am fairly new to machine learning. I recently built a binary classification model whose architecture is an MLP with two hidden layers. From a protein sequence, I predict the probability of each residue being mutated ("1") or not ("0"). I have a question about the output the model produces for this binary classification task.

A binary classifier can give me the output in two forms: the binary class itself, i.e. "0" or "1", and the probability of a residue being "0" or "1". I want to know on what basis this probability is calculated. For example, suppose there are 100 students and 4 of them pass the exam. Then the probability of passing the exam is 4/100 or 0.04, and we say that "out of 100 students" 4 pass the exam. Now I have an amino-acid sequence of length 700, and my model gives me the probability of each residue being mutated ("1") or not ("0"). Suppose I get a probability of 0.356 for a particular residue, say at position 10. I know this means the residue at position 10 has a 35.6% chance of being mutated. But can I say that "out of 700 residues" the residue at position 10 has that probability of being mutated? If so, the probability would instead be 1/700, which is less than 0.356.

I understand that binary classification implies a Bernoulli distribution and that every position gets its own value. My concern is: on what basis is the probability calculated? How does the model compute the probability for a particular position, say out of a total of 100 values?
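To make the question concrete: an MLP classifier does not count frequencies over the sequence at all. Each residue's feature vector is mapped to a real-valued score (a logit), and a sigmoid squashes that score into (0, 1). A minimal sketch of that final step, with made-up logits in place of a trained network's outputs:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical final-layer scores (logits) for three residues.
logits = np.array([-0.6, 0.0, 2.0])
probs = sigmoid(logits)

# Each probability is the model's per-residue estimate P(y=1 | features),
# not a fraction of the 700-residue sequence.
print(probs.round(3))  # ≈ [0.354, 0.5, 0.881]
```

So 0.356 at position 10 is a statement about that residue alone, conditioned on its features; it is not "1 out of 700".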

2 Answers

1

Then the probability of passing the exam is 4/100 or 0.04, and we say that "out of 100 students" 4 pass the exam.

This is only one of several definitions of probability; see, for example, the How exactly do Bayesians define (or interpret?) probability? thread for another one. Probability is a number between 0 and 1 that satisfies specific axioms; it does not necessarily need to reflect relative frequencies in the data. Empirical frequencies cannot even be calculated in every case: for example, "the probability that Trump gets re-elected a second time" concerns an event that either happens or does not, so you cannot count "how often" it happens.

Another issue is probability calibration. Even in cases where you expect the probabilities to reflect empirical frequencies, some models estimate them poorly and their outputs need to be corrected.
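A minimal sketch of such a correction with scikit-learn's `CalibratedClassifierCV`, assuming synthetic data in place of real residue features and a naive Bayes model as an example of a classifier whose raw probabilities are often poorly calibrated:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV

# Synthetic stand-in for per-residue features and mutation labels.
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)  # often over-confident
cal = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_tr, y_tr)

# Compare predicted probabilities before and after calibration.
print(raw.predict_proba(X_te)[:3, 1])
print(cal.predict_proba(X_te)[:3, 1])
```

After calibration, a predicted 0.356 is much closer to meaning "about 35.6% of residues with this score are actually mutated".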

So the safest way to think of the predicted probabilities is as a continuous score, "the higher the better": the higher the score, the more "likely" the positive class.
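Using the outputs only as a ranking score can be sketched like this, with hypothetical per-residue probabilities:

```python
import numpy as np

# Hypothetical per-residue probabilities for a short sequence.
probs = np.array([0.356, 0.12, 0.87, 0.44, 0.05])

# Rank residue positions from most to least likely to be mutated;
# only the ordering is trusted here, not the absolute values.
ranking = np.argsort(probs)[::-1]
print(ranking)  # → [2 3 0 1 4]
```

This ordering is valid even when the raw probabilities are miscalibrated, as long as the calibration is monotonic.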

Tim
  • 138,066
0

An MLP outputs the probability by passing a weighted combination of the features through its hidden layers and a final sigmoid activation. I had a similar problem, and the best way to check whether these outputs behave like real probabilities is to plot their distribution for the test set; you can also transform the outputs with scalers or calibration so that they follow a known probability distribution.
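The distribution check described above might look like the following sketch, using `np.histogram` in place of an actual plot and random draws as a stand-in for real model outputs on a 700-residue test sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical predicted probabilities for a 700-residue test set;
# in practice these would come from model.predict_proba(X_test).
probs = rng.beta(2, 5, size=700)

# Bin the outputs over [0, 1]; with matplotlib you would plot the
# same thing via plt.hist(probs, bins=10, range=(0, 1)).
counts, edges = np.histogram(probs, bins=10, range=(0.0, 1.0))
print(counts)
print(edges)
```

A distribution piled up near 0 and 1, or tightly clustered in the middle, is a hint that the raw outputs may need calibration before being read as probabilities.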