
I have data from 50 human subjects, who are divided into groups A and B (30 participants in group A and 20 in group B). I also have a range of measurements from each subject. I have used a machine learning algorithm (SVM) to predict the subject groups from the features using leave-one-out cross-validation.

I would like to have a p-value for my classifier. If I have understood correctly, I can treat each prediction as a Bernoulli trial, and test for significance using a binomial test. However, I'm a little confused about the details. The paper (pdf) I was following, in the section 'result significance', assumes that the subject groups are of equal size. That's not the case here. What do I need to do to make the test work?

newb

1 Answer


It sounds like your reference (link is broken) is doing a hypothesis test of whether the classification accuracy is higher than $50\%$, since with equal group sizes a naïve model can get half of the classifications right just by guessing one class every time.

For your problem, a naïve model can predict the correct outcome $60\%$ of the time by predicting the majority class (group A, $30/50$) every time. Therefore, the hypothesis test is not whether your model achieves an accuracy above $50\%$ but whether it achieves an accuracy above $60\%$. It sounds like your link would have proposed a one-sided, one-sample proportion test for your model's proportion classified correctly, i.e., accuracy expressed as a proportion (e.g., $0.8$ instead of $80\%$). Call that proportion $p$.

$$ H_0: p = 0.6\\ H_a: p > 0.6 $$
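If you want to stay with the binomial test you were originally planning, the only change is to use that $0.6$ baseline (rather than $0.5$) as the null proportion. A minimal sketch in Python with SciPy, where the count of 38 correct leave-one-out predictions is purely a hypothetical stand-in for your own result:

```python
from scipy.stats import binomtest

n_subjects = 50   # total leave-one-out predictions (one per subject)
n_correct = 38    # hypothetical number of correct predictions
p_null = 30 / 50  # accuracy of always predicting the majority class (group A)

# Exact one-sided binomial test of H0: p = 0.6 against Ha: p > 0.6
result = binomtest(n_correct, n=n_subjects, p=p_null, alternative="greater")
print(result.pvalue)
```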

A z-test might be the most straightforward way to do this. Example S.6.2 at the Penn State University link here works through such a test.
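A minimal sketch of that z-test, written out directly from the formula and reusing the hypothetical count of 38 correct predictions from above:

```python
import numpy as np
from scipy.stats import norm

n = 50           # number of leave-one-out predictions
k = 38           # hypothetical number classified correctly
p_hat = k / n    # observed accuracy as a proportion
p0 = 0.6         # null proportion: the majority-class baseline

# One-sample, one-sided z-test for a proportion:
# z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
p_value = norm.sf(z)  # P(Z > z) under the null
print(z, p_value)
```

With 50 subjects and $p_0 = 0.6$, both $n p_0 = 30$ and $n(1 - p_0) = 20$ are comfortably above the usual rule-of-thumb threshold, so the normal approximation should be acceptable; otherwise the exact binomial test above is the safer choice.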

Classification accuracy has its issues [1, 2], but if you insist on using it, comparing to the accuracy of a model that always predicts the majority category seems like a sensible baseline. My answer here gives a justification for this (and another answer to that question explains why a different baseline might be preferable, under the assumption that classification accuracy is not quite the right measure of model performance).

Dave