
I am trying to determine whether my classifier has obtained a statistically significant result. The problem: classify a heartbeat into one of 2 classes, either "normal" or "abnormal".

Let's say I have the following confusion matrix generated for model A:

[Confusion matrix image for model A; per the calculations below, 404 + 132 = 536 of the 648 heartbeats are classified correctly.]

I found the post Comparing two classifier accuracy results for statistical significance with t-test, which asks a similar question, and am using the answer by Ébe Isaac to try to solve this; however, I am running into an issue.

Work so far:

Let α = 0.1

Let p1 be the accuracy (proportion of correct classifications) of model A

Let p2 be the accuracy of model B (which always guesses "normal")

In this case p1 = (404 + 132)/(648) = 0.827

p2 = 404/648 = 0.623, because model B always guesses "normal"

Is the result of p1 statistically significant?

Ho: p = 0.623

Ha: p > 0.623

Following the approach from Ébe Isaac's answer to that Cross Validated post:

[Image of the test statistic from that answer: p̂ = (y1 + y2)/(n1 + n2) and z = (p1 − p2)/sqrt(2 · p̂ · (1 − p̂)/n), where y1, y2 are the counts of correct classifications and n is the common sample size.]

p̂ = (404 + 536)/648 = 1.45

Z = (0.827 - 0.623)/sqrt(2 * 1.45 * (1-1.45)/648)

The issue is that this gives me an error, since the quantity under the square root is negative!
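To make the failure concrete, here is the same arithmetic as a quick Python sketch (the variable names are mine, not from the linked answer); the quantity under the square root goes negative because p̂ = 1.45 is greater than 1:

    import math

    n = 648                        # total classifications per model
    p1 = (404 + 132) / n           # model A accuracy, ~0.827
    p2 = 404 / n                   # always-"normal" baseline, ~0.623

    p_hat = (404 + 536) / n        # the value computed above, ~1.45
    radicand = 2 * p_hat * (1 - p_hat) / n
    print(radicand)                # negative

    z = (p1 - p2) / math.sqrt(radicand)   # raises ValueError: math domain error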

Can someone please help explain why I am getting this error and show me the steps to complete the problem? Thanks!

Sreehari R

2 Answers


While I am with Tim that there is more to the story than statistical significance, the source of your complex $z$-stat is that you have miscalculated $\hat p$ as $1.45$. As $\hat p$ is a proportion, it cannot exceed $1$. Your $n$ is the total number of classification attempts, which cannot be lower than the total number of correct classifications. Since your two models each attempt $648$ classifications, your $n=2\times 648=1296$. With that value of $n$, you get $\hat p = 940/1296 \approx 0.725$, leading to a real $z$-stat.
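For concreteness, here is a minimal Python sketch of the corrected calculation (variable names are mine): the pooled p̂ is computed over all 1296 attempts, while the z formula quoted in the question keeps the common per-model sample size of 648, so the radicand is positive and the z-stat is real.

    import math

    n_each = 648                           # classifications attempted by each model
    n_total = 2 * n_each                   # 1296 attempts across both models

    p1 = (404 + 132) / n_each              # model A accuracy, ~0.827
    p2 = 404 / n_each                      # always-"normal" baseline, ~0.623

    p_hat = (404 + 536) / n_total          # pooled proportion, ~0.725
    z = (p1 - p2) / math.sqrt(2 * p_hat * (1 - p_hat) / n_each)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # one-sided (upper-tail) p-value

    print(z, p_value)                      # z is roughly 8.2, p-value far below alpha = 0.1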

ADDITIONAL INFO

If you are interested in the topic of comparing machine learning models, I recommend a read of Benavoli et al. (2017). While the authors make a case for Bayesian methods, the paper also goes through more classical techniques.

Finally, note the issues with classification accuracy as a measure of model performance.

Why is accuracy not the best measure for assessing classification models?

Academic reference on the drawbacks of accuracy, F1 score, sensitivity and/or specificity

REFERENCE

Benavoli, Alessio, et al. "Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis." The Journal of Machine Learning Research 18.1 (2017): 2653-2688.

Dave

Let's say that you found that your result is "significant" but it is unacceptable from a business, or clinical, point of view (it is too small to be safe, it leads to losses that are too large, etc.), what then? Assessing that it is not "bad" does not make the result "good". The hypothetical test would tell you that the result differs from random predictions, but "better than random" usually does not equal "good". What if you used two classifiers and both were significant? Statistical significance is not meant to be a measure of something being "good" or "bad".

Tim
  • I totally agree! I just want to corroborate the benchmarks I already have with this model by making sure that the accuracy is statistically significant. Do you know where I made a mistake above and how I can fix it? – Sreehari R Feb 19 '18 at 10:18
  • @SreehariR see https://stats.stackexchange.com/questions/192291/variable-selection-using-cross-validated-pls-model-when-permutation-test-shows-l, you can also use the bootstrap here (see the sketch after these comments). – Tim Feb 19 '18 at 11:29
  • I am still not sure where I went wrong – Sreehari R Feb 19 '18 at 18:38
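As an illustration of the bootstrap Tim mentions above, here is a small Python sketch. The per-heartbeat correctness indicators are purely hypothetical, simulated only to match the accuracies quoted in the question, since the actual predictions are not given.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 648

    # Hypothetical per-heartbeat correctness indicators (True = correct)
    # for model A and the always-"normal" baseline on the same test set.
    correct_a = rng.random(n) < 0.827
    correct_b = rng.random(n) < 0.623

    diffs = []
    for _ in range(10_000):
        idx = rng.integers(0, n, size=n)    # resample the same heartbeats for both models
        diffs.append(correct_a[idx].mean() - correct_b[idx].mean())

    lo, hi = np.percentile(diffs, [5, 95])  # central 90% bootstrap interval for the accuracy difference
    print(lo, hi)                           # if the whole interval is above 0, model A looks better than the baseline

In practice you would replace the simulated arrays with the actual per-heartbeat correct/incorrect indicators from each model, so that the resampling respects the pairing on the same test set.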