
I have trained a classifier and evaluated its performance on the test set using the area under the precision–recall curve (AUPRC). My test set spans 2000 different categories, and the classifier does not use the category as an input. I want to check the classifier's performance for each category separately. The issue is that some categories have very few samples, so the per-category performance estimates for them won't be statistically meaningful.

I am looking for an approach that can report per-category performance along with p-values.

I would appreciate your feedback.

  • "The issue is that the number of samples in some categories is small; as a result, the category-wise performance report for them won't be statistically significant." What's wrong with that? – Christian Hennig Dec 27 '22 at 22:32
  • What null hypothesis do you want to test, and why? – Christian Hennig Dec 27 '22 at 22:32
  • @ChristianHennig: I have calculated the accuracy without splitting my data into categories. My null hypothesis for each category is: "The model's performance on the category of interest is as good as its performance on the whole dataset." I am not sure which statistical test I should use for this. If I haven't phrased my question well, let me know so I can rephrase it. – poorya mirzavand Dec 28 '22 at 14:39
  • I thought I could treat the instances of each category as draws from a binomial distribution. Suppose that in a specific category, three of eight labels were predicted wrong. Given the accuracy on the whole dataset, I can calculate how likely it is to make 3 or more mistakes in a sample of 8. If that probability is very low, I treat it as a category where the model performs worse than average. Does this work? – poorya mirzavand Dec 28 '22 at 15:42
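The binomial idea in the last comment can be sketched as a one-sided exact binomial test: under the null hypothesis, the number of correct predictions in a category of size n follows Binomial(n, p), where p is the dataset-wide accuracy. A minimal stdlib-only sketch, assuming a hypothetical overall accuracy of 0.9 (the 0.9 is illustrative; the question does not state the actual accuracy):

```python
from math import comb

def binomial_pvalue(correct: int, n: int, overall_acc: float) -> float:
    """One-sided exact binomial test: P(X <= correct) where
    X ~ Binomial(n, overall_acc). A small p-value suggests the
    category's accuracy is below the dataset-wide accuracy."""
    return sum(
        comb(n, k) * overall_acc**k * (1 - overall_acc) ** (n - k)
        for k in range(correct + 1)
    )

# Example from the comment: 3 of 8 labels wrong, i.e. 5 correct,
# with a hypothetical overall accuracy of 0.9.
p = binomial_pvalue(5, 8, 0.9)
print(f"{p:.4f}")  # → 0.0381
```

With 2000 categories this produces 2000 p-values, so some multiple-comparisons adjustment (e.g. Benjamini–Hochberg) would normally be applied before flagging categories.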

0 Answers