Stratifying the performance of a classifier

Question

I have trained a classifier and evaluated it on the test set using the area under the precision-recall curve (AUPRC). The test set spans 2000 different categories, and the classifier does not use the category as an input feature. I want to check the performance of my classifier for each category. The issue is that the number of samples in some categories is small; as a result, the category-wise performance report for them won't be statistically significant.

I am looking for an approach that can report category-wise performance along with p-values.

I would appreciate your feedback.

"The issue is that the number of samples in some categories is small; as a result, the category-wise performance report for them won't be statistically significant." What's wrong with that? — Christian Hennig, Dec 27 '22 at 22:32
@ChristianHennig: I have calculated the accuracy without splitting my data into different categories. My null hypothesis for each category is: "The model performance on the category of interest is as good as its performance on the whole dataset." I am not sure which statistical test I should use for this. If I haven't worded my question well, let me know and I will rephrase it. — poorya mirzavand, Dec 28 '22 at 14:39
I thought I could treat the number of misclassified instances in each category as binomially distributed. Suppose that, for a specific category, three of eight labels have been predicted wrong. Given the accuracy on the whole dataset, I can calculate how likely it is to have up to 3 mistakes given a sample size of 8. If the calculated probability is too low, I consider it a category where model performance is lower than the average performance. Does it work? — poorya mirzavand, Dec 28 '22 at 15:42
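
The binomial idea in the comment above can be sketched as a one-sided exact test. This is a minimal sketch, not a definitive procedure: the overall accuracy of 0.90 is an invented number for illustration, and the function names are hypothetical. It also assumes the overall error rate is a fixed, known value and that errors within a category are independent. Note that to flag a category as worse than average, the relevant tail is the probability of at least as many mistakes as observed, not "up to" that many.

```python
from math import comb


def binom_tail_at_least(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def category_p_value(n_errors: int, n_samples: int, overall_error_rate: float) -> float:
    """One-sided p-value for H0: the category's error rate equals the overall rate.

    A small value suggests the category has MORE errors than the overall
    error rate would explain by chance.
    """
    return binom_tail_at_least(n_errors, n_samples, overall_error_rate)


# Hypothetical numbers: 3 mistakes among 8 samples, overall accuracy 0.90 (assumed).
overall_accuracy = 0.90
p_value = category_p_value(n_errors=3, n_samples=8,
                           overall_error_rate=1 - overall_accuracy)
print(f"one-sided p-value: {p_value:.4f}")
```

This applies to accuracy (a count of right and wrong predictions), not directly to the area under the precision-recall curve. With 2000 categories tested at once, some correction for multiple comparisons would likely be needed before reading individual p-values as evidence.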

