I have trained a classifier and evaluated it on the test set using the area under the precision-recall curve (AUPRC). The test set spans 2000 different categories, and the classifier does not use the category as an input. I want to check the classifier's performance within each category. The issue is that some categories contain only a few samples, so the per-category performance estimates for them will not be statistically reliable.
I am looking for an approach that can report category-wise performance along with p-values.
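To make the request concrete, here is a rough sketch of the kind of report I have in mind, not a settled method. It assumes binary labels and takes the null hypothesis for each category to be "the scores are unrelated to the labels", tested by shuffling labels within the category. The names `average_precision`, `permutation_p_value`, and `category_report` are made up for illustration, and the area under the precision-recall curve is approximated by average precision in plain Python (a library routine such as scikit-learn's `average_precision_score` would be the usual choice).

```python
import random
from collections import defaultdict


def average_precision(labels, scores):
    """Average precision, used here as an estimate of the area under the PR curve.

    Tied scores are ordered by input position, which is a simplification.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank of this positive
    return total / n_pos


def permutation_p_value(labels, scores, n_perm=999, rng=None):
    """One-sided p-value for H0: scores carry no information about labels."""
    if rng is None:
        rng = random.Random(0)
    observed = average_precision(labels, scores)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if average_precision(shuffled, scores) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (1 + at_least_as_good) / (1 + n_perm)


def category_report(records, n_perm=999, seed=0):
    """records: iterable of (category, label, score), with label in {0, 1}.

    Returns {category: (n_samples, average_precision, p_value)}.
    """
    rng = random.Random(seed)
    by_cat = defaultdict(lambda: ([], []))
    for cat, label, score in records:
        by_cat[cat][0].append(label)
        by_cat[cat][1].append(score)
    report = {}
    for cat, (labels, scores) in by_cat.items():
        n, n_pos = len(labels), sum(labels)
        if n_pos == 0 or n_pos == n:
            # Only one class present: no meaningful test for this category.
            report[cat] = (n, None, None)
        else:
            ap, p = permutation_p_value(labels, scores, n_perm, rng)
            report[cat] = (n, ap, p)
    return report
```

Two caveats with this sketch: the smallest p-value it can report is 1 / (n_perm + 1), and with 2000 categories the p-values would presumably need a multiple-comparison correction before being interpreted.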
I would appreciate your feedback.