I'm working on a multi-label classification problem where I want to classify text into 20 categories, and each text may belong to one or multiple categories. Each label is binary (0 or 1) and highly imbalanced, i.e. the vast majority of values are 0 and only a small portion are 1. I read about Hamming loss as a common measure for multi-label classifiers, so consider the example below:

import numpy as np

y_pred = np.array([[0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1],
                   [0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1],
                   [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]])
y_true = np.array([[0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0],
                   [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1],
                   [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]])

from sklearn.metrics import hamming_loss
print(hamming_loss(y_true, y_pred))

This gives a Hamming loss of 0.083, but since the vast majority of labels are 0, a trivial classifier that predicts every label as 0 achieves essentially the same score:

y_pred2 = np.zeros(y_pred.shape)
print(hamming_loss(y_true, y_pred2)) # hamming loss = 0.083

So I'm wondering: for such a multi-label classification problem with a large number of labels, most of which are 0, what would be a proper measure of classifier performance?

crx91

1 Answer


Hamming loss is the fraction of labels that are incorrectly predicted. It is thus the generalization of (one minus) accuracy to the multi-label situation, and accuracy is a highly problematic KPI in classification.
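
As a quick check on the example in the question, the Hamming loss here is simply the mean elementwise disagreement between y_true and y_pred:

print(np.mean(y_true != y_pred))  # 0.0833..., matches hamming_loss(y_true, y_pred)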

I would very much recommend that you use probabilistic classifications. These have no problems with an instance being a member of multiple classes - with an appropriate model, you will get one output per possible target class, giving the probability that the instance is a member of that class.
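
For instance, here is a minimal sketch with scikit-learn; the one-vs-rest logistic regression and the random toy data are just placeholders for your actual text features and model of choice:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Placeholder data: 100 instances, 10 features, 20 imbalanced binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Y = rng.binomial(1, 0.1, size=(100, 20))

# One binary model per label; predict_proba then yields, for every instance,
# a membership probability for each of the 20 labels
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
probs = clf.predict_proba(X)
print(probs.shape)  # (100, 20)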

Then assess these classifications using proper scoring rules.
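
Continuing the sketch above, the Brier score and the log loss are two common proper scoring rules; averaging them over the 20 label columns is one reasonable aggregation (in practice you would of course score held-out predictions, which this toy snippet skips):

from sklearn.metrics import brier_score_loss, log_loss

# Score the predicted probabilities per label, then average across labels
brier = np.mean([brier_score_loss(Y[:, j], probs[:, j]) for j in range(Y.shape[1])])
logl = np.mean([log_loss(Y[:, j], probs[:, j], labels=[0, 1]) for j in range(Y.shape[1])])
print(f"mean Brier score: {brier:.4f}, mean log loss: {logl:.4f}")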

Stephan Kolassa