I am trying to understand the trade-offs between different metrics for evaluating classification methods on multi-label data.
One option commonly found in the literature is the Hamming loss, defined as the fraction of incorrectly predicted labels out of the total number of labels. Another option is to assess the quality of the probabilistic predictions for each label using, for example, a log-likelihood (cross-entropy) loss function.
One trade-off is likely to occur when the data are sparse (few positive labels, many zeros), because the Hamming loss becomes relatively insensitive to differences between models: a trivial predictor that outputs all zeros already achieves a Hamming loss equal to the (small) base rate of positive labels, and any two models that make the same thresholded decisions score identically regardless of how well their probabilities are calibrated.
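To make the sparsity point concrete, here is a small sketch (my own toy example, not from any particular paper) comparing two hypothetical models that make the same thresholded decisions on sparse multi-label data. The Hamming loss cannot tell them apart, while the log-likelihood loss penalizes the model that is confidently wrong:

```python
import numpy as np

# Toy sparse multi-label data: 4 instances, 5 labels, mostly zeros.
y = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
])

def hamming_loss(y_true, p, thresh=0.5):
    """Fraction of per-label decisions (after thresholding) that are wrong."""
    return np.mean((p >= thresh).astype(int) != y_true)

def log_loss(y_true, p):
    """Mean negative Bernoulli log-likelihood over all instance-label pairs."""
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Model A: misses one positive label, but still assigns it moderate probability.
p_a = np.where(y == 1, 0.8, 0.1).astype(float)
p_a[3, 1] = 0.4   # hedged mistake

# Model B: identical thresholded decisions, but confidently wrong on that label.
p_b = np.where(y == 1, 0.8, 0.1).astype(float)
p_b[3, 1] = 0.05  # overconfident mistake

print(hamming_loss(y, p_a), hamming_loss(y, p_b))        # identical
print(round(log_loss(y, p_a), 3), round(log_loss(y, p_b), 3))  # B penalized more
```

Both models get a Hamming loss of 1/20 = 0.05 (one wrong decision out of 20 instance-label pairs), yet the log loss of model B is substantially higher, which is the kind of model difference the Hamming loss hides.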
Are there other conditions under which one should choose one or the other?