
In a multi-label classification problem there are duplicates in the training data that are labelled differently. For example, the feature x is the same for both rows, while the corresponding labels y differ.

import pandas as pd

df = pd.DataFrame({'x': ['text', 'text'],
                   'y': [[0, 1], [0, 2]]})

Neither of these is necessarily wrong. As the label space is rather large, they might just be equally correct (or equally wrong, for that matter).

I assume duplicates with different labels are problematic when learning a model on these data. The model cannot get the prediction for both data points right at the same time. What is the best way to handle this situation? Should I merge both rows by building the union of the two sets of labels?

df = pd.DataFrame({'x': ['text'],
                   'y': [[0, 1, 2]]})
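
For reference, such a union merge could be computed directly in pandas; a minimal sketch, assuming the labels are stored as Python lists as above:

import pandas as pd

df = pd.DataFrame({'x': ['text', 'text'],
                   'y': [[0, 1], [0, 2]]})

# Build the union of label sets per duplicated feature value:
# explode the label lists, group by the feature, and collect the unique labels.
merged = (df.explode('y')
            .groupby('x')['y']
            .agg(lambda labels: sorted(set(labels)))
            .reset_index())
print(merged)
#       x          y
# 0  text  [0, 1, 2]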

Or are there alternatives?

raywib
  • Welcome to Cross Validated! Could you please expand on why you find this problematic? You can see from the data that the same features can have different outcomes, so shouldn’t the model not be able to get it right every time? – Dave Feb 01 '24 at 14:13
  • See https://stats.stackexchange.com/questions/602110/how-to-treat-duplicates-while-dealing-with-real-data – kjetil b halvorsen Feb 01 '24 at 14:22
  • Your model should output conditional class membership probabilities in any case. In this case, you would want predicted probabilities of 0.5 and 0.5. Where is the problem? – Stephan Kolassa Feb 01 '24 at 14:23
  • So maybe I got it wrong and it is not so much a problem for the model but for the metric: A model could output equal probabilities for both label 1 and 2 and that would be satisfactory. But if I evaluated the model on the same data above the loss function would give the impression of rather a bad model. Or am I missing something? – raywib Feb 02 '24 at 07:59
  • Well the model can’t tell which label is correct for that input, can it? No matter what, it will get something wrong, will it not? It seems like, whatever distinguishes those outcomes is missing from your data. “Bad model” is one way to describe that, and your measures of performance should reflect the fact that you lack the desired ability to predict. – Dave Feb 02 '24 at 11:36
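
To put a number on the metric point raised in these comments: with the two conflicting rows, predicting 0.5 for both label 1 and label 2 is the best any model can do under a per-label log loss, but the loss still cannot reach zero. A small numeric check (a sketch, not from the thread):

import numpy as np

# Row 1 carries label 1 but not label 2; row 2 carries label 2 but not label 1.
# Average binary cross-entropy over the two rows for predictions (p1, p2).
def avg_log_loss(p1, p2):
    row1 = -(np.log(p1) + np.log(1 - p2))
    row2 = -(np.log(1 - p1) + np.log(p2))
    return (row1 + row2) / 2

print(avg_log_loss(0.5, 0.5))  # ~1.386 (= 2 ln 2), the minimum achievable here
print(avg_log_loss(0.9, 0.9))  # ~2.408, confidently predicting both labels is worse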

1 Answer


It does not make sense to treat this feature as corresponding to both outcomes happening simultaneously. The evidence points to either outcome being possible, but nothing suggests that both can happen at once. If you give this feature both labels as its outcome, you are telling the model to predict that both will happen when this feature occurs, and the data dispute that.

Consider an analogy to a multi-label image problem. Given an image, it could contain either a dog or a wolf: sometimes that animal will turn out to be a dog and other times a wolf. That is a different story than the image containing both a dog and a wolf.

(Perhaps it’s because of the breeds of dogs I’ve had, but I find it hard to believe that a dog and a wolf look so similar. Perhaps this is a picture of the animal’s shadow, where a husky’s shadow probably looks like a wolf’s shadow, meaning that the same feature could correspond to either a husky (dog) or a wolf. However, if there’s just one shadow in this particular image, you might not want to predict the presence of two animals.)

Yes, this means that model performance has a limit and that you can’t necessarily determine which outcome will correspond with this feature. Since nothing in your features distinguishes these outcomes, you should struggle. If you look at the “X” below, taken from my answers here and here, you are at a feature analogous to $(0,0)$, where you really don’t know the color (label) to which the feature corresponds. A good model should reflect this ambiguity instead of lying to you about its ability to distinguish between categories when the input is $(0,0)$.

[Figure from the linked answers: the “X” referred to above, marking a feature value analogous to $(0,0)$ where either color (label) is equally plausible.]
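
To make that concrete with the toy data from the question: if the duplicate rows are kept as they are, the empirical conditional label frequencies, which a well-calibrated model should roughly reproduce, are 1.0 for label 0 and 0.5 each for labels 1 and 2, whereas merging the rows into a single [0, 1, 2] row would push all three towards 1.0. A minimal sketch of that computation:

import pandas as pd

df = pd.DataFrame({'x': ['text', 'text'],
                   'y': [[0, 1], [0, 2]]})

# Fraction of rows with each feature value that carry each label, i.e. P(label | x).
exploded = df.explode('y')
label_counts = exploded.groupby(['x', 'y']).size()
row_counts = df.groupby('x').size()
print(label_counts.div(row_counts, level='x'))
# x     y
# text  0    1.0
#       1    0.5
#       2    0.5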

Dave
  • Of course the data does not provide this kind of information but as I said: For this problem it is satisfactory to predict both labels with equal probability whether or not the data suggests that. (The labels in the data are applied by human labellers and agreement is rather low. However, the labels are not strictly incorrect but rather membership is fuzzy.) – raywib Feb 05 '24 at 07:44
  • @raywib What is the problem such an approach hopes to solve? – Dave Feb 06 '24 at 11:19
  • I am not sure if this answers your question: The situation is that there is a set of suitable labels for each data point. However, each labeller might apply only a subset of these. – raywib Feb 16 '24 at 06:26