In a multi-label classification problem, there are duplicates in the training data that are labelled differently. For example, the feature x is the same for both rows, while the corresponding labels y differ.
import pandas as pd
df = pd.DataFrame({'x': ['text', 'text'],
                   'y': [[0, 1], [0, 2]]})
Neither of these is necessarily wrong. Because the label space is rather large, both might be equally correct (or equally wrong, for that matter).
I assume duplicates with different labels are problematic when training a model on these data, since the model cannot get the prediction for both data points right at the same time. What is the best way to handle this situation? Should I merge both rows by taking the union of the two sets of labels?
df = pd.DataFrame({'x': ['text'],
                   'y': [[0, 1, 2]]})
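For reference, the union merge I have in mind could be sketched roughly as follows. This assumes the labels are stored as Python lists in column y and that rows count as duplicates when x matches exactly; the helper name union_labels is just something I made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': ['text', 'text'],
                   'y': [[0, 1], [0, 2]]})

def union_labels(label_lists):
    # Hypothetical helper: combine every label list in a group into one
    # sorted, de-duplicated list.
    return sorted(set().union(*label_lists))

# Group rows that share the same feature value and merge their labels.
merged = df.groupby('x')['y'].agg(union_labels).reset_index()
```

With the two example rows, this should leave a single row for 'text' whose label list contains 0, 1 and 2.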
Or are there alternatives?
