
I know that when validating we are interested in knowing how the model performs in real-world scenarios, so we want the class ratios during validation/test to be the original ones.

Say, however, that we are performing some kind of parameter search/optimization. If we are comparing different possible configurations, I guess we should never use the validation loss to compare models: if that loss is not weighted, the minority class has little "representation", so we could end up choosing a model that performs well on the majority class but poorly on the minority one. We should instead use a metric that considers both classes equally. Is that right?
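To make that comparison concrete, here is a minimal sketch (assuming scikit-learn and a binary problem; the function and variable names are illustrative, not from any particular recipe) of scoring one candidate configuration with both an unweighted loss and a class-balanced metric such as macro F1:

    # Sketch: score one candidate configuration on the validation set two ways.
    # With imbalanced classes the unweighted log-loss is dominated by the majority
    # class, while macro F1 averages per-class scores so both classes count equally.
    from sklearn.metrics import f1_score, log_loss

    def score_candidate(y_val, y_pred_labels, y_pred_proba):
        unweighted_loss = log_loss(y_val, y_pred_proba)                   # majority-dominated
        balanced_score = f1_score(y_val, y_pred_labels, average="macro")  # class-balanced
        return unweighted_loss, balanced_score

Model selection would then pick the configuration with the best balanced_score rather than the smallest unweighted_loss.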

I believe this reasoning would apply not only to comparing models but also to learning-rate schedulers that take validation metrics into account. Torch's ReduceLROnPlateau uses a validation metric to adjust the learning rate. In their examples they use the validation loss, but for the same reason I just stated, I believe this might not be the best idea when we have class imbalance.
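As a minimal sketch of that point (assuming PyTorch; evaluate_balanced_metric is a hypothetical helper standing in for your own validation loop), the scheduler can be driven by a class-balanced validation metric rather than the raw validation loss:

    import torch

    model = torch.nn.Linear(16, 2)                             # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # mode="max" because a balanced metric (e.g. balanced accuracy) should increase.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", patience=3)

    for epoch in range(20):
        # ... training step omitted ...
        val_metric = evaluate_balanced_metric(model)  # hypothetical: balanced accuracy or macro F1 on validation data
        scheduler.step(val_metric)                    # learning rate is reduced when the balanced metric plateaus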

I know there are posts that somewhat answer this, but I have not found any that discusses model comparison or learning-rate scheduling based on the validation loss.

2 Answers


You should use a loss that accurately reflects the "real-world loss" you are trying to minimize by using your model (in the context of subsequent decisions). Then the "problem" disappears, or, more precisely, never was a problem in the first place.

Suppose you have a rare disease, with an incidence of one in a hundred, but which is fatal. If you use a loss that does not account for the difference in consequences or costs, like accuracy, your model will be tempted to label all instances as negative. However, once you include a much larger loss for an instance incorrectly labeled "healthy" (a false negative) than for a false positive, you are actually comparing apples to apples, and the rarity of the target class in the validation sample is outweighed by the severity of the costs incurred by misclassifying those cases. (Of course, your dataset needs to be large enough that you actually have some instances of the target class in the validation sample.)
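As a minimal sketch of encoding such asymmetric costs (assuming PyTorch; the factor of 50 is an illustrative assumption, not a recommendation), the rare positive class can be given a much larger weight so that false negatives are penalized far more than false positives, and the same cost-aware criterion is then used on the validation set:

    import torch

    # pos_weight scales the loss contribution of positive (diseased) examples,
    # so predicting "healthy" for a sick patient costs much more than the reverse.
    criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([50.0]))

    logits = torch.tensor([[-2.0], [0.5]])   # raw model outputs for two patients
    labels = torch.tensor([[1.0], [0.0]])    # the first patient actually has the disease
    loss = criterion(logits, labels)         # apply the same criterion to validation data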

You may find this thread interesting: Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?

Stephan Kolassa
  • to be clear then, you say it is fine to use a weighted function even for validation, am I right? – raquelhortab May 17 '23 at 08:06
  • Yes, exactly. The weighting should simply reflect the real world costs. – Stephan Kolassa May 17 '23 at 08:11
  • A possible problem here could be high variance of the validation loss. (Of course, undersampling the majority class won't help, but oversampling the minority class and e.g. importance-weighting might help.) – usul May 17 '23 at 22:02

Answering you with a question: if not the validation loss, then what? Certainly, the training metrics won't be any better here. The desirable scenario is that your validation set resembles the real-world data that you will see at prediction time. In such a case, if the real-world data is equally imbalanced, performance on the validation set will be similar to performance at prediction time.

If the proportion of the minority group in the validation set is different from that in the real-world data, you can use a weighted loss, as you noticed. However, there are many myths and many ways of handling imbalanced data, so you can read about them in more detail in other questions tagged for unbalanced data.
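One way to do that reweighting, sketched below under the assumption that the real-world class prevalences are known (scikit-learn and NumPy are assumed; the function name is illustrative), is to weight each validation example by the ratio of its class's real-world frequency to its frequency in the validation set, so the weighted validation loss approximates the loss under real-world class ratios:

    import numpy as np
    from sklearn.metrics import log_loss

    def reweighted_validation_loss(y_val, y_proba, real_world_prevalence):
        """Log-loss on the validation set, importance-weighted towards real-world class ratios.

        real_world_prevalence: dict mapping class label -> assumed real-world frequency.
        """
        classes, counts = np.unique(y_val, return_counts=True)
        val_prevalence = dict(zip(classes, counts / len(y_val)))
        sample_weights = np.array([real_world_prevalence[c] / val_prevalence[c] for c in y_val])
        return log_loss(y_val, y_proba, sample_weight=sample_weights)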

Finally, I'm not sure if this is what you mean, but a completely different problem arises if what you aim for is a model that is equally good for all groups. This is a slightly different problem since, as Simpson's paradox shows, such a model does not necessarily work best for all the data, so you would need to decide whether you care more about overall performance or the within-group performances. Again, this could be achieved by picking a loss function that reflects the problem you are trying to solve.

Tim
  • "if not validation loss then what?": other metrics, such as AUROC, for instance. Also, you say "If the prop. of the minority group in the validation set is different than in the real-world data..." I don't think you understand what I mean. My question is how would a non-weighted loss be a good loss when comparing models. Let's see it as a higher order of bias. If we got 100 parameter configurations and plan to chose based on the loss, with unbalanced data we could not be paying attention to the minority class at all since high loss on minority cases would not affect much the overall loss. – raquelhortab May 17 '23 at 08:05
  • AUROC is just another loss, which you can calculate on the training, the validation, or the test data, so calculating AUROC on validation data falls squarely into what Tim and I would recommend (and your post seems to apply to this strategy equally). – Stephan Kolassa May 17 '23 at 08:29
  • good point, I guess I did not think about just any metric as a loss! – raquelhortab May 17 '23 at 19:55