The validation set includes few positive labels

Question

I'm training a classifier on an imbalanced dataset. The proportion of positive labels in the test set is 0.02%.

For that reason, the validation set has the same label proportions. Because the validation set is much smaller than the test set, it contains fewer than ten positive labels. The test set includes 25 positive labels. I tune the model's hyperparameters using the F-beta score.

I'm not sure that a sample with fewer than ten positive labels is valid for tuning and evaluating the classifier. Indeed, the classifier performs terribly on the validation and test sets. Since the training set is more balanced than the validation and test sets, I could move positive examples from the training set to the validation set (and test set). However, those sets would then no longer represent the real data.

What do you recommend I do?

Why is the training set more balanced than the validation and test sets? — Dave, Nov 03 '22 at 15:49
Is there a reason you don't stratify your train/val/test set creation? Also, consider using repeated CV. — usεr11852, Nov 06 '22 at 00:04
@Dave - Suppose the training set contained 0.02% positive labels, like the test set. The model would surely fail to classify any observation as positive. Therefore, the training set includes more positive labels than the test and validation sets. Do you think that is the wrong approach? — Amit S, Nov 06 '22 at 08:25
@usεr11852, why do you think repeated CV is a good solution in my case? What exactly do you mean by stratifying the train/val/test set creation? — Amit S, Nov 06 '22 at 08:30
Yes, I think it is the wrong approach to artificially balance the class ratio in the training data. — Dave, Nov 06 '22 at 08:52
Depending on your model and optimization criteria, the model may learn to generalize well with an imbalanced training set. But this is not always true. See https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data. — PiccolMan, Nov 08 '22 at 02:15

Dave · Answer 1 · 2022-11-11T05:39:06.303

You’ve split the data wrong. Since class imbalance is unlikely to be a problem for your work, even if it appears to be when you use improper scoring rules, there is no need to fiddle with the data. Just split the data, perhaps stratifying to ensure the exact same ratio in both the training-sample and out-of-sample sets.

When you do this, you do not deplete your minority-class samples by artificially balancing the training data, which leaves you with plenty of samples for an out-of-sample assessment (especially since, as you have posted, you have a lot of data).
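
To make the stratification idea concrete, here is a minimal sketch in plain Python. The function name `stratified_split`, the 20% test fraction, and the toy label counts are assumptions for illustration, not taken from the question. In practice, scikit-learn's `train_test_split(..., stratify=y)` does the same job.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly the same class ratio in both."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Take the same fraction of every class for the held-out set.
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical data: 50 positives among 10,000 rows.
labels = [1] * 50 + [0] * 9950
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

train_pos = sum(labels[i] for i in train_idx)
test_pos = sum(labels[i] for i in test_idx)
print(train_pos, len(train_idx))  # positives and rows in the training split
print(test_pos, len(test_idx))    # positives and rows in the held-out split
```

Because every class is split by the same fraction, no positives are pulled out of one set to prop up another, and both sets keep the real-world ratio.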

An alternative to splitting the data is a bootstrap approach. Not everyone agrees with this, with an interesting debate here, and my take on it is that I am torn. However, it is worth knowing that such an approach to validation does exist.
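
For readers who have not seen bootstrap validation, the sketch below shows one common variant, out-of-bag evaluation: rows are resampled with replacement, the model is fit on the resample, and it is scored on the rows that were not drawn. Which variant is meant above is not specified, so treat this as an assumption. The names `bootstrap_oob_splits`, `bootstrap_scores`, and the toy `held_out_share` scorer are hypothetical placeholders for a real fit-and-evaluate step.

```python
import random
from statistics import mean


def bootstrap_oob_splits(n_rows, n_rounds=100, seed=0):
    """Yield (in_bag, out_of_bag) index lists, one pair per bootstrap round."""
    rng = random.Random(seed)
    for _ in range(n_rounds):
        # Sample n_rows indices with replacement.
        in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
        drawn = set(in_bag)
        # Rows never drawn in this round serve as the evaluation set.
        out_of_bag = [i for i in range(n_rows) if i not in drawn]
        yield in_bag, out_of_bag


def bootstrap_scores(n_rows, fit_and_score, n_rounds=100, seed=0):
    """Call fit_and_score(in_bag, out_of_bag) once per round; return the scores."""
    return [
        fit_and_score(in_bag, oob)
        for in_bag, oob in bootstrap_oob_splits(n_rows, n_rounds, seed)
        if oob  # skip the (very unlikely) round with nothing held out
    ]


# Toy stand-in for "fit a model on in_bag, score it on out_of_bag":
# it only reports what share of the rows ended up held out.
def held_out_share(in_bag, oob):
    return len(oob) / len(in_bag)


shares = bootstrap_scores(1000, held_out_share, n_rounds=50)
avg_share = mean(shares)
print(round(avg_share, 2))  # typically a bit over a third of the rows
```

With a real model, `fit_and_score` would train on the in-bag rows and return a score computed on the out-of-bag rows. Averaging over many rounds means every positive example is used for evaluation in some rounds, not just the few that land in a single validation split.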

score 0 · Answer 2 · answered Nov 05 '22 at 20:56

If the training data is more balanced, why not move some positive samples from the training set to the validation and test sets? If the training data is also imbalanced, one option is to use representation learning first to learn more discriminative features, and then train the classifier on those features.

PiccolMan · Answer 3 · 2022-11-08T02:08:19.597

Having a balanced training set is important, as the model may otherwise learn to be biased towards the negative class. If it is not possible to have a balanced training set, then I suggest you look into adjusting the weight each class has in the error criterion. This can be done in sklearn by using the class_weight model parameter, which gives the model a much larger penalty during training for misclassifying positive examples and so makes up for the lack of positive examples in the training set.
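
As a rough illustration of what such weights look like, the sketch below computes the heuristic that scikit-learn documents for `class_weight="balanced"`, namely `n_samples / (n_classes * count_of_class)`. The helper name `balanced_class_weights` and the toy label counts are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical training labels: 20 positives among 10,000 rows.
labels = [1] * 20 + [0] * 9980
weights = balanced_class_weights(labels)

print(weights[1])  # 250.0: each positive counts as much as 250 "average" rows
print(weights[0])  # roughly 0.5 for each negative
```

Many scikit-learn estimators accept either a dict like this or the string `"balanced"` through their `class_weight` parameter, for example `LogisticRegression(class_weight="balanced")`.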

As for the validation set, 10 positive examples is quite small. Try to increase that to 20 if it's possible. However, having an unbalanced validation and test set is not as much of an issue. You can look at the confusion matrix or other metrics derived from the confusion matrix to validate the performance of the model - https://towardsdatascience.com/performance-metrics-confusion-matrix-precision-recall-and-f1-score-a8fe076a2262.
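
A small sketch of the confusion-matrix metrics mentioned above, including the F-beta score from the question. The predictions are made up for illustration (10 positives among 1,000 rows, with 6 caught and 4 false alarms), and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Precision, recall, and F-beta; each is 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


# Made-up validation results: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990
y_pred = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 986

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f2 = precision_recall_fbeta(tp, fp, fn, beta=2.0)
accuracy = (tp + tn) / len(y_true)

print(tp, fp, fn, tn)
print(precision, recall, f2, accuracy)
```

In this toy case accuracy is above 99% even though 4 of the 10 positives are missed, which is why precision, recall, and F-beta are more informative here. Note that with only 10 positives, a single extra hit or miss moves recall by 0.1, so these estimates are noisy.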

Furthermore, have you looked into augmenting the data?

edited Nov 08 '22 at 02:08

answered Nov 08 '22 at 01:46


Your first sentence is, unfortunately, based on common misconceptions. See (1) and (2) to start down the rabbit hole. – Dave Nov 08 '22 at 01:56
It's possible that the model may learn to generalize well with an imbalanced training set, but an imbalanced training set is still a cause for concern. See here - https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data. – PiccolMan Nov 08 '22 at 02:07
It’s a concern when you use discontinuous improper scoring rules that use $0.5$ as a cutoff for binary classification. When you analyze the rich continuous outputs given by most models, the issues go away except in some fringe cases (which can happen but are very much the exception). – Dave Nov 08 '22 at 02:12
