I'm training a binary classifier on imbalanced data (the real/production data has ~2% positive labels). Setting aside the questionable efficiency of oversampling/undersampling techniques, I have a lot of training data, so I can manually add real positive observations to the data instead of synthesizing them with an oversampling technique. My assumptions, based mostly on intuition, are:
The model should be trained on a dataset with more than 2% positive labels.
The test set should be as similar as possible to the real data; in this case, it should have the same proportion of positive labels (~2%).
The validation set should be as similar as possible to the test set (see the sketch after this list).
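To make assumptions 2 and 3 concrete, here is a minimal sketch using hypothetical synthetic data (in practice `X, y` would be the real observations): stratified splitting keeps the ~2% positive rate in both the validation and test sets.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the real data: ~2% positive labels.
    X, y = make_classification(
        n_samples=100_000, n_features=20, weights=[0.98, 0.02], random_state=0
    )

    # Stratify so the test set keeps the original ~2% positive proportion (assumption 2).
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    # The validation set mirrors the test set's class balance (assumption 3).
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
    )

    print(f"val positive rate:  {y_val.mean():.3f}")
    print(f"test positive rate: {y_test.mean():.3f}")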
When I balanced the training set by manually adding real positive examples (~35% positive labels) and then applied CV to this data, I violated my third assumption, because the positive label proportion in the validation folds was much higher than in the test set.
Another approach I tried was splitting the dataset into one training set, one validation set, and one test set (hold-out validation), so all my assumptions held. However, with this approach the validation set (and the test set) contain few positive observations (fewer than 50), and my concern is that a kind of overfitting will occur: the model will learn to recognize and classify just these few observations as positive, and will have trouble classifying new positive observations.
This process led me to the following approach: create a fixed training set that contains a higher proportion of positive labels than the real data, and evaluate the model on K validation sets that each have the actual minority label ratio. This is similar to CV, with one big difference: the model is trained on the same training set every time.
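A minimal sketch of this scheme, continuing the hypothetical split above (the 35% ratio, K=5, the metric, and the use of negative undersampling as a stand-in for "manually adding real positives" are all illustrative assumptions, not a fixed recipe): the model is trained once on the enriched training set and then scored on K validation sets that each keep the real ~2% positive rate.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score

    rng = np.random.default_rng(0)

    # Enrich the training set: keep all positives and subsample negatives so that
    # positives make up ~35% of the training data (here undersampling stands in for
    # adding real positives; what matters for the sketch is the resulting class ratio).
    pos_idx = np.flatnonzero(y_train == 1)
    neg_idx = np.flatnonzero(y_train == 0)
    n_neg = int(len(pos_idx) * 0.65 / 0.35)
    train_idx = np.concatenate([pos_idx, rng.choice(neg_idx, size=n_neg, replace=False)])

    # Trained once, on the same fixed training set.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])

    # K validation sets at the real ~2% ratio, drawn by bootstrapping the held-out
    # validation pool (which already has the real class proportions).
    K, scores = 5, []
    for _ in range(K):
        boot = rng.choice(len(y_val), size=len(y_val), replace=True)
        scores.append(
            average_precision_score(y_val[boot], model.predict_proba(X_val[boot])[:, 1])
        )

    print(f"average precision over {K} validation sets: "
          f"{np.mean(scores):.3f} +/- {np.std(scores):.3f}")

The spread of the K scores would then indicate how stable the evaluation is despite the small number of positives in each validation set.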
Could this approach work, or is there an established, literature-based approach for handling this situation?