(This started as a comment)
Regarding good threads already available: I would strongly suggest looking into
- Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
- When is unbalanced data really a problem in Machine Learning?
- What problem does oversampling, undersampling, and SMOTE solve?
They give a very good idea of the subtlety of the imbalanced-learning problem. They should help build a better appreciation of the issue, because reading bite-sized cookbook suggestions (like the one I give below) is only a stop-gap measure.
Regarding the calibration of prediction:
If the observed class proportions before resampling are, say, 0.5-to-99.5 and we keep 1% of the negatives (a 1% negative downsampling), the observed class proportions in our new sample will now reflect approximately a 33-to-67 split. This is our "downsampled space", where we train the learner. For actual deployment we need to re-calibrate the learner so that its predictions reflect the original 0.5% base rate; in the original space, probabilities calibrated to a 33-to-67 proportion would be unreasonably high. A straightforward way is to compute the corrected probabilities as $q = \frac{p}{p + \frac{1-p}{w}}$, where $p$ is the prediction in downsampled space and $w$ is the negative downsampling rate. So, for example, if we predicted $p = 0.5$ in the example above, the actual probability should be more like $q = \frac{0.5}{0.5 + 0.5/0.01} \approx 0.0099$.
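The correction above is easy to sketch in code. This is a minimal illustration of the formula (function and variable names are mine, not from any particular library):

```python
def calibrate(p, w):
    """Map a probability p predicted in the downsampled space back to the
    original space, where w is the negative downsampling rate (fraction of
    negatives kept, e.g. 0.01 for a 1% downsampling)."""
    return p / (p + (1.0 - p) / w)

# A prediction of 0.5 in downsampled space, with 1% of negatives kept,
# maps back to roughly 0.0099 in the original space.
print(round(calibrate(0.5, 0.01), 6))  # 0.009901

# Sanity check: with no downsampling (w = 1) the probability is unchanged.
print(calibrate(0.5, 1.0))  # 0.5
```

Note that the mapping is monotonic, so rank-based metrics like AUC-ROC are unaffected; only the probability estimates themselves change.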
Two good first references on the matter are Dal Pozzolo et al. (2015), "Calibrating Probability with Undersampling for Unbalanced Classification", and Elkan (2001), "The Foundations of Cost-Sensitive Learning". (The formula I wrote above is effectively Eq. 3 from Dal Pozzolo et al.'s paper.)
Just to be clear: in any classification problem it is far better to focus on assigning costs to misclassifications than to keep optimising metrics like AUC-ROC, AUC-PR, Cohen's $\kappa$, and the like. As a real-life example: a screening tool and a diagnostic tool serve different purposes, so evaluating their utility with the same metric is probably an oversimplification.