
I have trained a machine learning model (supervised, classification, LinearSVC) on a balanced dataset, which produces relatively good results on the test data. I am happy with the numbers, but not confident about the application in real life.

However, in real life I anticipate the incoming data will be highly imbalanced: up to 99% of cases will come from class A and only 1% from class B. Will this affect my model's performance, and does it make my results less reliable?

If that is the case, should I take a different approach?

  • Yes, it probably matters. Did you get the balanced training data by artificially balancing the dataset yourself? – Dave Jan 18 '22 at 14:42
  • Yes, it matters. In particular, any probability estimates will be biased. If the underlying classifier works by estimating probabilities and using an arbitrary cutpoint, then your predictions may suffer. – Demetri Pananos Jan 18 '22 at 14:48
  • Yes, I did get a dataset with the same number of class A and class B points. However, I know that in the real world this won't be the case.

    Is there any way I can address this issue?

    – zuccinni Jan 18 '22 at 15:02

1 Answer


Yes, it absolutely does matter. For the SVM, the misclassification costs are determined by the regularisation parameter $C$. Most good implementations allow different misclassification costs for each class, $C_-$ and $C_+$. Balancing the dataset is equivalent to changing these misclassification costs, so that a greater penalty is applied for misclassifying examples from the minority class. This means that in operation your SVM will assign more patterns to the minority class than it should, assuming the true misclassification costs are equal.

I wrote a paper about this back in the mists of ancient history:

G. C. Cawley and N. L. C. Talbot, Manipulation of prior probabilities in support vector classification, In Proceedings of the IEEE/INNS International Joint Conference on Neural Networks (IJCNN-2001), pp. 2433-2438, Washington, D.C., U.S.A., July 15-19 2001. (preprint)

If your dataset is artificially balanced, you can undo this by changing the misclassification costs to down-weight what would be the minority class.

IIRC, if you have a 99:1 split in operation and a 50:50 training set, then $C_+ = 99C$ and $C_- = C$ (taking the positive class to be the operational majority), where $C$ is a regularisation hyper-parameter that you need to tune, e.g. by cross-validation, to avoid over-fitting.
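As a concrete sketch of this correction (the 99:1 ratio, the synthetic data, and the use of scikit-learn's `LinearSVC` are illustrative assumptions, not part of the answer), per-class costs can be set via the `class_weight` parameter, which multiplies $C$ for each class:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Artificially balanced training set: 200 points per class,
# class 0 centred at (-1, -1) and class 1 at (+1, +1).
X = np.vstack([rng.normal(-1.0, 1.0, size=(200, 2)),
               rng.normal(+1.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

# Suppose class 0 is the 99% majority in operation and class 1 the 1%
# minority. Undo the artificial balancing by giving class 0 an effective
# cost of 99*C and class 1 a cost of C, as described above.
clf = LinearSVC(C=1.0, class_weight={0: 99.0, 1: 1.0}, dual=False)
clf.fit(X, y)
```

With these weights, the decision boundary shifts towards the minority class, so fewer patterns are assigned to it than with the balanced (unweighted) fit. In practice, $C$ itself should still be tuned by cross-validation as the answer notes.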

Essentially, the key to imbalanced learning problems is that the minority class is "more important" in some way than the majority class, so you need to work out what reasonable misclassification costs are and use them when setting your regularisation/slack penalty parameters ($C_-$ and $C_+$). In most cases there is no "class imbalance" problem as such (provided you have sufficient data); it boils down to a cost-sensitive learning problem, where the imbalance is irrelevant in the sense that the solution is the same as for cost-sensitive learning on a dataset with a natural 50:50 split.

Dikran Marsupial
  • Thank you for your answer! In practice, would this approach be better than trying to train the model on the highly imbalanced data directly, if that is even something to consider?

    Basically, is training on balanced data and then adjusting the misclassification costs the right approach, given that I also have the option of accessing a dataset that reflects the real-world imbalance?

    – zuccinni Jan 18 '22 at 15:09
  • If you have access to all of the data, then I would retrain on that, with equal misclassification costs (if false-positive and false-negative costs for your application are actually equal!). What you don't want to do is throw away data if training is still within your computational budget. For very large problems you may need to balance and use different misclassification costs to keep the task (including hyper-parameter tuning) computationally feasible, but otherwise it is best just to keep things as simple as possible. – Dikran Marsupial Jan 18 '22 at 15:18
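As a minimal sketch of the suggestion in the last comment (the synthetic data and the ~99:1 ratio are illustrative assumptions), training directly on the full imbalanced dataset with equal misclassification costs is just the default fit, with no class weighting at all:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Full operational dataset with the ~99:1 imbalance kept intact:
# 9900 points from class 0, 100 from class 1.
X = np.vstack([rng.normal(-1.0, 1.0, size=(9900, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.array([0] * 9900 + [1] * 100)

# class_weight=None (the default) gives every example the same cost C,
# which is appropriate when false-positive and false-negative costs
# for the application are actually equal.
clf = LinearSVC(C=1.0, dual=False).fit(X, y)
```

This keeps things as simple as possible: no subsampling, no weight correction, and all of the data is used.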