Yes, it absolutely does matter. For the SVM the misclassification costs are determined by the regularisation parameter $C$, and most good implementations allow you to use different misclassification costs for each class, $C_-$ and $C_+$. If you balance the dataset, that is equivalent to changing the misclassification costs, so that a greater penalty is applied for misclassifying examples from the minority class. This means that in operation your SVM will assign more patterns to the minority class than it should, assuming the true misclassification costs are equal.
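As a minimal sketch of what "different costs per class" looks like in practice (assuming scikit-learn, whose `SVC` has a `class_weight` argument that sets the effective penalty for class $k$ to `C * class_weight[k]`; the factor of 10 below is purely illustrative):

```python
from sklearn.svm import SVC

# Penalise errors on the minority class (label 1) ten times as heavily
# as errors on the majority class (label 0). This is equivalent to
# using per-class penalties C_+ = C and C_- = 10C.
clf = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 10.0})
```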
I wrote a paper about this back in the mists of ancient history:
G. C. Cawley and N. L. C. Talbot, "Manipulation of prior probabilities in support vector classification," in *Proceedings of the IEEE/INNS International Joint Conference on Neural Networks (IJCNN-2001)*, pp. 2433–2438, Washington, D.C., U.S.A., July 15–19 2001. (preprint)
If your dataset is artificially balanced, you can undo this by changing the misclassification costs to down-weight what would be the minority class.
IIRC if you have a 99:1 split in operation and 50:50 in the training set, then the penalties should be in the ratio $99:1$, i.e. $C_+ = 99C$ and $C_- = C$ (with $C_+$ applied to the operational majority class), where $C$ is a regularisation hyper-parameter that you need to tune, e.g. by cross-validation, to avoid over-fitting.
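A sketch of that correction, again assuming scikit-learn's `SVC` (the 99:1 and 50:50 priors are the ones from the example above; the $\pm 1$ labels are an illustrative choice):

```python
from sklearn.svm import SVC

p_op = {+1: 0.99, -1: 0.01}   # class priors in operation
p_tr = {+1: 0.50, -1: 0.50}   # class priors in the balanced training set

# Reweight each class by (operational prior / training prior): the
# majority class gets 0.99/0.5 and the minority class 0.01/0.5, a 99:1
# ratio, i.e. C_+ = 99C and C_- = C up to a common factor absorbed into C.
weights = {k: p_op[k] / p_tr[k] for k in p_op}

# C itself still has to be tuned, e.g. by cross-validation.
clf = SVC(C=1.0, class_weight=weights)
```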
Essentially the key to imbalanced learning problems is that the minority class is "more important" in some way than the majority class, so you need to work out what the reasonable misclassification costs are and use them when setting the regularisation/slack penalty parameters ($C_-$ and $C_+$). In most cases there is no "class imbalance" problem as such (provided you have sufficient data); it really just boils down to a cost-sensitive learning problem, where the imbalance is irrelevant in the sense that the solution is the same as for cost-sensitive learning on a dataset with a natural 50:50 split.
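Put differently, if your domain knowledge gives you the misclassification costs directly, you can plug them straight in without ever rebalancing the data. A hedged sketch, with purely illustrative cost figures:

```python
from sklearn.svm import SVC

# Illustrative costs: missing a minority-class case (label +1) is five
# times as costly as a false alarm on the majority class (label -1).
cost_minority = 5.0
cost_majority = 1.0

# The class weights encode the costs directly, turning the "imbalance"
# problem into plain cost-sensitive learning.
clf = SVC(C=1.0, class_weight={+1: cost_minority, -1: cost_majority})
```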