
Let's say I have a dataset with 100,000 class A training observations and 400 class B training observations, and I want to use a support vector machine (SVM) for this binary classification problem. Instead of applying random undersampling or SMOTE, I want to apply the following method: I will divide my class A observations into 400 distinct batches of 250 observations each (100,000 / 400 = 250) and add all 400 class B observations to each of the 400 batches. Then I will take the average of the results (accuracy, F1, average precision) obtained from each of the 400 batches.
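
For concreteness, here is a minimal sketch of what I mean (assuming scikit-learn; X_a, X_b, X_test, and y_test are hypothetical arrays holding the two classes' training rows and a held-out test set):

    import numpy as np
    from sklearn.metrics import accuracy_score, average_precision_score, f1_score
    from sklearn.svm import SVC

    # X_a, X_b: class A and class B training observations (hypothetical names);
    # X_test, y_test: a held-out evaluation set.
    rng = np.random.default_rng(0)
    batches = np.array_split(rng.permutation(len(X_a)), 400)  # 400 batches of 250

    scores = []
    for batch in batches:
        # Each batch: one 250-row slice of class A plus ALL 400 class B rows.
        X = np.vstack([X_a[batch], X_b])
        y = np.concatenate([np.zeros(len(batch)), np.ones(len(X_b))])
        clf = SVC().fit(X, y)
        pred = clf.predict(X_test)
        scores.append((accuracy_score(y_test, pred),
                       f1_score(y_test, pred),
                       average_precision_score(y_test, clf.decision_function(X_test))))

    print(np.mean(scores, axis=0))  # averaged accuracy, F1, average precision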

Is following such a method completely wrong? Does it give very optimistic results? What are the possible misleading effects?

Thank you.

glslmn

1 Answer


As per the method described, you are reusing the 400 class B observations in every batch, so across the ensemble class B is effectively duplicated until it matches the volume of class A. The models therefore over-learn the few class B examples and treat whatever is present in the training set as the real truth. This leads to overfitting, high variance, and unstable models. If you are using an SVM, use the class_weight parameter instead to specify how much importance is placed on classifying class B correctly, and use cross-validation to identify a specific weight for B. This modifies the optimisation function to severely penalise class B misclassifications.
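
A minimal sketch of that approach, assuming scikit-learn (X_train and y_train stand for your full, imbalanced training set, with class B labelled 1):

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Try several weights for class B and let cross-validation pick one,
    # scoring by average precision so the minority class drives the choice.
    param_grid = {"class_weight": [{0: 1, 1: w} for w in (10, 50, 100, 250)]}
    search = GridSearchCV(SVC(), param_grid, scoring="average_precision", cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)

class_weight="balanced", which scales the weights inversely to the class frequencies, is also a sensible value to include in the grid.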