
I have a very imbalanced dataset for a binary classification problem. The training set contains 150,000 samples of class 0 and 500 samples of class 1, i.e. about 0.33% positives.

When I train a model like a DecisionTree, I get an f1-score of ~0.011 across several runs.

I've read that methods for balancing an imbalanced dataset can help, so I tried them: I applied SMOTE, undersampling, and oversampling to the training set using the imbalanced-learn API. But the results got worse:

  • SMOTE f1-score: 0
  • oversampling f1-score: 0
  • undersampling f1-score: 0.001

My procedure summarized:

  1. Load the data
  2. Split the data into train (0.7) and test (0.3) sets
  3. Apply one (or none) of the balancing methods to the training set
  4. Train a decision tree and compute the f1-score on the test set

Looking only at these results, I would prefer to do the parameter optimization and feature selection without any balancing method.

What do you think? Is there an error in my procedure, or am I misunderstanding something? I would appreciate any information; I am still quite a beginner.

Thank you very much.
