
I have a binary classification problem and I'm working with an imbalanced dataset. The count for each class looks like this:

Training set:
Class 0: 29 cases
Class 1: 6246 cases

Test set:
Class 0: 2678 cases
Class 1: 12 cases

I applied under-sampling, so the training set now contains:

Class 0: 29 cases
Class 1: 29 cases
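
(A minimal sketch of that balancing step, assuming imblearn's RandomUnderSampler; the question doesn't state which implementation was actually used:)

from imblearn.under_sampling import RandomUnderSampler

# Randomly discard majority-class cases until both classes match the
# minority count (29 here). Apply this to the training split only.
rus = RandomUnderSampler(random_state=101)
x_train_res, y_train_res = rus.fit_resample(x_train, y_train)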

After training with the decision tree algorithm, these are the results I obtained:

Accuracy: 98.85%
Sensitivity: 0.00%
Specificity: 99.55%

The confusion matrix of the training set:

[[   7    5]
 [1446 1232]]

The confusion matrix of the test set:

[[   0   12]
 [  19 2659]]
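
(For context, sensitivity and specificity can be read directly off such a matrix. A minimal sketch with sklearn, where positive_class, negative_class, and y_pred are placeholders; the 0.00% sensitivity suggests the minority class is being treated as positive:)

from sklearn.metrics import confusion_matrix

# List the positive (minority) class first so the matrix reads
# [[TP, FN],
#  [FP, TN]]
cm = confusion_matrix(y_test, y_pred, labels=[positive_class, negative_class])
tp, fn, fp, tn = cm.ravel()
sensitivity = tp / (tp + fn)  # true positive rate: 0/12 = 0.00% on the test set above
specificity = tn / (tn + fp)  # true negative rate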

How should I fix this problem? The train_test_split test proportion is 0.3; should I decrease it?

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101, stratify=y)
  • If you split the train/test, why don't you preserve the class ratios? It seems like you've created the extreme imbalance yourself. – gunes Dec 01 '20 at 09:47
  • @gunes I'm working with sklearn; to split I used stratify=y, which I understood preserves the class ratios. Is there another way? – notarealgreal Dec 01 '20 at 09:57
  • My favorite tweet is by our Frank Harrell and is about SMOTE: https://twitter.com/f2harrell/status/1062424969366462473 – Dave Dec 01 '20 at 12:28
  • @Dave Using random oversampling or SMOTE oversampling, the results with the decision tree algorithm are pretty much the same. Should I try Random Forest, Random Tree, and some other ensemble algorithms? I was expecting at least a small improvement from training on the oversampled dataset, independently of the algorithm. – notarealgreal Dec 01 '20 at 12:48
  • https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he – mkt Aug 03 '23 at 12:14

1 Answer


Way 1: Give a weight to your minority class. Random forest has a class_weight parameter; pass class_weight = {class_label: class_weight}.
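
A minimal sketch of this, assuming sklearn's RandomForestClassifier (the weights are illustrative, not tuned):

from sklearn.ensemble import RandomForestClassifier

# Upweight the minority class (class 0 in the training set); these
# exact weights are only an example.
clf = RandomForestClassifier(class_weight={0: 100.0, 1: 1.0}, random_state=0)
# Or let sklearn set weights inversely proportional to class frequencies:
# clf = RandomForestClassifier(class_weight='balanced', random_state=0)
clf.fit(x_train, y_train)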

Way 2: Create synthetic data for your minority class. You can use SMOTE for this:

from imblearn.over_sampling import SMOTE

# Generate synthetic minority-class samples on the training split.
# Note: fit_sample was renamed to fit_resample in recent imblearn releases.
sm = SMOTE(random_state=2)
x_train_, y_train_ = sm.fit_resample(x_train, y_train)
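
Note that SMOTE should be fit on the training split only; oversampling before the train/test split leaks information, because synthetic points interpolated from future test-set neighbors end up in the training data.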
  • I used the second option too, SMOTE oversampling implemented with imblearn.over_sampling, but the results are pretty much the same. – notarealgreal Dec 01 '20 at 11:19