I have a highly imbalanced dataset: about 230 cases of class 1 in the target versus more than 3,800 of class 0, i.e. roughly 6% positives. I used SMOTE to resample the training set and then built a logistic regression model. Logistic regression seemed perfect to me for this kind of data since the target is binary, but the results aren't much better after applying SMOTE.
Here is the code:
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

# Oversample the minority class in the training set only
smote = SMOTE()
X_train_resampled, Y_train_resampled = smote.fit_resample(X_train, Y_train)

# Fit the model on the resampled data and predict on the untouched test set
logisticModel = LogisticRegression().fit(X_train_resampled, Y_train_resampled)
Y_pred = logisticModel.predict(X_test)

# Score the model
print(logisticModel.score(X_test, Y_test))
print(classification_report(Y_test, Y_pred))
# ConfusionMatrixDisplay replaces the deprecated plot_confusion_matrix
ConfusionMatrixDisplay.from_predictions(Y_test, Y_pred)
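For context, X_train, X_test, Y_train and Y_test come from a standard train/test split. A minimal sketch of the assumed setup (X and y are hypothetical names for my features and target; test_size is a guess based on the 982-row test set):

from sklearn.model_selection import train_test_split

# Hypothetical setup: X holds the features, y the binary target.
# Stratifying keeps the ~6% positive rate the same in both splits.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)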
Classification report after SMOTE:
              precision    recall  f1-score   support

           0       0.99      0.74      0.85       942
           1       0.11      0.75      0.19        40

    accuracy                           0.74       982
   macro avg       0.55      0.75      0.52       982
weighted avg       0.95      0.74      0.82       982
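For the before-SMOTE numbers below, I fit the same model on the raw training data. A minimal sketch (baselineModel is just an illustrative name; same variables as above):

# Baseline: identical model, no resampling
baselineModel = LogisticRegression().fit(X_train, Y_train)
print(classification_report(Y_test, baselineModel.predict(X_test)))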
Before SMOTE:
              precision    recall  f1-score   support

           0       0.98      0.69      0.81       934
           1       0.11      0.75      0.19        48

    accuracy                           0.69       982
   macro avg       0.55      0.72      0.50       982
weighted avg       0.94      0.69      0.78       982
So as you can see, precision for class 1 hasn't improved at all; only the recall for class 0 went up slightly. SMOTE is one of the most widely used oversampling techniques and it has worked for me before, but in this case I seem to be missing something.
What should I do?