
I have a highly imbalanced dataset: about 230 cases of class 1 in the target feature and more than 3,800 of class 0. I used SMOTE to resample the training set and then built a logistic regression model. Logistic regression seemed perfect to me for this kind of data since the target is binary, but the results aren't much better after applying SMOTE.
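For context, this is how the imbalance shows up when counting the target classes (a minimal sketch, assuming the full target is a pandas Series named Y; the counts are approximate, per the numbers above):

# Class distribution of the target feature
print(Y.value_counts())
# 0    3800
# 1     230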

Here is the code:

from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Oversample the minority class in the training set only (by default
# SMOTE synthesizes new minority samples until the classes are balanced)
smote = SMOTE()
X_train_resampled, Y_train_resampled = smote.fit_resample(X_train, Y_train)
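To sanity-check the resampling, the class counts can be compared before and after (a quick sketch; Counter comes from the standard library):

from collections import Counter

# After SMOTE the two classes in the training set should be roughly equal
print("Before:", Counter(Y_train))
print("After: ", Counter(Y_train_resampled))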

Fit the model and make predictions:

logisticModel = LogisticRegression().fit(X_train_resampled, Y_train_resampled)
Y_pred = logisticModel.predict(X_test)

Score the model and print the classification report:

# .score() reports plain accuracy, which the majority class dominates here
logisticModel.score(X_test, Y_test)

print(classification_report(Y_test, Y_pred))
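For the actual confusion matrix (as opposed to the classification report), a minimal way to print it:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predictions:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(Y_test, Y_pred))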

Classification report after SMOTE:

              precision    recall  f1-score   support

           0       0.99      0.74      0.85       942
           1       0.11      0.75      0.19        40

    accuracy                           0.74       982
   macro avg       0.55      0.75      0.52       982
weighted avg       0.95      0.74      0.82       982

Before SMOTE:

              precision    recall  f1-score   support

           0       0.98      0.69      0.81       934
           1       0.11      0.75      0.19        48

    accuracy                           0.69       982
   macro avg       0.55      0.72      0.50       982
weighted avg       0.94      0.69      0.78       982

So as you can see, the precision hasn't improved at all; only the recall is slightly better. SMOTE is one of the best-known resampling techniques and I've used it successfully before, but in this case I seem to be missing something.

What should I do?
