I am using Kaggle's stroke dataset, trying to predict the stroke target feature from multiple predictive features. https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
The stroke feature is either 1 or 0, so it is well suited for binary classification.
I am using logistic regression from the sklearn library. The problem with this dataset is that it is imbalanced: there are approximately 210 stroke cases (stroke = 1) and 4000 no-stroke cases (stroke = 0).
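For reference, this is roughly how the class distribution can be checked (a minimal sketch, assuming the raw Kaggle CSV has been loaded into a pandas DataFrame; the file name is an assumption and may differ from the downloaded file):

import pandas as pd

# Load the Kaggle stroke CSV (file name assumed; adjust to the actual download)
data = pd.read_csv('healthcare-dataset-stroke-data.csv')

# Count how many rows fall into each class of the target
print(data['stroke'].value_counts())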
Here is my code:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X = data_Enco.iloc[:, data_Enco.columns != 'stroke'].values  # features: every column except 'stroke'
Y = data_Enco['stroke'].values                               # labels: the 'stroke' column, selected by name
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
logisticModel = LogisticRegression(class_weight='balanced')  # reweight classes to counter the imbalance
logisticModel.fit(X_train, Y_train)  # train the model
predictions_log = logisticModel.predict(X_test)
print(classification_report(Y_test, predictions_log))
Here is the classification report:
              precision    recall  f1-score   support

           0       0.99      0.66      0.79       935
           1       0.11      0.83      0.19        47

    accuracy                           0.66       982
   macro avg       0.55      0.74      0.49       982
weighted avg       0.95      0.66      0.76       982
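For completeness, the actual confusion matrix can also be printed from the same predictions (a small sketch using sklearn.metrics.confusion_matrix):

from sklearn.metrics import confusion_matrix

# Rows are the true classes (0, 1), columns are the predicted classes (0, 1)
print(confusion_matrix(Y_test, predictions_log))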
The precision for the stroke = 1 class is pretty bad (0.11).
How do I fix this?