
I am using Kaggle's stroke dataset (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset) to predict the stroke target feature from multiple predictive features.

The stroke feature takes the value 1 or 0, so it is well suited for binary classification.

I am using logistic regression from the sklearn library. The problem with this dataset is that it is unbalanced: there are approximately 210 stroke cases (stroke = 1) and 4,000 no-stroke cases (stroke = 0).

Here is my code:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X = data_Enco.iloc[:, data_Enco.columns != 'stroke'].values  # features
Y = data_Enco.iloc[:, 6]  # labels (the 'stroke' column)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)

# Train the model with class weights inversely proportional to class frequencies
logisticModel = LogisticRegression(class_weight='balanced')
logisticModel.fit(X_train, Y_train)

predictions_log = logisticModel.predict(X_test)
print(classification_report(Y_test, predictions_log))

Check out the classification report:

              precision    recall  f1-score   support

           0       0.99      0.66      0.79       935
           1       0.11      0.83      0.19        47

    accuracy                           0.66       982
   macro avg       0.55      0.74      0.49       982
weighted avg       0.95      0.66      0.76       982

The precision is pretty bad for stroke = 1.

How do I fix this?

1 Answer


Use LogisticRegression.predict_proba() to extract the predicted probabilities. Then compare them to a different threshold than the 0.5 that is inexplicably built into LogisticRegression.predict(). Tweak the threshold until you get a precision you are happy with. I very, very much recommend the thread "Reduce Classification Probability Threshold".
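For instance, here is a minimal sketch of the idea (it reuses logisticModel, X_test and Y_test from your code; the 0.8 threshold is only an illustrative starting point, not a recommendation):

from sklearn.metrics import classification_report, precision_recall_curve

# Predicted probability of the positive class (stroke = 1)
probs = logisticModel.predict_proba(X_test)[:, 1]

# Predict stroke only when the probability clears the custom threshold
threshold = 0.8
predictions_thresh = (probs >= threshold).astype(int)
print(classification_report(Y_test, predictions_thresh))

# Inspect the full precision/recall trade-off over all candidate thresholds
precisions, recalls, thresholds = precision_recall_curve(Y_test, probs)

Raising the threshold makes the model more conservative about predicting stroke, which typically raises precision at the cost of recall; precision_recall_curve lets you see the entire trade-off at once.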

Note that precision is a highly problematic evaluation metric for all the same reasons accuracy is.
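If you want to sidestep the threshold issue altogether, one option (a suggestion on my part, not something your setup requires) is to evaluate the predicted probabilities directly with a proper scoring rule such as the log loss or the Brier score:

from sklearn.metrics import log_loss, brier_score_loss

# Both metrics score the predicted probabilities themselves, so no
# classification threshold is involved; lower is better for both.
probs = logisticModel.predict_proba(X_test)[:, 1]
print("log loss:   ", log_loss(Y_test, probs))
print("Brier score:", brier_score_loss(Y_test, probs))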

You may also be interested in the thread "Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?".

Stephan Kolassa