
I am using Kaggle's stroke dataset (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset) to predict the stroke target feature from multiple predictive features.

The stroke feature takes the value 1 or 0, so it is well suited for binary classification.

I am using logistic regression from the sklearn library. The problem with this dataset is that it is unbalanced: there are approximately 210 stroke cases (stroke = 1) and 4,000 no-stroke cases (stroke = 0).

Here is my code:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X = data_Enco.iloc[:, data_Enco.columns != 'stroke'].values  # features
Y = data_Enco.iloc[:, 6]  # labels (the 'stroke' column)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)

# Train the model with class weights inversely proportional to class frequencies
logisticModel = LogisticRegression(class_weight='balanced')
logisticModel.fit(X_train, Y_train)

predictions_log = logisticModel.predict(X_test)
print(classification_report(Y_test, predictions_log))

Check out the classification report:

              precision    recall  f1-score   support

           0       0.99      0.66      0.79       935
           1       0.11      0.83      0.19        47

    accuracy                           0.66       982
   macro avg       0.55      0.74      0.49       982
weighted avg       0.95      0.66      0.76       982

The precision is pretty bad for stroke = 1.

How do I fix this?

1 Answer


Use LogisticRegression.predict_proba() to extract the predicted probabilities. Then compare them to a different threshold than the 0.5 that is inexplicably built into LogisticRegression.predict(). Tweak the threshold until you get a precision you are happy with. I very, very much recommend the thread "Reduce Classification Probability Threshold".
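For instance, here is a minimal sketch of the idea (it reuses logisticModel, X_test and Y_test from your code; the 0.8 threshold is only an illustrative starting point, not a recommendation):

from sklearn.metrics import classification_report, precision_recall_curve

# Predicted probability of the positive class (stroke = 1)
probs = logisticModel.predict_proba(X_test)[:, 1]

# Predict stroke only when the probability clears the custom threshold
threshold = 0.8
predictions_thresh = (probs >= threshold).astype(int)
print(classification_report(Y_test, predictions_thresh))

# Inspect the full precision/recall trade-off over all candidate thresholds
precisions, recalls, thresholds = precision_recall_curve(Y_test, probs)

Raising the threshold makes the model more conservative about predicting stroke, which typically raises precision at the cost of recall; precision_recall_curve lets you see the entire trade-off at once.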

Note that precision is a highly problematic evaluation metric for all the same reasons accuracy is.
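If you want to sidestep the threshold issue altogether, one option (a suggestion on my part, not something your setup requires) is to evaluate the predicted probabilities directly with a proper scoring rule such as the log loss or the Brier score:

from sklearn.metrics import log_loss, brier_score_loss

# Both metrics score the predicted probabilities themselves, so no
# classification threshold is involved; lower is better for both.
probs = logisticModel.predict_proba(X_test)[:, 1]
print("log loss:   ", log_loss(Y_test, probs))
print("Brier score:", brier_score_loss(Y_test, probs))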

You may also be interested in the thread "Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?".

Stephan Kolassa