Like many other claimed “classification” models, a logistic regression returns predictions on a continuum; by itself, it does no classification at all. The classification comes from a two-step pipeline: get continuous predictions from the logistic regression, then apply a decision rule to bucket those predictions into categories, typically a threshold above which the prediction is category $1$ and below which it is category $0$ (or $-1$, depending on how you’ve coded them). Thus, if you’re unhappy with the classification performance, there are two possible culprits: the logistic regression is doing a poor job of distinguishing between the categories, or the decision rule is inappropriate despite fine performance by the regression.
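As a minimal sketch of that pipeline (using scikit-learn and synthetic data; the variable names are illustrative, not from any particular analysis), the software’s “classifications” are just thresholded probabilities:

```python
# Minimal sketch: a "classifying" logistic regression is a probability
# model (stage 1) plus a thresholding decision rule (stage 2).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression().fit(X, y)

probs = model.predict_proba(X)[:, 1]   # stage 1: continuous predictions
labels = (probs >= 0.5).astype(int)    # stage 2: the decision rule

# scikit-learn's predict() is exactly this pipeline with a fixed 0.5 cut.
print(np.array_equal(labels, model.predict(X)))  # True (barring exact 0.5 ties)
```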
Unfortunately, much data science training treats this two-stage pipeline as a single stage, where the decision rule is fixed as assigning to the most probable category and all of the effort goes into the first stage. While this does not sound outrageous at first, in imbalanced problems it is entirely possible for the minority category never to be the more probable one, in which case such a decision rule leads to no predictions of the minority category at all. However, nothing forces you to use that decision rule. There are reasons to look at the probability outputs directly without bucketing them, but seeing how your classification metrics behave over a range of thresholds can help you feel better about how your model is performing (at least in terms of its ability to distinguish between the two categories). Maybe your precision, recall, and $F_1$ score are poor at the software-default decision rule, but how are they for other decision rules? You can answer this by looping over many thresholds, bucketing into categories according to each threshold, and plotting the performance by threshold, as in the sketch below.
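Here is that loop, again with scikit-learn and synthetic imbalanced data (the 90/10 class balance and the holdout split are assumptions of mine for illustration; swap in your own fitted model and holdout set):

```python
# Sweep the decision threshold after fitting a logistic regression and
# plot precision, recall, and F1 as functions of the threshold.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 10% minority class.
X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Stage 1: fit the model and get the continuous predictions.
model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # predicted P(category 1)

# Stage 2: apply many candidate decision rules, not just 0.5.
thresholds = np.linspace(0.01, 0.99, 99)
precisions, recalls, f1s = [], [], []
for t in thresholds:
    preds = (probs >= t).astype(int)  # bucket the probabilities
    precisions.append(precision_score(y_test, preds, zero_division=0))
    recalls.append(recall_score(y_test, preds))
    f1s.append(f1_score(y_test, preds))

# Plot the classification performance by threshold.
plt.plot(thresholds, precisions, label="precision")
plt.plot(thresholds, recalls, label="recall")
plt.plot(thresholds, f1s, label="$F_1$")
plt.xlabel("decision threshold")
plt.ylabel("score")
plt.legend()
plt.show()
```

A plot like this will often show thresholds where the metrics are respectable even when the $0.5$ default looks hopeless on an imbalanced problem.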
This still misses out on the richness of the probability predictions and what you can do when they are good, but you will learn more about how well the model distinguishes between categories this way than you will by using the software-default decision rule. Further, this neither requires you to lie to the model about the data by running techniques like SMOTE or under/oversampling, nor requires you to fiddle with weighted loss functions whose effectiveness you cannot assess until after you’ve trained the model (though the computing time for a logistic regression is probably not so great as to make retraining unrealistic), since assessing the performance at many thresholds happens after fitting the model, not before.
I still encourage you to get out of the classification mindset and into one where the goal is to predict probabilities accurately, but seeing what happens for many possible second stages of the pipeline can reveal that your ability to classify is not as bad as you thought.
To give an idea of what the probabilities themselves can do, consider the problem of lending money: the lender does not just decide to lend or not to lend. The lender also sets an interest rate, and a person with a higher probability of defaulting gets a loan with a higher interest rate. This is basically how credit scores work: when you have a low credit score, your probability of default is assumed to be high, so your interest rate is high to compensate the lender for taking that risk.
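As a stylized, back-of-the-envelope illustration (the one-period setup, total loss on default, and the specific numbers are assumptions of mine, not how real lenders price loans): if a loan defaults with probability $p$ and the lender loses everything on default, breaking even against a risk-free rate $r_f$ requires

$$(1 - p)(1 + r) = 1 + r_f \quad\Longrightarrow\quad r = \frac{1 + r_f}{1 - p} - 1.$$

With $r_f = 0.03$, a borrower with $p = 0.02$ needs $r \approx 5.1\%$, while one with $p = 0.10$ needs $r \approx 14.4\%$. A thresholded lend/don’t-lend label carries none of this information; the probability does.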
LINKS OF POSSIBLE INTEREST
Predictions vs Decisions [especially Kolassa’s answer]
Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
Why is accuracy not the best measure for assessing classification models? [especially Kolassa’s answer]
Proper scoring rule when there is a decision to make (e.g. spam vs ham email) [especially Kolassa’s answer]
Regression on imbalanced and zero-inflated data. How to deal with less frequent values? [This is where I opine that an inability to predict unusual events is expected behavior unless you have some characteristic of those events that distinguishes them from "business as usual".]
How to calculate accuracy of a logistic regression? [Data Science Stack Exchange]
Harrell’s Blog: Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules
Harrell’s Blog: Classification vs. Prediction