
Hi, I'm currently writing my bachelor's thesis and I'm stuck at a few steps.
I've developed a few ML models (XGBoost, (Balanced) Random Forest, ElasticNet, ...) on an extremely imbalanced data set (only about 0.2% of the instances belong to the positive class). Almost all of my models show roughly the same performance on the metrics I chose:

  • ROC AUC: 0.77-0.80
  • Recall for the positive class: 0.80-0.88
  • PR AUC: 0.04-0.06
  • Precision for the positive class: 0.01-0.02
  • Matthews correlation coefficient (MCC): 0.13-0.15
  • Brier score: 0.11-0.13

I'm quite stressed out because the metrics that are normally sensitive to class imbalance are really bad. I have tried several sampling methods, including some variations of SMOTE and undersampling (for which I even implemented a cross-validation script to find the best undersampling rate; see the sketch below), and I also tried class weights, but the results don't seem to get better. If anyone has any suggestions, it would mean the world to me!

Some background information: the model should be a classifier for credits, and there are only two classes, good and bad credits. I've read in some forums that this kind of result is acceptable if recall is more important and false positives (which are normally numerous due to the imbalance) are not so "expensive". But classifying good credits as bad credits is, in fact, bad here, isn't it?
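For context, here is a minimal sketch of that kind of cross-validated undersampling-rate search, using imbalanced-learn and scikit-learn (the estimator, rate grid, and scoring choice are illustrative assumptions, not my exact script):

```python
# Minimal sketch of a cross-validated undersampling-rate search
# (illustrative estimator and grid, not the exact thesis script).
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def best_undersampling_rate(X, y, rates=(0.05, 0.1, 0.25, 0.5, 1.0)):
    """Pick the minority/majority ratio with the best mean PR AUC."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = {}
    for rate in rates:
        pipe = Pipeline([
            # Resampling runs inside each training fold only, so the
            # test folds keep the original class distribution.
            ("under", RandomUnderSampler(sampling_strategy=rate,
                                         random_state=42)),
            ("clf", RandomForestClassifier(n_estimators=300,
                                           random_state=42)),
        ])
        scores[rate] = cross_val_score(pipe, X, y, cv=cv,
                                       scoring="average_precision").mean()
    return max(scores, key=scores.get), scores
```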
Thank you for reading, and I would appreciate any help!
-----------------------------------------------------------------
P.S.: I also want to try out some new metrics for this imbalanced classification problem. The suggested metrics are: Cohen's kappa; weighted-averaged accuracy and F1 score; macro-averaged accuracy and F1 score.
If anyone has a suggestion for other metrics I could use, I'd also appreciate it!
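These can all be computed with scikit-learn; a minimal sketch (the y arrays are placeholders, and I'm treating balanced accuracy as the closest sklearn equivalent of macro-averaged accuracy):

```python
# Sketch: the suggested metrics via scikit-learn (y_true / y_pred are
# placeholders for held-out labels and hard predictions).
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             f1_score)

def imbalance_metrics(y_true, y_pred):
    return {
        "cohen_kappa": cohen_kappa_score(y_true, y_pred),
        # Balanced accuracy (mean per-class recall) is the closest
        # sklearn equivalent of macro-averaged accuracy.
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
    }
```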

user159373
  • 1) Are you sure the imbalance is really a problem? // 2) How do you know your performance metrics are so terrible? What context do you have for your Brier score, for instance? For the pure classification metrics like precision, how do they look at thresholds besides the software default? // 3) Should you be able to predict accurately? What distinguishes the fraud cases from the legitimate cases? – Dave Jan 30 '24 at 02:13
  • @Dave Thank you for your reply; the topics you raised got me thinking a lot. For 2): I can only say the metrics are so terrible because there are so many negative instances that all of my models struggle to separate them from the positive class, leading to a high number of false positives and therefore poor precision. I chose a threshold using Youden's J statistic (see the sketch after these comments); other thresholds seem to worsen the metrics. – user159373 Jan 30 '24 at 06:49
  • @Dave I could only see the Brier score (BS) as a metric to evaluate calibration loss. I believe that in a sensitive field where you try to predict the probability of default (PD), the predicted probabilities must be calibrated in order to be called "probabilities". A bit of background information: I'm using the Single-Family Loan-Level Dataset provided by Freddie Mac for this classification problem. I labeled mortgage loans that were $\geq$ 90 days delinquent as bad and $<$ 90 days as good. – user159373 Jan 30 '24 at 06:54
  • Can you show the actual PR curve? – Ben Reiniger Jan 31 '24 at 23:25
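
For reference, a minimal sketch of the Youden's J threshold selection mentioned in the comments above (using scikit-learn's roc_curve; y_true and y_score are placeholders for validation labels and predicted probabilities):

```python
# Sketch: pick the decision threshold that maximizes Youden's J = TPR - FPR
# (y_true / y_score are placeholders for labels and predicted probabilities).
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]
```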

0 Answers