
I'm participating in a competition on binary disease classification and I'm using an XGBoost model to classify my data. However, I'm getting more false negatives than I would like, which is dragging down the model's metrics.

To give you a better understanding, here is part of the preprocessing and modeling code:

from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

def preprocess_data(personal_info_train, measurements_train, personal_info_test, measurements_test):
    # ... data merging, handling missing values, encoding categorical columns, etc. ...

Training and evaluation function:

def train_and_evaluate(training_data, target_labels, categorical_columns):
    # One-hot encode the categorical columns; pass numeric columns through unchanged
    preprocessor = make_column_transformer(
        (OneHotEncoder(), categorical_columns),
        remainder='passthrough'
    )

    # Hold out 20% of the data for validation
    training_features, validation_features, training_labels, validation_labels = train_test_split(
        training_data, target_labels, test_size=0.2
    )

    # Negative/positive ratio, used to up-weight the minority (positive) class
    scale_pos_weight = len(training_labels[training_labels == 0]) / len(training_labels[training_labels == 1])

    model_pipeline = make_pipeline(
        preprocessor,
        XGBClassifier(
            n_jobs=2, learning_rate=0.012, n_estimators=1800, max_depth=14,
            min_child_weight=2, gamma=0.1, subsample=0.7, colsample_bytree=0.8,
            reg_lambda=0.8, reg_alpha=0.8, random_state=42,
            scale_pos_weight=scale_pos_weight
        )
    )

(For the full code, you can refer to my GitHub repository.)

I've been trying to adjust hyperparameters, but that hasn't reduced the number of false negatives significantly. Given this, I suspect the problem is related to the class imbalance, which I tried to handle using the scale_pos_weight parameter.
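For concreteness, this is the kind of adjustment I've been experimenting with: scaling scale_pos_weight above the plain negative/positive ratio so that false negatives are penalized more heavily. This is only a sketch (the multiplier grid is arbitrary, the other hyperparameters are omitted for brevity, and it reuses the variables from train_and_evaluate above):

    from sklearn.metrics import precision_score, recall_score

    base_ratio = len(training_labels[training_labels == 0]) / len(training_labels[training_labels == 1])

    # Weight the positive class progressively more aggressively than the base ratio
    for multiplier in [1.0, 1.5, 2.0, 3.0]:
        candidate = make_pipeline(
            preprocessor,
            XGBClassifier(scale_pos_weight=base_ratio * multiplier, random_state=42)
        )
        candidate.fit(training_features, training_labels)
        predictions = candidate.predict(validation_features)
        print(f"multiplier={multiplier:.1f}  "
              f"recall={recall_score(validation_labels, predictions):.3f}  "
              f"precision={precision_score(validation_labels, predictions):.3f}")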

Here are some performance metrics I've been getting:

Metric               Value
Train_AUC            1
Validation_AUC       0.980784724
Accuracy             0.987575243
Precision            0.910284464
Recall               0.776119403
F1_Score             0.837865055
Log_Loss             0.044203302
MCC                  0.834305991
Balanced_Accuracy    0.886409404
Confusion_Matrix_TP  416
Confusion_Matrix_FP  41
Confusion_Matrix_FN  120
Confusion_Matrix_TN  12381
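
(As a sanity check on the confusion matrix: recall = TP / (TP + FN) = 416 / (416 + 120) ≈ 0.776, matching the Recall row, so roughly 22% of the actual positives are being missed.)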

Does anyone have suggestions for improving recall or other techniques for reducing false negatives in this scenario? I tried applying SMOTE to handle the class imbalance, but it only made performance worse, so given the preprocessing steps above I'm now looking for alternative methods. One direction I've been considering is sketched below.
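
Specifically, rather than resampling, I could keep the model as-is and lower the default 0.5 decision threshold on the predicted probabilities, trading some precision for recall. A minimal sketch of what I mean (assuming model_pipeline has been fit on the training split; the threshold grid is illustrative):

    from sklearn.metrics import precision_score, recall_score

    # Probability of the positive class on the validation split
    validation_scores = model_pipeline.predict_proba(validation_features)[:, 1]

    # Sweep a few cutoffs below the default 0.5 and watch the precision/recall trade-off
    for threshold in [0.5, 0.4, 0.3, 0.2]:
        predictions = (validation_scores >= threshold).astype(int)
        print(f"threshold={threshold:.1f}  "
              f"precision={precision_score(validation_labels, predictions):.3f}  "
              f"recall={recall_score(validation_labels, predictions):.3f}")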

Any advice would be appreciated!

  • Especially (but not only!) in "unbalanced" cases, recall suffers from the exact same problems as accuracy, see here. It is simply a highly problematic KPI. I understand that you won't be able to have the KPI changed, so the question simply is: what is the target KPI you are judged on? Then use precisely this KPI as a loss function. If you only care about minimizing false negatives, the solution is simple: just classify everything as positive. Zero FN, and you are done. So: what do you want to optimize? – Stephan Kolassa Jun 30 '23 at 15:13
  • I second the previous suggestion: do not bother with any fancy modeling like XGBoost and just predict everything to be the positive outcome, guaranteeing perfect recall (yes, perfect recall). If this is problematic, perhaps you can say why. – Dave Jun 30 '23 at 15:17
  • To be more clear: I trained the model and got a prediction score of 96.5 percent. I want to improve that score, and in previous attempts lowering the FP and FN counts improved the final result, so is there any parameter or method that would help me achieve this goal? @Dave – knight5478 Jun 30 '23 at 15:25
  • 1
    Yes, lowering the number of false negatives and false positives (mistakes) will improve the accuracy. Do you want to improve the accuracy or the recall? Also, have you read the link about why both accuracy and recall are surprisingly problematic? – Dave Jun 30 '23 at 15:29
  • @Dave just reading, thank you – knight5478 Jun 30 '23 at 15:48

0 Answers