I'm participating in a binary disease-classification competition and I'm using an XGBoost model to classify my data. However, I'm getting more false negatives than I'd like, which is dragging down my model's metrics.
To give you a better understanding, here is part of the preprocessing and modeling code:
```python
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

def preprocess_data(personal_info_train, measurements_train, personal_info_test, measurements_test):
    # ... data merging, handling missing values, encoding categorical columns, etc. ...
    ...

# Training and evaluation function
def train_and_evaluate(training_data, target_labels, categorical_columns):
    preprocessor = make_column_transformer(
        (OneHotEncoder(), categorical_columns),
        remainder='passthrough'
    )
    training_features, validation_features, training_labels, validation_labels = train_test_split(training_data, target_labels, test_size=0.2)
    # Ratio of negatives to positives, used to up-weight the positive class
    scale_pos_weight = len(training_labels[training_labels == 0]) / len(training_labels[training_labels == 1])
    model_pipeline = make_pipeline(
        preprocessor,
        XGBClassifier(n_jobs=2, learning_rate=0.012, n_estimators=1800, max_depth=14, min_child_weight=2,
                      gamma=0.1, subsample=0.7, colsample_bytree=0.8, reg_lambda=0.8, reg_alpha=0.8,
                      random_state=42, scale_pos_weight=scale_pos_weight)
    )
    # ... fitting and metric computation omitted; see the repository ...
```
(For the full code, you can refer to my GitHub repository.)
I've been adjusting hyperparameters, but that hasn't reduced the false negatives much. Given this, I suspect the problem is related to the class imbalance, which I tried to handle using the scale_pos_weight parameter.
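To make "adjusting hyperparameters" concrete, this is roughly the kind of search I've been running over the pipeline above (the grid and scoring here are illustrative rather than my exact setup; the real search is in the repo), scored on recall since that's the metric I'm trying to improve:

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search ranges -- the actual ones live in the repo.
param_distributions = {
    'xgbclassifier__max_depth': randint(4, 15),
    'xgbclassifier__learning_rate': uniform(0.005, 0.1),
    'xgbclassifier__subsample': uniform(0.6, 0.4),
    'xgbclassifier__min_child_weight': randint(1, 10),
}

search = RandomizedSearchCV(
    model_pipeline,            # preprocessing + XGBClassifier pipeline from above
    param_distributions,
    n_iter=30,
    scoring='recall',          # optimize for the metric I actually care about
    cv=5,
    n_jobs=-1,
    random_state=42,
)
search.fit(training_features, training_labels)
print(search.best_params_, search.best_score_)
```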
Here are some performance metrics I've been getting:
| Metric | Value |
|---|---|
| Train_AUC | 1 |
| Validation_AUC | 0.980784724 |
| Accuracy | 0.987575243 |
| Precision | 0.910284464 |
| Recall | 0.776119403 |
| F1_Score | 0.837865055 |
| Log_Loss | 0.044203302 |
| MCC | 0.834305991 |
| Balanced_Accuracy | 0.886409404 |
| Confusion_Matrix_TP | 416 |
| Confusion_Matrix_FP | 41 |
| Confusion_Matrix_FN | 120 |
| Confusion_Matrix_TN | 12381 |
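To spell out where the low recall comes from: precision and recall follow directly from the confusion-matrix counts above, and the 120 false negatives against only 536 actual positives are what's dragging recall down.

```python
# Quick sanity check: precision and recall derived from the confusion-matrix counts above.
tp, fp, fn, tn = 416, 41, 120, 12381

precision = tp / (tp + fp)   # 416 / 457  ~ 0.910
recall    = tp / (tp + fn)   # 416 / 536  ~ 0.776
print(precision, recall)
```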
Does anyone have suggestions for improving recall or reducing false negatives in this scenario? I tried applying SMOTE to handle the class imbalance, but it only made performance worse, so given the preprocessing steps I've already taken, I'm now looking for alternative ways to improve recall and cut down the false negatives. If you've dealt with a similar situation, I'd love to hear how you handled it.
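For context, my SMOTE attempt looked roughly like this (a sketch rather than the exact code from the repo), using imblearn's pipeline so the oversampling is applied only to the training data, after preprocessing:

```python
# Rough sketch of the SMOTE experiment (illustrative; the real code is in the repo).
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as make_imb_pipeline

smote_pipeline = make_imb_pipeline(
    preprocessor,                 # same column transformer as above
    SMOTE(random_state=42),       # oversample the minority (positive) class in the training folds only
    XGBClassifier(n_jobs=2, learning_rate=0.012, n_estimators=1800, max_depth=14,
                  random_state=42),  # scale_pos_weight dropped since SMOTE rebalances the classes
)
smote_pipeline.fit(training_features, training_labels)
```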
Any advice would be appreciated!