
I'm trying to do threshold moving to find an appropriate decision threshold for an imbalanced dataset. I have a 1D time series to which I am applying a binary transformer-based classifier. My splits are:

Training set:
  Total samples: 8133
  Number of 0s: 6930 (85.21%)
  Number of 1s: 1203 (14.79%)

Validation set:
  Total samples: 904
  Number of 0s: 770 (85.18%)
  Number of 1s: 134 (14.82%)

Test set:
  Total samples: 232
  Number of 0s: 198 (85.34%)
  Number of 1s: 34 (14.66%)
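These splits come from two stratified train_test_split calls, roughly as in the minimal sketch below (variable names and the random seed are illustrative):

from sklearn.model_selection import train_test_split

# Split off the test set first, then split the remainder into train/validation.
# stratify preserves the ~85/15 class ratio in every split; the fixed
# random_state makes the splits reproducible.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=232, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=904, stratify=y_rest, random_state=42)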

NOTE: I am trying to fix the threshold on the VALIDATION SET. I've seen a lot of people use the test set for this, but I think that would bias the classifier toward the test set. Please correct me if I am wrong here.

As sketched above, I used sklearn's train_test_split with stratify=y and a fixed random_state to get reproducible splits. Now the problem is that I am getting wildly varying "best thresholds" from the ROC curve, even though the area under it is approximately the same. I'm running two experiments:

  1. Using different training methods: (a) normal training (no weighting of the less frequent class), (b) class_weight based training (sketch after this list), (c) SMOTE on the training set only
  2. Training for varying numbers of epochs
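For (b), class weighting means upweighting the positive class in the loss. A PyTorch-style sketch of what I mean (assuming a BCEWithLogitsLoss-based setup, which may differ from your framework; the weight is just the negative/positive ratio of my training set):

import torch
import torch.nn as nn

# Upweight the positive (minority) class by the negative/positive ratio,
# 6930 / 1203 ≈ 5.76, so both classes contribute comparably to the loss.
pos_weight = torch.tensor([6930 / 1203])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)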

In (1), I'm getting vastly different results that I can't understand. For example, normal training gives a "best threshold" of 0.01, the class_weights approach gives 0.45, and SMOTE gives 0.02.

This huge variation is something I don't understand. Note that SMOTE is applied only to the training set, which gives the following distribution:

Training set:
  Total samples: 13860
  Number of 0s: 6930 (50.00%)
  Number of 1s: 6930 (50.00%)

Validation set:
  Total samples: 904
  Number of 0s: 770 (85.18%)
  Number of 1s: 134 (14.82%)

Test set:
  Total samples: 232
  Number of 0s: 198 (85.34%)
  Number of 1s: 34 (14.66%)
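The oversampling itself happens after splitting, roughly like this (a sketch assuming imblearn; the validation and test sets are never resampled):

from imblearn.over_sampling import SMOTE

# Resample only the training split; validation and test keep their
# natural ~85/15 class ratio.
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)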

Even in the epoch-variation experiment, using the class_weights approach throughout and training for 10, 20, 40, 60, 80, and 100 epochs, I get a wide range of "best thresholds" from the ROC analysis on the validation set, anywhere from 0.002 to 0.7, and I can't work out where the problem is. Is there a flaw in the logic of how I'm doing things? The following is the ROC-AUC code I'm using:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc_auc_curve(y_true, y_pred_probs, best_threshold):
    # Compute ROC curve and ROC AUC
    fpr, tpr, thresholds = roc_curve(y_true, y_pred_probs)
    roc_auc = auc(fpr, tpr)

    # Plot ROC curve with the chance diagonal for reference
    plt.plot(fpr, tpr, lw=1, label='ROC (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')

    # Mark and annotate the chosen threshold on the curve
    mask = thresholds == best_threshold
    plt.plot(fpr[mask], tpr[mask], 'ko', label='Best Threshold')
    plt.text(fpr[mask][0], tpr[mask][0], f'Best Threshold: {best_threshold:.2f}')

    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.show()

    return roc_auc, thresholds

def get_best_threshold(y_true, y_pred_probs):
    fpr, tpr, thresholds = roc_curve(y_true, y_pred_probs)
    best_threshold = thresholds[np.argmax(np.abs(tpr - fpr))]
    return best_threshold
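For completeness, this is roughly how I use these on the validation set (sketch; y_val_probs are the model's predicted probabilities for class 1):

from sklearn.metrics import f1_score

best_threshold = get_best_threshold(y_val, y_val_probs)
roc_auc, _ = plot_roc_auc_curve(y_val, y_val_probs, best_threshold)

# Binarize at the chosen threshold and score it
y_val_pred = (y_val_probs >= best_threshold).astype(int)
print('F1 at best threshold:', f1_score(y_val, y_val_pred))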

I'm attaching a few screenshots of the ROC curves with the best threshold plotted, as produced by this code. Please help me debug/interpret the results.

[Screenshots: three ROC curves, each with the selected best threshold marked]

EDIT: When I set the threshold to 0.01 (for the vanilla case), as suggested by the code, and re-evaluate on the VALIDATION set, I get a lower F1 score than with the default threshold of 0.5.
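Since F1 is what I actually want to optimize, I also tried picking the threshold that maximizes F1 directly on the validation set, along the lines of the sketch below; I'm unsure whether this is a sounder selection criterion than maximizing TPR - FPR:

import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_val, y_val_probs)
# precision and recall have one more entry than thresholds, so drop the
# final (threshold-less) point before computing F1 per threshold
f1_per_threshold = (2 * precision[:-1] * recall[:-1]
                    / (precision[:-1] + recall[:-1] + 1e-12))
best_f1_threshold = thresholds[np.argmax(f1_per_threshold)]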

  • Welcome to Cross Validated. 1) Why are you using SMOTE? // 2) Do you have enough data to justify the use of a complex model like the transformer-based deep learning you're using? Based on your ROC curves, which have visually evident edges instead of smooth shapes, you seem to have a fairly small number of observations. // 3) What do you want to optimize when you tune the threshold? – Dave Dec 18 '23 at 13:15
  • @Dave 1) SMOTE was to try to combat the class imbalance. 2) Yes, there have been research papers on this data using transformers; the number of observations in the dataset is mentioned in the post (divided into training, validation, and test sets). 3) I want to optimize the F1 score. I'm wondering why the best threshold is so variable in this case. – Techie5879 Dec 18 '23 at 13:30
  • What issue does the class imbalance pose that requires SMOTE? – Dave Dec 18 '23 at 14:16
  • @Dave I tried SMOTE because I figured I didn't have enough examples of the minority class, as the model was learning to predict all 0s – Techie5879 Dec 18 '23 at 14:49
  • Learning all 0s according to what threshold? That you have ROC curves tells me that your model is predicting values on a continuum instead of predicting discrete categories. Thus, how do you get the predictions that are always 0? // Related // Related, too – Dave Dec 20 '23 at 15:21
