
What metric should we optimize for during hyperparameter tuning?

From what I gather from Frank Harrell's article and other related questions here (Reduce Classification Probability Threshold), classification should be viewed as both a probability-prediction problem and a decision problem. Building a model with good discriminatory strength is a statistical problem, whereas choosing a threshold to assign labels is a decision problem that depends on the requirements of the decision maker who uses the model. If we want high recall, a lower threshold is better, while high precision demands a higher threshold.

Since this is the case, does it ever make sense to optimize for threshold-dependent metrics such as precision, recall, accuracy, and F1 score during hyperparameter tuning? (E.g., in sklearn's various CV tuning methods we can choose from a variety of scoring options.)
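To make the distinction concrete, here is a minimal sketch of how the scoring choice enters one of sklearn's CV tuners (the random forest, toy data, and parameter grid below are arbitrary placeholders, not a recommended setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy imbalanced data and a placeholder estimator/grid, just to show the scoring choice.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
param_grid = {"max_depth": [3, 5, None]}

# Threshold-dependent scoring: "f1" scores hard labels from predict(),
# i.e. it bakes in the default 0.5 cutoff.
search_f1 = GridSearchCV(RandomForestClassifier(random_state=0),
                         param_grid, scoring="f1", cv=5)

# Threshold-free scoring: "roc_auc" / "average_precision" score the ranked
# probabilities from predict_proba() instead of hard labels.
search_auc = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, scoring="average_precision", cv=5)
```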

By optimizing for metrics that depend on a prediction threshold during hyperparameter tuning, aren't we explicitly baking decision-making assumptions into the model-development stage?

If we should only be concerned with discriminatory strength during model development, then metrics that do not depend on a threshold, such as AUC-ROC and AUC-PRC, should be the (only?) appropriate metrics to optimize during hyperparameter tuning. Is this idea correct?
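To illustrate the alternative I have in mind, continuing the sketch above: tune on a threshold-free metric first, then treat the cutoff as a separate decision step. The validation split and the F1-maximizing rule here are just illustrative assumptions; any other decision rule could be substituted.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Stage 1 (statistical): tune hyperparameters on a threshold-free metric.
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)
search_auc.fit(X_train, y_train)                      # scoring="average_precision"
probs = search_auc.predict_proba(X_val)[:, 1]

# Stage 2 (decision): choose the cutoff separately, here by maximizing F1
# on the validation set.
prec, rec, thresholds = precision_recall_curve(y_val, probs)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]
labels = (probs >= best_threshold).astype(int)
```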

  • "Does it ever make sense to optimize for threshold-dependent metrics?" If your bonus getting an extra digit depends on a threshold-dependent metric, I could see an incentive to do so. – Dave Jan 21 '24 at 04:58
  • @Dave, does it actually work better in practice even if you are only interested in a specific threshold-dependent metric? E.g., if I'm only interested in the F1 score, would a hyperparameter-tuned model optimizing for F1 on validation actually perform better than a model tuned by optimizing PRC-AUC and then selecting the best threshold for F1 based on validation? I suspect the PRC-AUC method with proper thresholding would still work better on unseen test sets even if we care only about F1. – Tianxun Zhou Jan 21 '24 at 06:27
  • "By optimizing for metrics that depends on prediction thresholding during hyperparameter tuning, isn't this explicitly baking in the decision making assumptions in at the model development stage?" This is precisely the argument I made in my answer to the "threshold" question. I don't know whether I can add anything beyond that. However, you seem to be asking a different question in your comment, no? – Stephan Kolassa Jan 21 '24 at 10:11

0 Answers