What metric should we optimize for during hyperparameter tuning?
From what I gather from Frank Harrell's article and other related questions here (Reduce Classification Probability Threshold), classification should be viewed as two problems: a probability-prediction problem and a decision problem. Building a model that provides good discrimination is a statistical problem, whereas choosing a threshold to assign labels is a decision problem that depends on the requirements of the decision maker using the model. If we want high recall, a lower threshold is better, while high precision demands a higher threshold.
Since this is the case, does it ever make sense to optimize threshold-dependent metrics such as precision, recall, accuracy, and F1 score during hyperparameter tuning? (For example, scikit-learn's various CV tuning methods let us choose from a variety of scoring options.)
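To make the question concrete, this is roughly what I have in mind (a minimal sketch; the data, estimator, and parameter grid are just placeholders, only the `scoring` argument matters here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

# scikit-learn lets us choose the tuning objective via `scoring`.
# 'f1', 'precision', 'recall', and 'accuracy' all score the hard labels from
# predict(), i.e. an implicit 0.5 probability cutoff, whereas 'roc_auc' and
# 'average_precision' score the predicted probabilities directly.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="f1",  # a threshold-dependent choice
    cv=5,
)
search.fit(X, y)
```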
By optimizing for metrics that depend on a prediction threshold during hyperparameter tuning, aren't we explicitly baking decision-making assumptions into the model-development stage?
If we should only be concerned with discriminatory strength during model development, then threshold-free metrics such as ROC AUC and PR AUC should be the (only?) appropriate metrics to optimize during hyperparameter tuning. Is this idea correct?
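In other words, I am imagining a two-step workflow like the following (a rough sketch under my own assumptions; the data, estimator, grid, and the 0.3 cutoff are placeholders, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Step 1 (statistical): tune hyperparameters on a threshold-free metric.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",  # or "average_precision" for the PR curve
    cv=5,
)
search.fit(X_train, y_train)

# Step 2 (decision): choose a cutoff for the deployed decision rule,
# based on the decision maker's costs of false positives vs. false negatives.
probs = search.predict_proba(X_test)[:, 1]
cutoff = 0.3  # placeholder; in practice derived from the cost structure
labels = (probs >= cutoff).astype(int)
```

Here the hyperparameter search never sees a threshold; the cutoff only enters when the probabilities are turned into actions.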
"does it ever make sense to optimize for threshold dependent metrics": If your bonus getting an extra digit depends on a threshold-dependent metric, I could see an incentive to do so. – Dave Jan 21 '24 at 04:58