The KPI (Key Performance Indicator) depends on the requirements of the application. For some applications (i.e. those where a hard classification must be made and we know a priori that the misclassification costs are equal, e.g. some handwritten character recognition tasks), accuracy is a completely reasonable performance metric, and it would be a mistake to recommend avoiding it just because it has problems as well as advantages.
Similarly, for some applications (primarily information retrieval) it is more natural to talk of the relative importance of precision and recall than of misclassification costs. There $F_1$, or more generally $F_\beta$, may be appropriate, especially where we need to make a decision ("do I read this article, or don't I?").
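For reference, the usual definition of $F_\beta$ in terms of precision $P$ and recall $R$ (symbols I am introducing here, not taken from the question) is

$$ F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R} $$

so $\beta = 1$ gives the harmonic mean $F_1 = 2PR/(P + R)$, and $\beta > 1$ puts more weight on recall than on precision.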
An important consideration is whether we need to make a decision. We may well implement the system using a probabilistic classifier and then apply a threshold. However, if we need a decision, then the performance of the system depends on the setting of that threshold, so we should use a performance metric that depends on the threshold, so that its effect on the system is included in the evaluation.
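To illustrate the point about the threshold, here is a minimal sketch in plain Python. The labels, the probabilities and the function name `accuracy_at` are all made up for illustration; the only thing it is meant to show is that the same probabilistic outputs give different decision quality at different thresholds.

```python
def accuracy_at(y_true, p_hat, threshold):
    """Accuracy of the hard decisions obtained by thresholding p_hat."""
    correct = 0
    for y, p in zip(y_true, p_hat):
        decision = 1 if p >= threshold else 0  # hard classification
        if decision == y:
            correct += 1
    return correct / len(y_true)


# Toy data (hypothetical): true labels and predicted P(y = 1).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
p_hat = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

# Same model outputs, three different thresholds, three different accuracies.
results = {t: accuracy_at(y_true, p_hat, t) for t in (0.05, 0.5, 0.85)}
for t, acc in results.items():
    print(f"threshold={t}: accuracy={acc}")
```

On this toy data the accuracy is 0.5, 0.75 and 0.625 at thresholds 0.05, 0.5 and 0.85 respectively, even though the underlying probability estimates never change.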
The advice I would give is not to have a single KPI, but to have a range of performance metrics that provide information on different aspects of classifier performance. I quite often use accuracy (or its equivalent, the expected risk, where misclassification costs are unequal) to measure the quality of the decisions, the area under the receiver operating characteristic curve to measure the ranking of samples, and the cross-entropy (or similar) to measure the calibration of the probability estimates.
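A rough sketch of why these three metrics complement each other, again in plain Python with made-up data. The helper names (`accuracy`, `auroc`, `cross_entropy`) are my own, and the second set of probabilities is just the first one shrunk towards 0.5, so the decisions at a 0.5 threshold and the ranking are unchanged and only the probability values differ.

```python
import math


def accuracy(y_true, p_hat, threshold=0.5):
    """Fraction of correct hard decisions at the given threshold."""
    hits = sum((p >= threshold) == (y == 1) for y, p in zip(y_true, p_hat))
    return hits / len(y_true)


def auroc(y_true, p_hat):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


def cross_entropy(y_true, p_hat, eps=1e-12):
    """Mean negative log-likelihood of the labels under p_hat."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Toy data (hypothetical).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
p_a = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
# Same ranking and same side of 0.5, but shrunk towards 0.5.
p_b = [0.5 + 0.2 * (p - 0.5) for p in p_a]

for name, p in (("a", p_a), ("b", p_b)):
    print(name, accuracy(y_true, p), auroc(y_true, p), cross_entropy(y_true, p))
```

Both sets of outputs have the same accuracy and the same AUROC (0.75 each on this data), but different cross-entropy, so a single one of these numbers would not tell the two models apart.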
Basically, our job as statisticians is to understand the advantages and disadvantages of performance metrics so that we can select the appropriate metric(s) for the needs of the application. All metrics have advantages and disadvantages, and we shouldn't reject any of them a priori because of their disadvantages if they have advantages or relevance for our application. I think the advantages and disadvantages are well covered in textbooks (even ML ones! ;o), so I would just use those.
Also, as I have said elsewhere, we should make a distinction between performance estimation and model selection. They are not the same problem, and sometimes we should have different metrics for each task.
$$ \begin{pmatrix} 4 & 1\\ 2 & 3 \end{pmatrix} \rightarrow \begin{pmatrix} 0.4 & 0.1\\ 0.2 & 0.3 \end{pmatrix} $$
– Dave Jan 31 '23 at 20:03