Say that I have two learning methods for a classification problem, $A$ and $B$, and that I estimate their generalization performance with something like repeated cross-validation or bootstrapping. From this process I get a distribution of scores for each method across these repetitions, $P_A$ and $P_B$ (e.g. the distribution of ROC AUC values for each model).
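To make the setup concrete, here is a minimal sketch of how I obtain these score distributions, assuming scikit-learn; the dataset and the two estimators below are just placeholders for the real problem:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data; in practice X, y come from the actual problem.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Two candidate learning methods A and B (placeholders).
model_A = RandomForestClassifier(random_state=0)
model_B = LogisticRegression(max_iter=1000)

# Repeated stratified CV yields a distribution of ROC AUC values per method.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores_A = cross_val_score(model_A, X, y, scoring="roc_auc", cv=cv)  # P_A
scores_B = cross_val_score(model_B, X, y, scoring="roc_auc", cv=cv)  # P_B

print(f"A: mean={scores_A.mean():.3f}, std={scores_A.std():.3f}")
print(f"B: mean={scores_B.mean():.3f}, std={scores_B.std():.3f}")
```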
Looking at these distributions, it could be that $\mu_A \ge \mu_B$ but also $\sigma_A \ge \sigma_B$ (i.e. the expected generalization performance of $A$ is higher than that of $B$, but there is more uncertainty about this estimate).
I think this is analogous to the bias-variance dilemma in regression.
What mathematical methods can I use to compare $P_A$ and $P_B$ and eventually make an informed decision about which model to use?
Note: For the sake of simplicity, I am referring to two methods $A$ and $B$ here, but I am interested in approaches that scale to comparing the score distributions of ~1000 learning methods (e.g. the candidates from a grid search) so that I can eventually make a final decision about which model to use.
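For the multi-model case, this is a sketch of the kind of output I am working with: `GridSearchCV` with repeated CV exposes the per-split scores for every candidate via `cv_results_` (the grid and the estimator are just stand-ins, and my real grid is much larger):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hypothetical grid; in my real problem there are ~1000 candidates.
param_grid = {"C": [0.01, 0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

search = GridSearchCV(SVC(), param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

# One row of per-split ROC AUC scores per candidate: the distributions to compare.
res = search.cv_results_
split_cols = [k for k in res if k.startswith("split") and k.endswith("_test_score")]
score_matrix = np.vstack([res[c] for c in split_cols]).T  # (n_candidates, n_splits)

print(score_matrix.shape)
print(res["mean_test_score"], res["std_test_score"])
```

So the question, scaled up, is how to turn this matrix of score distributions into a defensible choice of a single model.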