
Say that I have two learning methods for a classification problem, $A$ and $B$, and that I estimate their generalization performance with something like repeated cross validation or bootstrapping. From this process I get a distribution of scores $P_A$ and $P_B$ for each method across these repetitions (e.g. the distribution of ROC AUC values for each model).
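For concreteness, a minimal sketch of what I mean (the dataset and the two classifiers are just placeholders, assuming scikit-learn):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data and models; substitute your own.
X, y = make_classification(n_samples=500, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)

# P_A and P_B: one ROC AUC value per train/test partition.
P_A = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")
P_B = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="roc_auc")

print(P_A.mean(), P_A.std())
print(P_B.mean(), P_B.std())
```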

Looking at these distributions, it could be that $\mu_A > \mu_B$ but $\sigma_A > \sigma_B$ (i.e. the expected generalization performance of $A$ is higher than that of $B$, but there is more uncertainty about this estimate).

I think this is called the bias-variance dilemma in regression.

What mathematical methods can I use to compare $P_A$ and $P_B$ and eventually make an informed decision about which model to use?

Note: For the sake of simplicity, I am referring to two methods $A$ and $B$ here, but I am interested in methods that can be used to compare the distribution of scores of ~1000 learning methods (e.g. from a grid search) and eventually make a final decision about which model to use.

  • I think the term bias-variance tradeoff does not apply here, because you are not decomposing a mean squared error into a bias and a variance, and you are not talking about the variance of an estimator but about the variance of a score. – Lucas Jul 21 '13 at 10:48
  • Thanks @Lucas. I am trying to estimate the score of my classifiers $A$ and $B$ on unseen data. For this, I thought I could take the mean of scores on seen data as my estimators (i.e. $E(P_A)$ and $E(P_B)$ for $A$ and $B$ respectively). Is the variance of these estimators different from the variance of the scores $P_A$ and $P_B$? – Amelio Vazquez-Reina Jul 21 '13 at 16:29
  • @user815423426 I think the comparison depends on the loss function you have. Diebold and Mariano (2002) have a nice paper studying your question. They proposed some statistical tests comparing the "generalization" performance. I don't know how to set up a link in comments. The paper is: Diebold, Francis X., and Robert S. Mariano. "Comparing Predictive Accuracy." Journal of Business & Economic Statistics 20.1 (2002): 134-144. – semibruin Jul 22 '13 at 03:24

1 Answer


If there are only two methods, A and B, I would calculate the probability that, for an arbitrary training/test partition, the error (according to some suitable performance metric) for model A is lower than the error for model B. If this probability were greater than 0.5, I'd choose model A and otherwise model B (cf. the Mann-Whitney U test). However, I strongly suspect this will end up choosing the model with the lower mean error unless the distributions of the performance statistic are very non-symmetric.
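Something along these lines (a sketch only; the score arrays below are simulated placeholders for the per-partition scores described in the question, and the paired "win" fraction is just the observed proportion):

```python
import numpy as np
from scipy import stats

# Placeholder score arrays; substitute the per-partition scores (e.g. ROC AUC)
# obtained from repeated cross-validation for methods A and B.
rng = np.random.default_rng(0)
P_A = rng.normal(0.85, 0.03, size=100)
P_B = rng.normal(0.83, 0.05, size=100)

# Probability that A outperforms B on an arbitrary partition
# (here simply the observed fraction, since the scores are paired by partition).
print("P(A beats B) =", np.mean(P_A > P_B))

# Rank-based test of whether one distribution tends to give higher scores
# (cf. the Mann-Whitney U test, which strictly assumes independent samples).
u_stat, p_value = stats.mannwhitneyu(P_A, P_B, alternative="two-sided")
print("U =", u_stat, "p =", p_value)
```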

For a grid search, on the other hand, the situation is a bit different, as you are not really comparing different methods, but instead tuning the (hyper-)parameters of the same model to fit a finite sample of data (in this case indirectly, via cross-validation). I have found that this kind of tuning can be very prone to over-fitting; see my paper:

Gavin C. Cawley, Nicola L. C. Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", Journal of Machine Learning Research, 11(Jul):2079−2107, 2010. (www)

I have a paper in review that shows that it is probably best to use a relatively coarse grid for kernel machines (e.g. SVMs) to avoid over-fitting the model selection criterion. Another approach (which I haven't investigated, so caveat lector!) would be to choose the model with the highest error that is not statistically inferior to the best model found in the grid search (although that may be a rather pessimistic approach, especially for small datasets).
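Something like the following rough sketch of that second approach (illustrative only, echoing the caveat above; the helper name and the use of a paired Wilcoxon signed-rank test are just one possible reading of "not statistically inferior"):

```python
import numpy as np
from scipy import stats

def pick_conservative(scores, alpha=0.05):
    """scores: array of shape (n_models, n_partitions), higher = better.

    Pick the worst-scoring model that cannot be shown (at level alpha,
    paired Wilcoxon test) to be worse than the best-scoring model.
    """
    means = scores.mean(axis=1)
    best = np.argmax(means)
    candidates = [best]
    for m in range(scores.shape[0]):
        if m == best:
            continue
        _, p = stats.wilcoxon(scores[m], scores[best])
        if p >= alpha:  # cannot distinguish model m from the best
            candidates.append(m)
    return min(candidates, key=lambda m: means[m])

# Example with simulated scores for 10 candidate models over 50 partitions.
rng = np.random.default_rng(0)
scores = rng.normal(0.8, 0.05, size=(10, 50))
print("chosen model:", pick_conservative(scores))
```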

The real solution, though, is probably not to optimise the parameters using grid search, but to average over the parameter values, either in a Bayesian approach or just as an ensemble method. If you don't optimise, it is more difficult to over-fit!
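For the ensemble variant, a minimal sketch (an SVM and an arbitrary grid over C as placeholders; a Bayesian treatment would weight the members by their evidence rather than averaging them uniformly):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data; substitute your own.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Instead of picking a single C by grid search, average the predicted
# probabilities over a (coarse) grid of hyper-parameter values.
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
probs = np.mean(
    [SVC(C=C, probability=True, random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
     for C in grid],
    axis=0,
)
print("ensemble ROC AUC:", roc_auc_score(y_te, probs))
```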

Dikran Marsupial
  • Thanks Dikran. When you say "average over the parameter values", I think I understand how to do this through an ensemble method (e.g. building the ensemble output as the average of the classifier outputs), but I am not sure how to do this with a Bayesian approach when working with a discriminative model. I understand the theory of a fully Bayesian approach (i.e. avoid point estimates, and marginalize out the parameters to build the final posterior), but, assuming that my prior on the parameters is uniform, wouldn't this be equivalent to building the averaging ensemble? – Amelio Vazquez-Reina Oct 16 '13 at 13:54
  • In the Bayesian approach, the models would be weighted by their marginal likelihood (i.e. Bayesian evidence) and any prior placed over the hyper-parameters, so it would be a special case of averaging over an ensemble with a particular method for weighting the models. – Dikran Marsupial Oct 18 '13 at 12:20