I'm training an Elastic Net model on a small dataset with about 100 TRUE outcomes and 15 FALSE outcomes. I've been using AUC to compare models but I'm worried this metric is unstable because some bootstrapped subsamples only have 4 FALSE outcomes in the test set. Is there another metric that would be more appropriate here?
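For context, here's roughly how I'm measuring the instability (an illustrative sketch with simulated scores, not my actual model output; the class sizes match my data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Simulated setup: 100 TRUE (positive) and 15 FALSE (negative) outcomes,
# with continuous model scores that separate the classes imperfectly
y = np.array([1] * 100 + [0] * 15)
scores = np.where(y == 1,
                  rng.normal(0.7, 0.2, size=y.size),
                  rng.normal(0.4, 0.2, size=y.size))

# AUC over bootstrap resamples to see how much it jumps around
aucs = []
for _ in range(200):
    idx = rng.choice(y.size, size=y.size, replace=True)
    if len(np.unique(y[idx])) < 2:
        continue  # AUC is undefined when a resample has only one class
    aucs.append(roc_auc_score(y[idx], scores[idx]))

print(f"AUC mean={np.mean(aucs):.3f}, sd={np.std(aucs):.3f}, "
      f"min={np.min(aucs):.3f}, max={np.max(aucs):.3f}")
```

The spread between min and max is what worries me: with only a handful of negatives in a resample, a few borderline cases swing the AUC a lot.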
Edit: My Elastic Net model returns numerical (continuous) predictions, not class labels.