How likely is it that our model is better than random in the upper corner of the ROC curve?

Question

We're using forest-based models in a personnel selection context. For a dataset with 57 features, 230 observations, and a binary outcome, we got the following ROC curves.

This shows the first 6 folds of a 14-fold cross-validation on the dataset. To me, it looks like good performance from the model on 5 out of 6 folds. We're primarily interested in filtering out the bottom 30%, so in 5 out of 6 folds, the model was able to reduce the group by 30% without sending home qualified individuals.

How likely would this result be on real data? In other words, what happens if we fit the model on the full dataset and use it to predict on new data? Does anyone here know of research on this problem, or of a strategy to simulate it?

Note that we did interpret the model and we do not see much reason to believe that our model will not generalize to new data. We've used the interpretable SIRUS algorithm (https://github.com/rikhuijzer/StableTrees.jl).

score 1 · Answer 1 · answered Jul 25 '22 at 18:10

After you have built the model on the full original data set, you can estimate its generalizability via bootstrapping. The idea, under the bootstrap principle, is that repeated bootstrap sampling from your original data set mimics the process of taking your original data set from the underlying population. Seeing how well models based on bootstrapped samples work on your full data set estimates how well your original model would work on new data from the population.

Generate a model for each of multiple bootstrap samples from the original data; repeat the entire process on each bootstrap sample, including parameter tuning for growing the forest. Then evaluate the performance of each such model on the entire original data set. You can consider each bootstrap sample as its own "training set" and use the original data set as the "test set" for all the models.

For each bootstrap-based model, evaluate how much better it performs on its own "training set" than on the shared "test set." That provides an estimate of the "optimism" in your modeling process. Average the optimism over all of the bootstrap-based models.

Then, to estimate the performance of your original model on new data, you can correct the nominal performance of the original model by that estimate of optimism. That's the optimism bootstrap, described in several pages on this site. Demetri Pananos recommends that here for a binary outcome model, with a link to a blog illustrating it. You might want to consider a measure of performance other than AUC, but the principle of the optimism bootstrap applies to whatever measure you prefer.
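
To make the steps above concrete, here is a minimal sketch of the optimism bootstrap in plain Python. Everything in it is an assumption for illustration: the data are synthetic (only the shape, 230 observations by 57 features, is borrowed from the question), `fit_centroid` is a toy stand-in for your real pipeline, and the names `auc`, `fit_centroid` and `optimism_bootstrap` are hypothetical. In practice the `fit` argument should wrap your entire modeling process, including any tuning, so that it is repeated on every bootstrap sample.

```python
import random

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fit_centroid(X, y):
    """Toy model (NOT SIRUS): score a row by projecting it on the difference of class means."""
    d = len(X[0])
    def col_means(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    mu1 = col_means([x for x, t in zip(X, y) if t == 1])
    mu0 = col_means([x for x, t in zip(X, y) if t == 0])
    w = [a - b for a, b in zip(mu1, mu0)]
    # Return a scoring function that maps rows to real-valued scores.
    return lambda rows: [sum(wj * xj for wj, xj in zip(w, r)) for r in rows]

def optimism_bootstrap(X, y, fit, n_boot=200, seed=0):
    """Return (apparent AUC, mean optimism, optimism-corrected AUC)."""
    rng = random.Random(seed)
    n = len(y)
    # Apparent performance: model fit on all data, scored on the same data.
    apparent = auc(fit(X, y)(X), y)
    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # skip resamples that lost one of the two classes
        model = fit(Xb, yb)
        # Optimism: performance on its own "training set" minus the shared "test set".
        optimism.append(auc(model(Xb), yb) - auc(model(X), y))
    mean_opt = sum(optimism) / len(optimism)
    return apparent, mean_opt, apparent - mean_opt

# Synthetic example: one weakly informative feature, 56 pure-noise features.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(57)] for _ in range(230)]
y = [1 if row[0] + rng.gauss(0, 1) > 0 else 0 for row in X]

apparent, mean_opt, corrected = optimism_bootstrap(X, y, fit_centroid, n_boot=50)
print("apparent AUC:", round(apparent, 3))
print("mean optimism:", round(mean_opt, 3))
print("optimism-corrected AUC:", round(corrected, 3))
```

The same skeleton works for any other performance measure: replace `auc` with, say, a function that reports the share of candidates you can screen out at a threshold that keeps every qualified individual, and the optimism correction is computed the same way.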
