
I am working on a scoring model and aim to predict the probability of default. I have, say, $m$ different candidate Logistic Regression models $M_{1}, \dots, M_{m}$, and I would like to choose the best one for predicting this probability. Assume that the data set is moderately large.

My approach is the following:

1) Randomly split the data set into Train and Validation Sets, say in proportion 80/20 without replacement.

2) Train each Logistic Regression model $M_{1}, \dots, M_{m}$ on the Train Set and compute the areas under the ROC curve $AUC_{1}, \dots, AUC_{m}$ on the Validation Set.

3) Re-split the data and compute new $AUC_{1}, \dots, AUC_{m}$, repeating this many times. (This is basically Monte Carlo cross-validation.)

Then I am thinking of making boxplots of $AUC_{1}, \dots, AUC_{m}$ across the splits and choosing the model $M_{i}$ that performs "better" according to the boxplots. (A rough sketch of this procedure is given below.)
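Roughly, I imagine the procedure as in the sketch below (Python with scikit-learn; the column names, the `default` target, and the candidate feature sets are made up for illustration, and the toy data stand in for my real data set):

```python
# Minimal sketch of the repeated-split ("Monte Carlo cross-validation") procedure.
# The DataFrame, column names, and candidate feature subsets are hypothetical.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)                      # toy data standing in for the real set
df = pd.DataFrame({"age": rng.normal(40, 10, 2000),
                   "income": rng.normal(50, 15, 2000),
                   "loan_amount": rng.normal(20, 5, 2000)})
logit = 0.08 * (df["age"] - 40) - 0.05 * (df["income"] - 50)
df["default"] = (rng.random(2000) < 1 / (1 + np.exp(-logit))).astype(int)

candidate_features = {"M1": ["age", "income"],      # hypothetical model specifications
                      "M2": ["age", "income", "loan_amount"]}

aucs = {name: [] for name in candidate_features}
for seed in range(100):                             # 100 random 80/20 splits
    train, valid = train_test_split(df, test_size=0.2, random_state=seed)
    for name, cols in candidate_features.items():
        model = LogisticRegression(max_iter=1000).fit(train[cols], train["default"])
        p = model.predict_proba(valid[cols])[:, 1]  # predicted P(default)
        aucs[name].append(roc_auc_score(valid["default"], p))

plt.boxplot(list(aucs.values()), labels=list(aucs.keys()))
plt.ylabel("Validation AUC")
plt.show()
```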

Is this a correct way to do it? Can I perform the same evaluation with the Gini index instead? In my opinion it would make sense, but I haven't seen it in the literature. Also, I am intuitively not satisfied with just a single split of the data, because every time we split it we get a rather different result.

KimMik

1 Answer


One major issue with the ROC AUC is that it does not change under monotonic transformations of the predictions. Thus, ROC AUC does not evaluate how accurate the predicted probabilities are (and your goal is to get accurate probabilities), just how well the predictions for the two categories are separated.

Thus, as the comments describe, ROC AUC is probably not the performance metric you want to use.
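To see this concretely, here is a small sketch (assuming scikit-learn and simulated labels) in which a monotonic distortion of the predictions leaves the AUC exactly unchanged, even though the distorted values are clearly no longer calibrated probabilities:

```python
# Sketch: AUC is unchanged by a strictly increasing transformation of the
# predictions, even though the transformed values are no longer sensible probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)                               # simulated 0/1 labels
p = np.clip(0.7 * y + rng.normal(0.15, 0.1, 1000), 0.01, 0.99)  # reasonable probabilities
p_distorted = p ** 5                                            # monotonic distortion

print(roc_auc_score(y, p), roc_auc_score(y, p_distorted))  # identical AUCs
print(y.mean(), p.mean(), p_distorted.mean())  # event rate vs. average predicted probability
```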

However, there are two common performance metrics that do evaluate the probabilities directly: the Brier score and the log loss ("crossentropy loss" in some circles). Both of these are examples of strictly proper scoring rules.

$$ \text{Brier Score} = \dfrac{1}{N}\overset{N}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2\\ \text{Log Loss} = -\dfrac{1}{N}\overset{N}{\underset{i=1}{\sum}}\left( y_i\log(\hat y_i) + (1-y_i)\log(1 - \hat y_i) \right) $$
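Both are readily available, e.g. in scikit-learn; here is a quick sketch on toy labels and probabilities (the arrays are made up for illustration), alongside the hand-computed versions matching the formulas above:

```python
# Sketch: Brier score and log loss for held-out labels and predicted probabilities.
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

y_valid = np.array([0, 0, 1, 1, 1])             # true 0/1 labels (toy example)
p_valid = np.array([0.1, 0.4, 0.35, 0.8, 0.9])  # predicted P(default)

print(brier_score_loss(y_valid, p_valid))       # mean squared error of the probabilities
print(log_loss(y_valid, p_valid))               # negative mean log-likelihood

# The same quantities by hand, matching the formulas above:
print(np.mean((y_valid - p_valid) ** 2))
print(-np.mean(y_valid * np.log(p_valid) + (1 - y_valid) * np.log(1 - p_valid)))
```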

A reasonable criticism of both the Brier score and log loss is that they are difficult to interpret. To that, I respond:

  1. You are just comparing models, where all that matters is the relative performance instead of any absolute sense of performance.

  2. Without a context, it is quite difficult to decide what qualifies as good performance.

  3. Both the Brier score and log loss can be transformed to give values that are analogous to the $R^2$ of linear regression fame. These are the Efron and McFadden pseudo $R^2$ values, respectively, discussed by UCLA. I also give the equations below.

$$ R^2_{\text{Efron}} = 1-\left(\dfrac{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\bar y \right)^2 }\right)\\ R^2_{\text{McFadden}} = 1-\left(\dfrac{ \overset{N}{\underset{i=1}{\sum}}\left( y_i\log(\hat y_i) + (1-y_i)\log(1 - \hat y_i) \right) }{ \overset{N}{\underset{i=1}{\sum}}\left( y_i\log(\bar y) + (1-y_i)\log(1 - \bar y) \right) }\right) $$

Above, the $y_i$ are the true categories, coded as either $0$ or $1$; the $\hat y_i$ are the predicted probabilities of category $1$; and $\bar y$ is the unconditional probability of membership in category $1$ (so just the proportion of $1$s out of all instances).
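If it helps, here is a direct numpy translation of the two formulas (a sketch; `y` holds the 0/1 labels and `p` the predicted probabilities, both illustrative names):

```python
# Sketch: Efron and McFadden pseudo R^2 from labels y (0/1) and predicted probabilities p.
import numpy as np

def efron_r2(y, p):
    return 1 - np.sum((y - p) ** 2) / np.sum((y - y.mean()) ** 2)

def mcfadden_r2(y, p):
    ll_model = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # model log-likelihood
    ll_null = np.sum(y * np.log(y.mean()) + (1 - y) * np.log(1 - y.mean()))  # null model
    return 1 - ll_model / ll_null

y = np.array([0, 0, 1, 1, 1])
p = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
print(efron_r2(y, p), mcfadden_r2(y, p))
```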

Finally, I recommend reading Benavoli et al. (JMLR 2017) on how to compare models across multiple out-of-sample evaluations.$^{\dagger}$ While the article makes a case for using Bayesian methods, it also discusses classical approaches.

REFERENCE

Benavoli, Alessio, et al. "Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis." The Journal of Machine Learning Research 18.1 (2017): 2653-2688.

$^{\dagger}$As one of the comments remarks, there are legitimate reasons not to like out-of-sample predictions unless the sample size is quite large.

Dave