2

I am doing binary classification (labels 0 or 1) on a set of features with LightGBM and XGBoost. Both models give AUC scores roughly in the 0.85 range, which seems good. But the $RMSE$ is around 0.32, which seems too high, and the $R^2$ score on test data is $-0.35$, which would mean the features I'm using are terrible at predicting the label.

I don't think I really understand whether $RMSE$/$R^2$ are appropriate for binary classification. Should I just stick with the AUC score, or should I be wary of what $RMSE$/$R^2$ say about the model?
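For reference, a minimal sketch of how I compute all three metrics from the same predicted probabilities (assuming scikit-learn and LightGBM's sklearn API; the data and model settings below are synthetic placeholders, not my actual pipeline):

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real feature set
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LGBMClassifier().fit(X_train, y_train)
p = model.predict_proba(X_test)[:, 1]  # predicted P(y = 1)

print("AUC :", roc_auc_score(y_test, p))
print("RMSE:", np.sqrt(mean_squared_error(y_test, p)))
print("R^2 :", r2_score(y_test, p))  # computed as 1 - MSE / Var(y_test)
```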

  • @pairwiseseq What is the variance of $y$ (0/1) in the test data? Is it greater than $0.32^2$? How did you compute $R^2$? I assume as $1 - MSE/\mathrm{var}(y)$ on the test set, because you obtained a value below 0. $R^2 < 0$ indicates the model predicts worse than chance, in which case AUC should be below 0.5 (on the test data; the reported AUC might be for the training data). Note that some functions for computing AUC may switch class labels if AUC < 0.5. Also check the correlation between the predicted probabilities and the observed (0/1) response in your test set: if it is negative, overfitting occurred; if positive, population drift has likely occurred between the training and test samples. – Marjolein Fokkema Feb 28 '21 at 10:16

3 Answers

4

I think AUC is more appropriate for binary classifiers. I personally prefer the Gini coefficient, which is simply a restatement of the AUC: $Gini = 2 \cdot AUC - 1$, so Gini ranges between 0 and 1 while AUC ranges between 0.5 and 1. RMSE is more appropriate when the target variable is continuous. For example, if you were validating a linear model in-sample through k-fold cross-validation, RMSE would be a more suitable metric for assessing model performance.

Think about it like this: since you're constructing a $\textbf{binary}$ classifier, you're interested in how well you can separate two groups, the group of 0's and the group of 1's. AUC and Gini measure exactly how well you can separate these two groups, so to me at least it seems more appropriate to use AUC and Gini.
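As a quick illustration (a sketch assuming scikit-learn; the labels and scores below are synthetic), Gini is just a rescaling of AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)            # binary labels
y_score = y_true * 0.4 + rng.random(1000) * 0.6   # noisy scores correlated with the labels

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1  # maps AUC's [0.5, 1] range onto Gini's [0, 1]
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")
```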

ralph
  • The Gini index is often (at least in decision-tree methods) evaluated only on training data. It measures the deviation between predicted probabilities and the observed response. If one wanted to evaluate it on test data, MSE on predicted probabilities (a.k.a. the Brier score) would be a good way to quantify this; it also measures the deviation between predicted probabilities and the observed response. Both Gini and MSE are based on the variance of a binomial variable. – Marjolein Fokkema Feb 28 '21 at 10:18
  • The Gini index can be evaluated on either training or testing data. It doesn't have to be exclusively used on the training data. – ralph Feb 28 '21 at 23:17
  • Yes, indeed. I am curious how it would be computed, because going by this definition: https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity, all class labels could be switched between training and test data and the impurity would remain exactly the same. That seems quite ambiguous w.r.t. the quality of the predictions. Perhaps a different computation would be used? I'm not sure; I would like to know! – Marjolein Fokkema Mar 01 '21 at 10:30
0

The ROC AUC has a number of nice statistical properties and is a good metric for binary outcomes. It is what I use most of the time, unless there is a huge class imbalance, in which case I use the precision-recall (PR) curve instead.

Since you have probabilistic outputs, people also use the Brier score or the log-likelihood to assess performance. Frank Harrell prefers these approaches because they don't dichotomise the model's output; i.e., a probabilistic model isn't the same as a classifier like KNN.
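As a sketch (assuming scikit-learn; the probabilities below are made up for illustration), both scores operate directly on the predicted probabilities rather than on hard 0/1 labels:

```python
from sklearn.metrics import brier_score_loss, log_loss

y_true = [0, 1, 1, 0, 1]
y_prob = [0.1, 0.8, 0.6, 0.3, 0.9]  # hypothetical predicted P(y = 1)

print("Brier score:", brier_score_loss(y_true, y_prob))  # mean squared error of the probabilities
print("Log loss:   ", log_loss(y_true, y_prob))          # negative mean log-likelihood
```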

0

$R^2$ only has the usual "proportion of variability explained" interpretation in the case of a linear model (and not even all linear models). Further, $R^2$ and $MSE$ convey the same information (perhaps depending on how you calculate out-of-sample $R^2$), and $MSE$ and $RMSE$ (obviously) convey the same information. Thus, the discussion becomes one of $MSE$ vs. $AUC$.
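To make that equivalence concrete, under one common out-of-sample convention (an assumption; definitions vary),

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{MSE}{\widehat{\operatorname{Var}}(y)},$$

so on a fixed test set $R^2$ is a monotone transformation of $MSE$, and $R^2 < 0$ simply means the $MSE$ exceeds the variance of the labels.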

$MSE$ computed on predicted probabilities is equivalent to something called the Brier score, which is a strictly proper scoring rule; $AUC$ is not a strictly proper scoring rule. Thus, we would prefer the model with the better (lower) Brier score.
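A quick numerical check of that equivalence (a sketch assuming scikit-learn, with made-up probabilities):

```python
import numpy as np
from sklearn.metrics import brier_score_loss, mean_squared_error

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.2, 0.7, 0.9, 0.4])  # hypothetical predicted P(y = 1)

# MSE on predicted probabilities is exactly the Brier score
assert np.isclose(mean_squared_error(y_true, y_prob),
                  brier_score_loss(y_true, y_prob))
```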

Dave