
If you have an unbalanced dataset but assign (inverse) class weights when fitting, does this mean that the model's loss and accuracy metrics will be computed in a way that allows using ROC AUC and accuracy, both metrics that require a balanced dataset?

ROC AUC and accuracy metrics can be misleading if you use an imbalanced dataset: you can achieve high accuracy or ROC AUC by simply selecting the majority class all the time. So, to appropriately measure a model's ability, different metrics such as precision-recall AUC might be more informative. There is a detailed discussion on this topic here.

But if assigning class weights while fitting essentially neutralizes the imbalance, does that mean that the resulting ROC AUC and accuracy metrics can be relied upon?

For example, if your binary classification dataset has a 1:4 class balance but you assign class weights of 4:1 while fitting, the model should weight the minority class 4x as much. This should neutralize the impact of the class imbalance and allow the use of accuracy metrics that rely upon a balanced dataset.
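For concreteness, here is a minimal sketch of what I have in mind, assuming a scikit-learn-style class_weight argument (the dataset and exact numbers are just illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy binary dataset with roughly a 4:1 majority:minority balance
X, y = make_classification(n_samples=10_000, weights=[0.8, 0.2], random_state=0)

# Give the minority class (label 1) four times the weight of the majority class,
# i.e. the inverse of the observed class ratio
clf = LogisticRegression(class_weight={0: 1, 1: 4})
clf.fit(X, y)
```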

Is this reasoning sound?

  • How do you rig ROCAUC to be high by picking the majority class every time? ROCAUC is an evaluation of the probability outputs, regardless of how (or if) we use those probabilities to make hard classifications. // What's wrong with getting a high accuracy score by guessing the majority class every time? // Are you familiar with proper [tag:scoring-rules] and why statisticians do not see class imbalance as an issue? – Dave Dec 17 '21 at 16:47
  • Here's why: imagine you were using an ML model to predict whether a person has brain cancer. You would be very concerned about predicting True Positives and False Negatives, but False Positives are also very important: no one wants to undergo chemo/surgery for a false positive. However, the odds of developing brain cancer are <1%. If I used typical ROC AUC or accuracy, then I could easily get 99% without trying. – user16796559 Dec 17 '21 at 17:30
  • So the naïve model that guesses randomly based on the known class ratio is $99\%$ accurate. What's the problem? Do you mean that $99\%$ accuracy makes it sound like you have an $A$-grade model (like an $A$ in school), even though it is just random guessing? – Dave Dec 17 '21 at 17:59
  • I'll repeat this for the folks in the back row... "the odds of developing brain cancer are <1%". If your model simply defaults to case 0 (i.e. not having brain cancer), then you can achieve >99% accuracy with no effort. The model simply won't learn anything and is useless. – user16796559 Dec 17 '21 at 18:20
  • So you have to do better than that baseline performance in order to have a useful model. What's the problem? – Dave Dec 17 '21 at 18:21
  • Yes. That's the whole point of going about this exercise. 99% accuracy is meaningless if that's baseline. If I can achieve 99.5% accuracy, then that's a substantial improvement. – user16796559 Dec 17 '21 at 18:22
  • I still don't see the problem. Do you mean that you want some metric that tells you what kind of grade your model gets the way that accuracy in the balanced case tells you that $50\%$ is an $\text{F}$ and $99\%$ is an $\text{A}$? // Note that accuracy is problematic when the classes are perfectly balanced, too. – Dave Dec 17 '21 at 18:27
  • As opposed to talking around the original question, can we address it directly? Does assigning class weights affect the accuracy calculation to be able to use metrics meant for balanced datasets? – user16796559 Dec 17 '21 at 18:29
  • You're making an assumption that accuracy is a good metric for balanced datasets. As Kolassa explains in that last link I gave, accuracy is problematic, whether classes are balanced or not. – Dave Dec 17 '21 at 18:32
  • What would you recommend as a better metric? I'm assuming that you're meaning "accuracy" as the same thing as "ROC AUC" and "PR AUC", etc... which they aren't. Is there a better metric for quantifying the quality of a model's predictive ability of a binary classification? – user16796559 Dec 17 '21 at 18:55
  • Did you read Kolassa's answer that I linked? Harrell's blog, linked in the answer, is good to read, too. – Dave Dec 17 '21 at 18:58
  • Just read both posts thoroughly. Both point to accuracy as defined: (TP + TN) / (TP + TN + FP + FN). I 100% agree with these posts. In fact, this is the exact reason why other measures such as ROC AUC and PR AUC were created in the first place. Consequently, these posts answer literally nothing of my intended question. – user16796559 Dec 17 '21 at 20:37
  • So what issue do you see with AUC? I do not see the problem with class imbalance and randomly guessing the prior probability that you see. `set.seed(2021); N <- 10000; p <- 0.01; y <- rbinom(N, 1, p); preds <- rep(mean(y), N); my_roc <- pROC::roc(y, preds); my_roc$auc` I get an AUC of $0.5$ when I have about a $99:1$ class imbalance and always randomly guess based on the class ratio, indicating that the model is a poor one. – Dave Dec 17 '21 at 20:58
  • This is a thoughtful question (+1) and welcome to CV.SE. I think @Dave (+1) correctly questions some of your underlying assumptions regarding the evaluation of AUC-ROC; it is a somewhat big generalisation step to group together AUC-ROC (probability evaluation) and Accuracy (probability thresholding) as far as their usefulness under class imbalance goes. I see it as a well-meaning challenge aiming to highlight that AUC-ROC actually still holds quite a bit of information while Accuracy is, well... almost trivial. :) (I tried to expand on your particular question in my post below.) – usεr11852 Dec 18 '21 at 14:04

1 Answer


Assigning class weights does not allow for the use of Accuracy-like metrics that "require" balanced datasets. That is because, strictly speaking, no performance metric requires a balanced dataset to be calculated; a particular metric (e.g. Accuracy, or Precision) might be almost totally uninformative when applied to a highly imbalanced dataset, but that does not mean it doesn't do what it says "on the tin".
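To make this concrete, here is a minimal sketch (assuming scikit-learn; it mirrors the simulation Dave runs in the comments): both metrics are perfectly computable on a 99:1 dataset, Accuracy just happens to be uninformative there, while AUC-ROC correctly flags the trivial model as useless.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(2021)
y = rng.binomial(1, 0.01, size=10_000)     # roughly a 99:1 class imbalance

# A "model" that always outputs the prior probability of the positive class
p_hat = np.full(y.shape, y.mean())

print(accuracy_score(y, p_hat >= 0.5))     # ~0.99: high, but uninformative
print(roc_auc_score(y, p_hat))             # 0.5: correctly flags a useless model
```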

When we use class weights based on the relative occurrence rates, we effectively try to assign our misclassification costs such that they balance each other out. That is one approach, but not the only one, and it is arguably not the best one either. When working with imbalanced data, we cannot ignore the fact that the utility of our algorithm (i.e. the decisions we will take based on its predictions) will likely not coincide with the abstract metric we used when performing model fitting. That is a fact of life. Re-weighting our instances might give us a metric that represents our final utility somewhat better, but that weight selection needs to be thought through carefully rather than assumed to be correct just because it balances the misclassification costs exactly. If we don't, we once again fall into the traps mentioned in the links Dave provided.
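As a hedged illustration of the difference between mechanically balancing the class ratio and actually thinking about costs (scikit-learn assumed; the cost numbers below are entirely made up):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)

# Inverse-frequency ("balanced") weights: derived purely from the class counts
w_balanced = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], w_balanced)))

# Cost-based weights: hypothetical numbers that should come from the real-world
# cost of each error type, not from the class ratio
cost_false_negative = 50.0   # e.g. missing a positive case is very expensive
cost_false_positive = 5.0    # e.g. a false alarm is comparatively cheap
clf = LogisticRegression(class_weight={0: cost_false_positive,
                                       1: cost_false_negative}).fit(X, y)
```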

Finally, please note that AUC-ROC and Accuracy are not directly comparable. Accuracy needs a hard class assignment (i.e. a decision on the class label) to be evaluated; that is in contrast with AUC-ROC/AUC-PR/Brier score/etc., which evaluate the classifier's probabilistic output directly. (There are some classifiers, like SVMs, that do hard class assignment by default, but I don't consider them in this context.)
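A small sketch of that distinction, assuming scikit-learn's metric functions: Accuracy only exists once you pick a threshold, while the probability-based metrics consume the predicted probabilities as they are.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             average_precision_score, brier_score_loss)

y_true = np.array([0, 0, 0, 0, 1, 1])
p_hat = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])   # predicted P(y = 1)

# Accuracy needs a hard class assignment, i.e. a threshold choice
print(accuracy_score(y_true, p_hat >= 0.5))

# These metrics evaluate the probabilistic output itself, no threshold needed
print(roc_auc_score(y_true, p_hat))
print(average_precision_score(y_true, p_hat))   # a common stand-in for AUC-PR
print(brier_score_loss(y_true, p_hat))
```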

usεr11852
  • Note that weighting based on misclassification cost could be applied to balanced data sets, too. – Dave Dec 18 '21 at 15:20
  • Yes, of course (+1). In general, misclassification costs are very often an important (usually business) choice to make. For example, in a "good client" (e.g. clients spending between $500 and $5000 on our website) detection algorithm I once worked on, misclassification costs were directly associated with a client's spend. So if we misclassified a client who, say, bought $600 worth of items during our evaluation period, we incurred a misclassification cost of 600, while if we misclassified one who spent $4,000 we incurred a misclassification cost of 4,000. (A toy sketch of this kind of spend-based weighting appears at the end of this thread.) – usεr11852 Dec 18 '21 at 16:17
  • Thank you 11852. This is helpful. My dataset is not unbalanced over the entire series, but it is unbalanced when stratifying training and validation sets (i.e. the training set is biased with more class zeroes and the validation set is biased with significantly more ones). The dataset is a time series, so I must preserve the order of training on "older" data and validating on "newer" data. When adjusting for class imbalance in the training set only, I'm having a difficult time interpreting the validation data accordingly. Any suggestions? – user16796559 Dec 19 '21 at 23:22
  • I am glad I could help. Please note that in the situation you just described it is even more important to think of misclassification costs. We might have a particular type of data drift called "concept drift", where our prediction target changes (or, in this case, becomes more rare); that means we really need to invest in the generalisation abilities of the selected learner. – usεr11852 Dec 19 '21 at 23:33
  • This has been difficult to say the least. I've tried a few approaches, which appear to be helping: (1) using a sliding window for training, (2) small batch sizes, (3) significantly reducing the number of epochs the model can train on a particular window before sliding, (4) feature engineering to address drift (e.g. deflating), etc. I suppose much of this is like online learning, except I didn't freeze layers before sliding, which could be contributing to "catastrophic interference". Have you read any decent papers on the subject? Thank you, this has given me lots to think about. – user16796559 Dec 20 '21 at 14:09
  • Gama et al. (2014), A survey on concept drift adaptation, and Ditzler et al. (2015), Learning in Nonstationary Environments: A Survey, are pretty well cited and will give you a good overview of the matter. After that, we would need to be more application-specific. – usεr11852 Dec 21 '21 at 14:37
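Regarding the spend-based misclassification costs described in the comments above, here is a toy sketch, assuming scikit-learn's sample_weight argument; every feature, label, and spend value below is made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1_000
X = rng.normal(size=(n, 5))                 # made-up client features
y = rng.binomial(1, 0.3, size=n)            # made-up "good client" labels
spend = rng.uniform(500, 5_000, size=n)     # made-up spend per client

# Each client's spend is used as their instance weight, so misclassifying a
# $4,000 client costs the fit roughly 6-7x as much as misclassifying a $600 one
clf = LogisticRegression()
clf.fit(X, y, sample_weight=spend)
```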