Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?

Question

TL;DR

See title.

Motivation

I am hoping for a canonical answer along the lines of "(1) No, (2) Not applicable, because (1)", which we can use to close many wrong questions about unbalanced datasets and oversampling. I would be quite as happy to be proven wrong in my preconceptions. Fabulous Bounties await the intrepid answerer.

My argument

I am baffled by the many questions we get in the unbalanced-classes tag. Unbalanced classes seem to be self-evidently bad. And oversampling the minority class(es) is quite as self-evidently seen as helping to address the self-evident problems. Many questions that carry both tags proceed to ask how to perform oversampling in some specific situation.

I understand neither what problem unbalanced classes pose, nor how oversampling is supposed to address these problems.

In my opinion, unbalanced data do not pose a problem at all. One should model class membership probabilities, and these may be small. As long as they are correct, there is no problem. One should, of course, not use accuracy as a KPI to be maximized in a classification problem. Or calculate classification thresholds. Instead, one should assess the quality of the entire predictive distribution using proper scoring-rules. Tetlock's Superforecasting serves as a wonderful and very readable introduction to predicting unbalanced classes, even if this is nowhere explicitly mentioned in the book.

What problem does oversampling, undersampling, and SMOTE solve? IMO, this question does not have a satisfactory answer. (Per my suspicion, this may be because there is no problem.)
When is unbalanced data really a problem in Machine Learning? The consensus appears to be "it isn't". I'll probably vote to close this question as a duplicate of that one.

IcannotFixThis' answer, seems to presume (1) that the KPI we attempt to maximize is accuracy, and (2) that accuracy is an appropriate KPI for classification model evaluation. It isn't. This may be one key to the entire discussion.

AdamO's answer focuses on the low precision of estimates from unbalanced factors. This is of course a valid concern and probably the answer to my titular question. But oversampling does not help here, any more than we can get more precise estimates in any run-of-the-mill regression by simply duplicating each observation ten times.

What is the root cause of the class imbalance problem? Some of the comments here echo my suspicion that there is no problem. The single answer again implicitly presumes that we use accuracy as a KPI, which I find unsatisfactory.
Are there Imbalanced learning problems where re-balancing/re-weighting demonstrably improves accuracy? is related, but presupposes accuracy as an evaluation measure. (Which I argue is not a good choice.)

Summary

The threads above can apparently be summarized as follows.

Rare classes (both in the outcome and in predictors) are a problem, because parameter estimates and predictions have high variance/low precision. This cannot be addressed through oversampling. (In the sense that it is always better to get more data that is representative of the population, and selective sampling will induce bias per my and others' simulations.)
Rare classes are a "problem" if we assess our model by accuracy. But accuracy is not a good measure for assessing classification models. (I did think about including accuracy in my simulations, but then I would have needed to set a classification threshold, which is a closely related wrong question, and the question is long enough as it is.)

An example

Let's simulate for an illustration. Specifically, we will simulate ten predictors, only a single one of which actually has an impact on a rare outcome. We will look at two algorithms that can be used for probabilistic classification: logistic-regression and random-forests.

In each case, we will apply the model either to the full dataset, or to an oversampled balanced one, which contains all the instances of the rare class and the same number of samples from the majority class (so the oversampled dataset is smaller than the full dataset).

For the logistic regression, we will assess whether each model actually recovers the original coefficients used to generate the data. In addition, for both methods, we will calculate probabilistic class membership predictions and assess these on holdout data generated using the same data generating process as the original training data. Whether the predictions actually match the outcomes will be assessed using the Brier score, one of the most common proper scoring rules.

We will run 100 simulations. (Cranking this up only makes the beanplots more cramped and makes the simulation run longer than one cup of coffee.) Each simulation contains $n=10,000$ samples. The predictors form a $10,000\times 10$ matrix with entries uniformly distributed in $[0,1]$. Only the first predictor actually has an impact; the true DGP is

$$ \text{logit}(p_i) = -7+5x_{i1}. $$

This makes for simulated incidences for the minority TRUE class between 2 and 3%:

Let's run the simulations. Feeding the full dataset into a logistic regression, we (unsurprisingly) get unbiased parameter estimates (the true parameter values are indicated by the red diamonds):

However, if we feed the oversampled dataset to the logistic regression, the intercept parameter is heavily biased:

Let's compare the Brier scores between models fitted to the "raw" and the oversampled datasets, for both the logistic regression and the Random Forest. Remember that smaller is better:

In each case, the predictive distributions derived from the full dataset are much better than those derived from an oversampled one.

I conclude that unbalanced classes are not a problem, and that oversampling does not alleviate this non-problem, but gratuitously introduces bias and worse predictions.

Where is my error?

A caveat

I'll happily concede that oversampling has one application: if

we are dealing with a rare outcome, and
assessing the outcome is easy or cheap, but
assessing the predictors is hard or expensive

Let me emphasize that this is about oversampling in the data collection phase, emphatically not about taking an already-collected dataset and discarding data (undersampling) or duplicating data (oversampling).

A prime example would be genome-wide association studies (GWAS) of rare diseases. Testing whether one suffers from a particular disease can be far easier than genotyping their blood. (I have been involved with a few GWAS of PTSD.) If budgets are limited, it may make sense to screen based on the outcome and ensure that there are "enough" of the rarer cases in the sample.

However, then one needs to balance the monetary savings against the losses illustrated above - and my point is that the questions on unbalanced datasets at CV do not mention such a tradeoff, but treat unbalanced classes as a self-evident evil, completely apart from any costs of sample collection.

R code

    library(randomForest)
    library(beanplot)
nn_train &lt;- nn_test &lt;- 1e4
n_sims &lt;- 1e2

true_coefficients &lt;- c(-7, 5, rep(0, 9))

incidence_train &lt;- rep(NA, n_sims)
model_logistic_coefficients &lt;- 
     model_logistic_oversampled_coefficients &lt;- 
     matrix(NA, nrow=n_sims, ncol=length(true_coefficients))

brier_score_logistic &lt;- brier_score_logistic_oversampled &lt;- 
  brier_score_randomForest &lt;- 
brier_score_randomForest_oversampled &lt;- 
  rep(NA, n_sims)

pb &lt;- winProgressBar(max=n_sims)
for ( ii in 1:n_sims ) {
    setWinProgressBar(pb,ii,paste(ii,&quot;of&quot;,n_sims))
    set.seed(ii)
    while ( TRUE ) {    # make sure we even have the minority 
                        # class
        predictors_train &lt;- matrix(
          runif(nn_train*(length(true_coefficients) - 1)), 
              nrow=nn_train)
        logit_train &lt;- 
         cbind(1, predictors_train)%*%true_coefficients
        probability_train &lt;- 1/(1+exp(-logit_train))
        outcome_train &lt;- factor(runif(nn_train) &lt;= 
                 probability_train)
        if ( sum(incidence_train[ii] &lt;- 
           sum(outcome_train==TRUE))&gt;0 ) break
    }
    dataset_train &lt;- data.frame(outcome=outcome_train, 
                      predictors_train)

    index &lt;- c(which(outcome_train==TRUE),  
      sample(which(outcome_train==FALSE),   
            sum(outcome_train==TRUE)))

    model_logistic &lt;- glm(outcome~., dataset_train, 
                family=&quot;binomial&quot;)
    model_logistic_oversampled &lt;- glm(outcome~., 
          dataset_train[index, ], family=&quot;binomial&quot;)

    model_logistic_coefficients[ii, ] &lt;- 
           coefficients(model_logistic)
    model_logistic_oversampled_coefficients[ii, ] &lt;- 
      coefficients(model_logistic_oversampled)

    model_randomForest &lt;- randomForest(outcome~., dataset_train)
    model_randomForest_oversampled &lt;- 
      randomForest(outcome~., dataset_train, subset=index)

    predictors_test &lt;- matrix(runif(nn_test * 
        (length(true_coefficients) - 1)), nrow=nn_test)
    logit_test &lt;- cbind(1, predictors_test)%*%true_coefficients
    probability_test &lt;- 1/(1+exp(-logit_test))
    outcome_test &lt;- factor(runif(nn_test)&lt;=probability_test)
    dataset_test &lt;- data.frame(outcome=outcome_test, 
                     predictors_test)

    prediction_logistic &lt;- predict(model_logistic, dataset_test, 
                                    type=&quot;response&quot;)
    brier_score_logistic[ii] &lt;- mean((prediction_logistic - 
           (outcome_test==TRUE))^2)

    prediction_logistic_oversampled &lt;-      
           predict(model_logistic_oversampled, dataset_test, 
                    type=&quot;response&quot;)
    brier_score_logistic_oversampled[ii] &lt;- 
      mean((prediction_logistic_oversampled - 
            (outcome_test==TRUE))^2)

    prediction_randomForest &lt;- predict(model_randomForest, 
        dataset_test, type=&quot;prob&quot;)
    brier_score_randomForest[ii] &lt;-
      mean((prediction_randomForest[,2]-(outcome_test==TRUE))^2)

    prediction_randomForest_oversampled &lt;-   
                     predict(model_randomForest_oversampled, 
                              dataset_test, type=&quot;prob&quot;)
    brier_score_randomForest_oversampled[ii] &lt;- 
      mean((prediction_randomForest_oversampled[, 2] - 
            (outcome_test==TRUE))^2)
}
close(pb)

hist(incidence_train, breaks=seq(min(incidence_train)-.5, 
        max(incidence_train) + .5),
  col=&quot;lightgray&quot;,
  main=paste(&quot;Minority class incidence out of&quot;, 
                nn_train,&quot;training samples&quot;), xlab=&quot;&quot;)

ylim &lt;- range(c(model_logistic_coefficients, 
               model_logistic_oversampled_coefficients))
beanplot(data.frame(model_logistic_coefficients),
  what=c(0,1,0,0), col=&quot;lightgray&quot;, xaxt=&quot;n&quot;, ylim=ylim,
  main=&quot;Logistic regression: estimated coefficients&quot;)
axis(1, at=seq_along(true_coefficients),
  c(&quot;Intercept&quot;, paste(&quot;Predictor&quot;, 1:(length(true_coefficients) 
         - 1))), las=3)
points(true_coefficients, pch=23, bg=&quot;red&quot;)

beanplot(data.frame(model_logistic_oversampled_coefficients),
  what=c(0, 1, 0, 0), col=&quot;lightgray&quot;, xaxt=&quot;n&quot;, ylim=ylim,
  main=&quot;Logistic regression (oversampled): estimated 
          coefficients&quot;)
axis(1, at=seq_along(true_coefficients),
  c(&quot;Intercept&quot;, paste(&quot;Predictor&quot;, 1:(length(true_coefficients) 
         - 1))), las=3)
points(true_coefficients, pch=23, bg=&quot;red&quot;)

beanplot(data.frame(Raw=brier_score_logistic, 
        Oversampled=brier_score_logistic_oversampled),
  what=c(0,1,0,0), col=&quot;lightgray&quot;, main=&quot;Logistic regression: 
         Brier scores&quot;)
beanplot(data.frame(Raw=brier_score_randomForest, 
  Oversampled=brier_score_randomForest_oversampled),
  what=c(0,1,0,0), col=&quot;lightgray&quot;, 
          main=&quot;Random Forest: Brier scores&quot;)

I've got more or less the same question: https://stats.stackexchange.com/questions/285231/what-problem-does-oversampling-undersampling-and-smote-solve ! — Matthew Drury, Jul 16 '18 at 21:41
I've also ran the same simulation, with an even wider selection of models, and a wider range of prior class probabilities, and observed the same results. Additionally, if you measure the AUC of your models, you'll notice that they are all the same, regardless of the class balance of your training data. I wonder about the source of this wide conception on the evils of class balance, where did it come from, how did we get to this point? — Matthew Drury, Jul 16 '18 at 21:44
And: the question isn't how we got to this point, but how do we get away from it??? — Stephan Kolassa, Jul 16 '18 at 21:45
Agreed, but I still think the "how did we get here" problem is interesting! — Matthew Drury, Jul 16 '18 at 21:46
I just see that I had already upvoted your question. And Tim's question you link to. I am getting old. Or it may be the alcohol. — Stephan Kolassa, Jul 16 '18 at 21:49
Honestly, knowing there is someone else out there that is mystified by the endless class balancing questions is comforting. — Matthew Drury, Jul 16 '18 at 21:52
"How did we get here?" is a great question. I don't know the definitive answer. But my hunch is that this all started when the machine learning community was only concerned with accuracy. Eventually someone pointed out that stupidly high accuracy can be achieved if (1) your classes are severely imbalanced and (2) you predict the majority class. Instead of measuring model quality with a metric other than accuracy, oversampling/SMOTE/etc were all invented to "solve" this problem. This isn't a history, just a story I made up based on my impressions and observable evidence. — Sycorax, Jul 16 '18 at 21:55
@Sycorax That is also my take on the tragedy. Combined with a lot of inherited wisdom digested without reflection. — Matthew Drury, Jul 16 '18 at 21:57
Is "how did we get here" a great question in the "Yes, ask that question on CV" sense, or in the, "I idly wonder about that also" sense? — Matthew Drury, Jul 16 '18 at 21:58
I imagine that unbalance data can screw up estimates for variance or dispersion. But maybe I am thinking with the wrong, non-machine-learning, perspective. So I look up the wikipedia https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis and don't they describe a case which is more like 'if your predictor1 would not be sampled from the true population in an unbalanced (possibly biased) way'? — Sextus Empiricus, Jul 16 '18 at 21:58
The fixation on accuracy is reflected in some software, too, like Breiman's randomForest having built-in method to measure OOB accuracy but no other metric. This has some depressing consequences: people will tune the number of trees in a random forest for no reason other than they've been systematically mislead into thinking that doing so is meaningful. More discussion on this point wrt random forest: https://stats.stackexchange.com/questions/348245/do-we-have-to-tune-the-number-of-trees-in-a-random-forest — Sycorax, Jul 16 '18 at 22:01
I think a large part of this comes from "big data." For rare events, you need a lot of data, and perhaps before (say, 20 years ago), we saw less class imbalance because you'd have laughably few positive examples in your dataset, hence wouldn't even try using it. Nowedays you might easily have a dataset with millions of rows and say, a few hundred positive examples. — Alex R., Jul 16 '18 at 22:04
@StephanKolassa It is a bit difficult to find it in the code and the text, but are you comparing in those plots unbalanced data versus oversampled unbalanced data, or are you comparing balanced data versus oversampled unbalanced data? — Sextus Empiricus, Jul 16 '18 at 22:09
@Sycorax Don't get me started on the sklearn's developers decision to map the predict method on models to the hard decision rule thresholding the probabilities at 0.5. — Matthew Drury, Jul 16 '18 at 22:11
@MartijnWeterings: I'm comparing an unbalanced sample of $n=10^4$ against a balanced subsample, which I get by taking all the (2-3% minority classes) plus an equal number sampled from the majority class. — Stephan Kolassa, Jul 16 '18 at 22:11
@MatthewDrury: For even more hair-pulling, try obtaining confidence intervals for a logistic regression in sklearn (hint: you can't). — Alex R., Jul 16 '18 at 22:13
@MatthewDrury: R's predict.randomForest() does the same by default, though you can at least specify type="prob". — Stephan Kolassa, Jul 16 '18 at 22:13
In my experiments I build some samplers where I could adjust the prior class probabilities. I sampled datasets at 25 values of the prior class probabilities from 0.5 up to 0.99. Then I split into train and test, fit models to the train, and evaluated on the test. Using the AUC, there was no degradation in performance. I selected the AUC since it has the same baseline value regardless of prior class probability (while say, log-loss, changes baseline). I did this over many parameters in the sampler, which changed the structure of the data considerable, and for many different models. — Matthew Drury, Jul 16 '18 at 22:14
Your caveats are definitely prime examples of the need for over/under sampling. A good example is running word embeddings (say, Word2Vec), where there are massive class imbalances and rare occurances, which would otherwise get washed away without correcting for sampling. Keep in mind that oversampling doesn't only improve models in terms of their accuracy, but more importantly it speeds up model training especially for non-convex optimizations. — Alex R., Jul 16 '18 at 22:15
@StephanKolassa but is that really balancing when 2-3% minority class is the true representation of the population? Is balanced vs unbalanced about whether or not the groups are all equal numbered or whether the groups are representing the population? The more interesting case would be to see what happens when you increase the weights of the majority class in the under-sampled samples in order to get back to a proper representation. — Sextus Empiricus, Jul 16 '18 at 22:27
@MartijnWeterings To the vast majority of users of this site asking about class balancing, it's about having the positive and negative classes equally represented, which takes them AWAY from the population representation. — Matthew Drury, Jul 16 '18 at 22:29
This is really a duplicate of @MatthewDrury's question. However, that one did not get a satisfactory answer so +1. Maybe you should answer each other's questions with your simulations :-) Stephan, regarding your simulation: I don't understand where the bias in the oversampled results comes from. Why is the intercept estimate biased and why is the Brier score worse? I'd naively expect oversampling to not matter on average, at least in this setting. — amoeba, Jul 16 '18 at 22:31
@amoeba: Why wouldn't it change it? If anything oversampling distorts the original data distribution so it is "expected" that the baseline (i.e. the intercept) is shifted. If anything when reading the post I immediately thought "Yeah, obvious... Tell me about Predictor 1" — usεr11852, Jul 16 '18 at 22:43
@amoeba, you are changing the bias if you shift representation of the groups. The predictors are supposed to represent some probability for the one or other class (which should be 2-3% versus 97-98% and not 50-50%). To me this is a false idea of balancing. Balancing is correct when done correct. If one thing then this example actually shows how unbalanced (not 50-50, but instead unbalanced from that different interpretation) are indeed problematic because they create bias. — Sextus Empiricus, Jul 16 '18 at 22:46
If you increase the one class from it's true 2-3% to 50% then certainly the baseline may go up. — Sextus Empiricus, Jul 16 '18 at 22:49
@usεr11852 You are right. I was too quick to post my comment. However, here is what I had in mind: it is indeed no surprise that Brier score after oversampling is worse and also that the coefficients are wrong. As you say, this is because oversampling explicitly changes the baseline probability. But if somebody is doing oversampling then they are not interested in the correct probabilistic predictions. They are probably interested in accuracy. So, a question to Stephan: does the accuracy (conditioned on class 1 and 2) become lower for model_logistic_oversampled compared to model_logistic? — amoeba, Jul 17 '18 at 05:10
@amoeba: for the sake of this question I'd argue that one of the problems with accuracy (besides not being a proper scoring rule) is that there isn't one accuracy in the sense that unless sensitivity and specificity happen to be the equal, accuracy depends on the relative class frequencies. We thus get a "self-fullfilling prophecy": someone who oversamples for training will typically not look at accuracy at the natural relative frequencies but at accuracy for similarly oversmapled data. Thus, training and verifying a model that proper validation would deem irrelevant. — cbeleites unhappy with SX, Sep 24 '19 at 08:45
@StephanKolassa Thanks for this post, but I want to add a few more caveats to your point about oversampling. In some cases, the collected sample is not representative of the population; we may care more about the minority class (anomaly detection); or we simply may not wish to replicate the population. An example of the latter that comes to mind is the incident with Amazon's hiring AI which turned out to be sexist, presumably because it was trained on their already sexist database of employees. In such cases, it makes all the sense to rebalance your dataset to train your algorithms. — Michael, Jan 23 '21 at 13:11
What does this sentence mean "One should model class membership probabilities, and these may be small" ? — Minsky, Jan 25 '21 at 19:20
@Minsky: in a two-class problem, what we should be interested in is the probability for an instance to belong to class A or B, conditional on the predictor values for that instance. In an "unbalanced" problem, these probabilities are small. But they may still be influenced by predictors. For example, the probability of defaulting on a loan may be 0.01 overall, but if a particular applicant has low income, low assets, no job and a history of defaulting, this probability may be 0.3 for this particular instance. (And 0.001 for someone with better characteristics.) ... — Stephan Kolassa, Jan 26 '21 at 07:04
... Note that even the "high risk" may have a risk lower than 0.5! Next, we need to make decisions based on these probabilities. We may offer the good risk better conditions, or not offer a loan to the bad risk at all. This decision should depend on the predicted probabilities of defaulting, and also the costs involved. (More precisely, we should model the probability of defaulting after a certain time, when some of the loan has already been repaid, so we have already had some of the principal and interest repaid.) — Stephan Kolassa, Jan 26 '21 at 07:07
I struggle to see how it is not an issue at some points - take e.g a Random Forest with Gini as splitting criteria. Imbalanced data here really can mess up the splits, due to the definition. I read this post/answer as "imbalanced data isn't a problem when handled correctly" which of course is a trivial statement, which can be applied to everything. The issue is often that the fitting of most classifiers are written for balanced data - and I wonder how that it is not an issue (if not handled)? — CutePoison, Jul 10 '21 at 17:31
@CutePoison: you are completely right that it's trivial that "unbalanced" data are not a problem when treated correctly - and also that many classifiers work only for balanced data. (Most of these IMO written by people with little statistical understanding.) Which causes no end of issues, mainly if classifiers are evaluated using accuracy and similar - witness almost daily questions here on CV. And yes, the answer is simple: don't use inappropriate models, or KPIs. This is not rocket science, but it's still apparently less well known than it should be. — Stephan Kolassa, Jul 12 '21 at 08:27
I was wondering the same thing. I had great accuracy since my stroke target was imbalanced 95/5 with the [0, no stroke] outcome as the majority. My baseline as instructed by my university prof was 95% and I cringed at the thought of trying to build a model to surpass 95%. Originally I downsampled the majority but still couldn't get better than baseline so I switched to upsampling the minority. After that I was able to get 98% using CART, KKN. But the whole time I felt like a fraud because of balancing and aiming for near 100% which is obviously insane. — Edison, Aug 29 '21 at 05:34
@Edison: per this thread, don't use accuracy. Also: Is accuracy an improper scoring rule in a binary classification setting? and Classification probability threshold. Instead, use probabilistic classifications, and evaluate these using proper scoring rules. — Stephan Kolassa, Aug 29 '21 at 08:46
@StephanKolassa Thanks for that. Btw, for binary classification, can we use recall and precision as mentioned in other threads if we don't want to use Scoring Rules? And what if I have already upsampled my minority outcome? Is it then ok to use accuracy or recall or precision? — Edison, Aug 29 '21 at 12:41
@Edison: you can use them, as in "you can choose to shoot yourself in the foot". It's still not a good idea, because optimizing them will give you biased estimates and predictions, completely analogously to how optimizing accuracy will (as in: if you have no useful predictors, then optimizing accuracy will automatically lead you to always predicting the majority class). — Stephan Kolassa, Aug 29 '21 at 12:52
Just to be clear because I'm still a student, you are saying I should not use (accuracy, recall, precision) for binary classification in any circumstance even if I have a balanced target or have balanced the target myself? In binary classification only use Scoring Rules? Would using recall or precision be better than using accuracy or would it be just as detrimental? — Edison, Aug 29 '21 at 13:07
@Edison: yes, that is exactly what I am recommending. This thread may be helpful. I also do not see why recall or precision should be preferable to accuracy. (My nagging suspicion is that someone noticed problems with accuracy and looked for some other KPI that looked like adding it to the mix would improve matters - without digging deeply enough and noticing that the problem lies with hard 0-1 classifications in the first place.) — Stephan Kolassa, Aug 29 '21 at 13:16
Thanks. I'm so glad we are being taught to use accuracy for binary classification in my graduate program ;) — Edison, Aug 29 '21 at 13:50
@Sycorax: I think you are right. It might have been started with the ML community. However to me it's not about accuracy but convergence. Despite having this conversation (and other great ones about proper metrics and synthetic data) in mind I tried some class weights in my latest "Deep NN" experiment. I haven't completely sorted it out, but class weights seems to have an impact on convergence. Sometimes it leads to faster convergence, sometimes it helps avoid local minima and get better performance. It also seems to allievate 'seed-dependance' phenomenon. — Lucas Morin, Sep 08 '21 at 10:16
All this to say that there might be some empirical evidence from the ML community in favor of rebalancing classes. And the point where it would be usefull (reducing convergence time, reducing dependance to initialisation, avoiding local minima traps) are not covered with the logistic example above. — Lucas Morin, Sep 08 '21 at 10:21
@lcrmorin This paper may be of interest What is the Effect of Importance Weighting in Deep Learning? by Jonathon Byrd, Zachary Lipton. — Sycorax, Sep 08 '21 at 13:31
Thank you for sharing. Somehow I wasn't able to find it when I was looking for litterature. It does seem to confirm there are positive impacts of weighting for DL. In short: weigthing implies earlier convergence, interacts with L2 regularisation, batch norm but not drop out. — Lucas Morin, Sep 08 '21 at 14:23
This should probably be reformatted so the bulk of the question becomes an answer. Right now it can't be used as a target when flagging duplicates — Hong Ooi, Sep 16 '21 at 13:23
@Sycorax another element of "how did we get here" might be: there's tons of recent academic papers about SMOTE and its close relatives, but I cannot find a single good reference paper that I can point people to and that explains the issue and the "correct" way of solving these problems clearly. — Eike P., Nov 11 '21 at 19:53
General comment about statistical analysis: any method that disrespects the original sample size and how the sample came about is bogus. — Frank Harrell, Jan 05 '22 at 13:42
I only just noticed this, but could you please remove the rm(list=ls()) line from your code? — Dave, Jan 05 '22 at 16:31
@Dave: to be honest, I would rather keep it in. I have too often seen "reproducible" code that only ran because the R workspace contained something the poster forgot to define in their code. I'll take it out because you asked so nicely... — Stephan Kolassa, Jan 05 '22 at 16:59
The Brier score can be partially remedied by adding the bias term induced by the resampling (provided in the linked answer); maybe add the violin plot with that? — Ben Reiniger, Oct 01 '22 at 16:06
For some evidence where it can improve (AU)ROC, see https://stats.stackexchange.com/q/128777/232706 — Ben Reiniger, Oct 01 '22 at 16:06
@StephanKolassa Would you be open to rewording your "Caveat" in the question, to make a very clear distinction between (1) oversampling as part of model fitting after you've already collected the full dataset (which I agree is a bad idea), vs (2) oversampling rare cases during data collection itself (which can be a good idea)? To me, (2) is not a caveat to (1) but rather an entirely different situation. — civilstat, May 03 '23 at 14:37
Possibly interesting: van den Goorbergh, Ruben, et al. "The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression." Journal of the American Medical Informatics Association 29.9 (2022): 1525-1534. — Dave, Nov 08 '23 at 00:37
@civilstat: it took me half a year, but yes, that is a good point, and I edited the post to reflect that. — Stephan Kolassa, Nov 14 '23 at 15:07
Just a question. Could this make a larger difference when we train models via stochastic gradient descent with complicated loss landscapes? E.g. if we have a low density of positive examples (and do not use class weights), the model will learn only very late in the training from them. (sorry if this point was already addressed, this thread is huge..) — Thomas, Feb 05 '24 at 10:25

Dikran Marsupial · Accepted Answer · 2022-01-08T09:30:22.427

I'd like to start by seconding a statement in the question:

... my point is that the questions on unbalanced datasets at CV do not mention such a tradeoff, but treat unbalanced classes as a self-evident evil, completely apart from any costs of sample collection.

I also have the same concern, my questions here and here are intended to invite counter-evidence that it is a "self-evident evil" the lack of answers (even with a bounty) suggests it isn't. A lot of blog posts and academic papers don't make this clear either. Classifiers can have a problem with imbalanced datasets, but only where the dataset is very small, so my answer is concerned with exceptional cases, and does not justify resampling the dataset in general.

There is a class imbalance problem, but it is not caused by the imbalance per se, but because there are too few examples of the minority class to adequately describe it's statistical distribution. As mentioned in the question, this means that the parameter estimates can have high variance, which is true, but that can give rise to a bias in favour of the majority class (rather than affecting both classes equally). In the case of logistic regression, this is discussed by King and Zeng,

3 Gary King and Langche Zeng. 2001. “Logistic Regression in Rare Events Data.” Political Analysis, 9, Pp. 137–163. https://j.mp/2oSEnmf

[In my experiments I have found that sometimes there can be a bias in favour of the minority class, but that is caused by wild over-fitting where the class-overlap dissapears due to random sampling, so that doesn't really count and (Bayesian) regularisation ought to fix that]

The good thing is that MLE is asymptotically unbiased, so we can expect this bias against the minority class to go away as the overall size of the dataset increases, regardless of the imbalance.

As this is an estimation problem, anything that makes estimation more difficult (e.g. high dimensionality) seems likely to make the class imbalance problem worse.

Note that probabilistic classifiers (such as logistic regression) and proper scoring rules will not solve this problem as "popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events" 3. This means that your probability estimates will not be well calibrated, so you will have to do things like adjust the threshold (which is equivalent to re-sampling or re-weighting the data).

So if we look at a logistic regression model with 10,000 samples, we should not expect to see an imbalance problem as adding more data tends to fix most estimation problems.

So an imbalance might be problematic, if you have an extreme imbalance and the dataset is small (and/or high dimensional etc.), but in that case it may be difficult to do much about it (as you don't have enough data to estimate how big a correction to the sampling is needed to correct the bias). If you have lots of data, the only reason to resample is because operational class frequencies are different to those in the training set or different misclassification costs etc. (if either are unknown or variable, your really ought to use a probabilistic classifier).

This is mostly a stub, I hope to be able to add more to it later.

Thank you, I am looking forward to your expanding this. If I understand you correctly, the class imbalance problem you see is high variance of parameter estimates, right? It seems to me that oversampling etc. would not address this, correct? — Stephan Kolassa, Jan 05 '22 at 14:00
@StephanKolassa in principle it can, but you need to know the right amount of oversampling to apply (or equivalently reweighting or threshold adjustment), which is going to be difficult if you don't have enough data to estimate the model in the first place. King takes the threshold adjustment approach, as it can be analytically approximated for logistic regression, but it is not clear it has great practical utility. — Dikran Marsupial, Jan 05 '22 at 14:05
@StephanKolassa I think I didn't give a very direct answer to your question. I think what is happening is that the variance in the parameter estimates causes the undue bias against the minority class because of the structure of the problem. Reducing the variance ought to reduce the bias, but resampling (or better re-weighting the data in the loss) can address the bias directly. However, it will be difficult to estimate how much regularisation or re-sampling/re-weighting is required in practice. — Dikran Marsupial, Jan 08 '22 at 09:26
I just finished reading the King & Zeng paper, thank you for drawing my attention to it! It was most illuminating, and I have learned something today. (To add to your comments, they add one additional reason in favor of nonrandom sampling: the costs of collecting data.) — Stephan Kolassa, Jan 09 '22 at 13:09
My intuition is that logistic regression is likely to be more robust to this sort of bias than most, and that some classifiers may be more susceptible to problems in practical applications (but still with small datasets) and that is perhaps why there is some perception that class imbalance is a problem. — Dikran Marsupial, Jan 09 '22 at 13:15
@DikranMarsupial Have you ever seen Wallace & Dahabreh (IEEE 2012) Class Probability Estimates are Unreliable for Imbalanced Data (and How to Fix Them)? It's behind a paywall, but their equation #6 seems to align with your idea that the bias decreases as the sample size gets large. — Dave, Nov 27 '23 at 21:28
@Dave, yes, I think it is along similar lines to King and Zheng. However when I tried this, I had a lot of problems with the univariate logistic regression being perfectly separable, which tended to make a bias in the other direction. There is another paper (but I can't remember the details) where a similar bias correction was applied to kernel logistic regression. — Dikran Marsupial, Nov 28 '23 at 10:48

score 10 · Answer 2 · edited Apr 11 '23 at 05:19

I generally agree with your premise that there is an over-fixation on balancing classes, and that it is usually not necessary to do so. Your examples of when it is appropriate to do so are goods ones.

However, I disagree with your statement:

I conclude that unbalanced classes are not a problem, and that oversampling does not alleviate this non-problem, but gratuitously introduces bias and worse predictions.

The problem in your predictions is not the oversampling procedure, it is the failure to correct for the fact that the base-rate for positives in the "over-sampled" (50/50) regression is 50%, while in the data it is closer to 2%.

Following King and Zeng ("Logistic Regression in Rare Events Data", 2001, Political Analysis, PDF here), let the population base rate be given by $\tau$. We estimate $\tau$ as the proportion of positives in the training sample: $$ \tau = \frac{1}{N}\sum_{i=1}^N y_i $$ And let $\bar{y}$ be the proportion of positives in the over-sampled set, $\bar{y}=0.5$. This is by construction since you use a balanced 50/50 sample in the over-sampled regression.

Then, after using the predict command to generate predicted probabilities $P(y|x,d)$ we adjust these probabilities using the formula in King and Zeng, appendix B.2 to find the probability under the population base rate. This probability is given by $P(y=1|x,d)A_1B$. In the case of two classes: $$ P(y=1|x,d)A_1B = \frac{P(y=1|x,d) \frac{\tau}{\bar{y}}}{P(y=1|x,d) \frac{\tau}{\bar{y}} + P(y=0|x,d) \frac{1-\tau}{1-\bar{y}}} $$ Since $\bar{y}=0.5$ this simplifies to: $$ P(y=1|x,d)A_1B = \frac{P(y=1|x,d) \tau}{P(y=1|x,d) \tau + P(y=0|x,d) (1-\tau)} $$

Modifying your code in the relevant places, we now have very similar Brier scores between the two approaches, despite the fact that the over-sampled training sample uses an order of magnitude less data than the raw training sample (in most cases, roughly 450 data points vs. 10,000).

So, in this Monte Carlo study, we see that balancing the training sample does not harm predictive accuracy (as judged by Brier score), but it also does not provide any meaningful increase in accuracy. The only benefit of balancing the training sample in this particular application is to reduce the computational burden of estimating the binary predictor. In the present case, we only need ~450 data points instead of 10,000. The reduction in computational burden would be much more substantial if we were dealing with millions of observations in the raw data.

The modified code is given below:

library(randomForest)
library(beanplot)
nn_train <- nn_test <- 1e4
n_sims <- 1e2
true_coefficients <- c(-7, 5, rep(0, 9))
incidence_train <- rep(NA, n_sims)
model_logistic_coefficients <- 
  model_logistic_oversampled_coefficients <- 
  matrix(NA, nrow=n_sims, ncol=length(true_coefficients))
brier_score_logistic <- brier_score_logistic_oversampled <- 
  brier_score_randomForest <- 
  brier_score_randomForest_oversampled <- 
  rep(NA, n_sims)
pb <- txtProgressBar(max=n_sims)
for ( ii in 1:n_sims ) {
  setTxtProgressBar(pb,ii,paste(ii,"of",n_sims))
  set.seed(ii)
  while ( TRUE ) {    # make sure we even have the minority 
    # class
    predictors_train <- matrix(
      runif(nn_train(length(true_coefficients) - 1)), 
      nrow=nn_train)
    logit_train <- 
      cbind(1, predictors_train)%%true_coefficients
    probability_train <- 1/(1+exp(-logit_train))
    outcome_train <- factor(runif(nn_train) <= 
                              probability_train)
    if ( sum(incidence_train[ii] <- 
             sum(outcome_train==TRUE))>0 ) break
  }
  dataset_train <- data.frame(outcome=outcome_train, 
                              predictors_train)
index <- c(which(outcome_train==TRUE),

             sample(which(outcome_train==FALSE),

                    sum(outcome_train==TRUE)))
model_logistic <- glm(outcome~., dataset_train, 
                        family="binomial")
  model_logistic_oversampled <- glm(outcome~., 
                                    dataset_train[index, ], family="binomial")
model_logistic_coefficients[ii, ] <- 
    coefficients(model_logistic)
  model_logistic_oversampled_coefficients[ii, ] <- 
    coefficients(model_logistic_oversampled)
model_randomForest <- randomForest(outcome~., dataset_train)
  model_randomForest_oversampled <- 
    randomForest(outcome~., dataset_train, subset=index)
predictors_test <- matrix(runif(nn_test * 
                                    (length(true_coefficients) - 1)), nrow=nn_test)
  logit_test <- cbind(1, predictors_test)%*%true_coefficients
  probability_test <- 1/(1+exp(-logit_test))
  outcome_test <- factor(runif(nn_test)<=probability_test)
  dataset_test <- data.frame(outcome=outcome_test, 
                             predictors_test)
prediction_logistic <- predict(model_logistic, dataset_test, 
                                 type="response")
  brier_score_logistic[ii] <- mean((prediction_logistic - 
                                      (outcome_test==TRUE))^2)
prediction_logistic_oversampled <-

    predict(model_logistic_oversampled, dataset_test, 
            type="response")
Adjust probabilities based on appendix B.2 in King and Zeng (2001)
p1_tau1 = prediction_logistic_oversampled(incidence_train[ii]/nn_train)
  p0_tau0 = (1-prediction_logistic_oversampled)(1-incidence_train[ii]/nn_train)
  prediction_logistic_oversampled_adj <- p1_tau1/(p1_tau1+p0_tau0)
brier_score_logistic_oversampled[ii] <- 
    mean((prediction_logistic_oversampled_adj - 
            (outcome_test==TRUE))^2)
prediction_randomForest <- predict(model_randomForest, 
                                     dataset_test, type="prob")
  brier_score_randomForest[ii] <-
    mean((prediction_randomForest[,2]-(outcome_test==TRUE))^2)
prediction_randomForest_oversampled <-

    predict(model_randomForest_oversampled, 
            dataset_test, type="prob")
Adjust probabilities based on appendix B.2 in King and Zeng (2001)
p1_tau1 = prediction_randomForest_oversampled(incidence_train[ii]/nn_train)
  p0_tau0 = (1-prediction_randomForest_oversampled)(1-incidence_train[ii]/nn_train)
  prediction_randomForest_oversampled_adj <- p1_tau1/(p1_tau1+p0_tau0)
brier_score_randomForest_oversampled[ii] <- 
    mean((prediction_randomForest_oversampled_adj[, 2] - 
            (outcome_test==TRUE))^2)
}
close(pb)
hist(incidence_train, breaks=seq(min(incidence_train)-.5, 
                                 max(incidence_train) + .5),
     col="lightgray",
     main=paste("Minority class incidence out of", 
                nn_train,"training samples"), xlab="")
ylim <- range(c(model_logistic_coefficients, 
                model_logistic_oversampled_coefficients))
beanplot(data.frame(model_logistic_coefficients),
         what=c(0,1,0,0), col="lightgray", xaxt="n", ylim=ylim,
         main="Logistic regression: estimated coefficients")
axis(1, at=seq_along(true_coefficients),
     c("Intercept", paste("Predictor", 1:(length(true_coefficients) 
                                          - 1))), las=3)
points(true_coefficients, pch=23, bg="red")
beanplot(data.frame(model_logistic_oversampled_coefficients),
         what=c(0, 1, 0, 0), col="lightgray", xaxt="n", ylim=ylim,
         main="Logistic regression (oversampled): estimated 
              coefficients")
axis(1, at=seq_along(true_coefficients),
     c("Intercept", paste("Predictor", 1:(length(true_coefficients) 
                                          - 1))), las=3)
points(true_coefficients, pch=23, bg="red")
beanplot(data.frame(Raw=brier_score_logistic, 
                    Oversampled=brier_score_logistic_oversampled),
         what=c(0,1,0,0), col="lightgray", main="Logistic regression: 
             Brier scores")
beanplot(data.frame(Raw=brier_score_randomForest, 
                    Oversampled=brier_score_randomForest_oversampled),
         what=c(0,1,0,0), col="lightgray", 
         main="Random Forest: Brier scores")

Thank you. Just a note for later readers: Dikran Marsupial also discusses the King & Zeng paper in their answer. — Stephan Kolassa, Apr 11 '23 at 05:19

score 3 · Answer 3 · answered Sep 02 '23 at 03:47

3

One common real world reason you want to up-sample rare outcomes: limited data capacity.

Suppose you have an outcome that only happen with $p = 10^{-3}$. If you have 1000 columns of features available, note that if you don't upweight your positive results, then you need to sample $10^6$ rows of data (which is $10^9$ values, likely to strain most desktops) in order to get 1 positive outcome for every column. With this amount of data, you are very likely to be excessively overfit.

If instead you were to do 50% sample of positives and 50% sample of negatives, you are likely to do much better in finding the features that better differentiate positives and negatives with the same sample size. Of course, you would need to include sampling weights in your likelihood function to remove the bias from your stratified sampling.

answered Sep 02 '23 at 03:47

Cliff AB

20,980

This sounds like the exact argument given in the King and Zeng paper from Marsupial’s answer. Do you see it differently? – Dave Sep 02 '23 at 04:03
@Dave yeah it's very similar to the arguments in King and Zeng. But to clarify, there's also some specification about dimensions. In many industry ML problems, the available dataset is much much more than can be fit into memory especially with a very large number of features. By upsampling rare events, we can be much more efficient with the amount of data we can actually work with at one time. – Cliff AB Sep 02 '23 at 09:24
It sounds to me like the most important part of your answer is at the very end: the need to include weights in the estimation. This is the one aspect that seems to be missing from most questions on up/downsampling we get here, unfortunately. – Stephan Kolassa Sep 06 '23 at 14:18
@StephanKolassa Would that be a type of calibration? – Dave Sep 06 '23 at 14:19
@Dave: hm. The term "calibration" to me suggests that "predictive densities do what they are supposed to do", per Gneiting's use of the term. But I have also seen this term used to refer to a post-processing of predictive densities, e.g., by pushing predicted probabilities from a model through a logistic regression (possibly with a spline transform). What sense are you using the term in here? In any case, these meanings are not really the same as using weights in our estimation, which Cliff seems to advocate here. – Stephan Kolassa Sep 06 '23 at 14:26
I would see "the predictive densities do what they are supposed to do" as the adjective calibrated, while calibation would be the act of adjusting predicted values to better reflect the reality of how frequently events occur so the two-stage model would be calibrated. I think I see weighting after oversampling as a form of calibration. When we oversample, we give too high of a prior probability of the true minority category, so the posterior/predicted probabilities are too high. However, we can do some kind of weighting to calibrate the predictions to account for the oversampling. – Dave Sep 06 '23 at 14:35
@CliffAB Why is it desirable to have one positive outcome for each column? – VKV Jan 17 '24 at 19:28
@VKV It isn't. That means there is one minority-class observation per feature (column), putting the model at a severe risk of overfitting. – Dave Jan 17 '24 at 19:34

JeeyCi · Answer 4 · 2022-07-30T11:52:18.610

I fully agree with a marked answer "No problem of imbalance per se" - the problem is Lack of data - when minority class do not form Gaussian distribution & thus minority class just remains to be the outlier in the overall distribution == this is problem for Unsupervised methods.

But if are making Supervised learning, knowing target classes & their characteristics (features) in advance -- imbalanced ds could not be a problem IF deal it correctly. -- at least each Cross-Validated subsample (e.g. in ensemble bagging & boosting) should include representatives of both (source) classes for comprehensive approximation of decision boundary

Main problem of minority class: "the model cannot model the boundary of these low-density regions well during the learning, resulting in ambiguity and poor generalization". Solution was found in (1) semi-supervised learning method, which generates pseudo-labels on unlabeled data and then train together, OR (2) (when extreme imbalance, as medical) self-supervised pre-training with further main training

So, decrease of IMBALANCE Problem can be achieved by means of programming logics or (better) by increasing qty of samples == even without oversampling (artificial injection of fake minority class samples)

P.S. some hints:

first - "Exactly like we should do feature selection inside the cross validation loop, we should also oversample inside the loop." (source)

second "Re-balancing makes sense only in the training set, so as to prevent the classifier from simply and naively classifying all instances as negative for a perceived accuracy of 99%."

third "when comparing two binary classifiers, the AUC is one of the criteria that should not fooled by the imbalancedness of the data."

forth as alternative - Weighted Cost/Loss Function - "Thanks to the Sklearn, there is a built-in parameter called class_weight in most of the ML algorithms which helps you to balance the contribution of each class." - e.g weighted Sigmoid Cross-Entropy loss for binary CLF

fifth see 8th at link - "be creative"

sixth Threshold moving & Searching optimal value from a grid

I'm afraid I do not understand this answer at all. The unusual punctuation doesn't help, but I suspect I wouldn't follow it even otherwise. Could you edit it to make your point more clearly? — mkt, Jul 30 '22 at 05:54
The point about only resampling the training data is a good one. I have often wondered whether papers promoting SMOTE give a false impression of how well it works because of information leaking from test set to training set via the synthetic examples. We want to know how well the model will work in operational conditions, so the test set should be representative of test conditions (where you won't be using SMOTE) — Dikran Marsupial, Jul 30 '22 at 07:05
I am confused by the comment about the minority class not being Gaussian. — Dave, Jul 30 '22 at 11:07
when exploring rare events & having lack of data - of course you cannot see on KDE-charts normal distribution for both classes, only for majority class. — JeeyCi, Jul 30 '22 at 12:10
But if you cannot even draw your samples && see the decision boundary as is on figure - there is no sence for further automatization of its probabilistic estimation with mathematical tools (such as Bayesian logics). If there is no statistically meaningfull data for your research - Bayes will not give you trustfull probability of boundary between true & false cases. I prefer to see apriory: If not seen in KDE-charts - there is nothing for binary division mathematically further... (suppose rare medical disease) — JeeyCi, Jul 30 '22 at 12:12

StanW · Answer 5 · 2023-04-01T11:36:39.710

Edit to summarize the following arguments and simulations:

I propose that balancing by either over-/undersampling or class weights is an advantage during training of gradient descent models that use sampling procedures during training (i.e. subsampling, bootstraping, minibatches etc., as used in e.g. neural networks and gradient boosting). I propose that this is due to an improved signal to noise ratio of the gradient of the loss function which is explained by:

Improved Signal (larger gradient of the loss function, as suggested by the first simulation)
Reduced noise of the gradient due to sampling in a balanced setting vs. strongly unbalanced (as supported by the second simulation).

Original answer: To make my point I have modified your code to include a "0" (or baseline) model for each run, where the first predictor column is removed, thus retaining only the remaining 9 predictors which have no relationship to the outcome (full code below). In the end I calculate the Brier scores for logistic and randomForest models and compare the differences with the full model. The full code is below. When I now compare the change in Brier score from the "0" models to the full original models (which include predictor 1) I observe:

>     round( quantile( (brier_score_logistic - brier_score_logistic_0)/brier_score_logistic_0), 3)
0%    25%    50%    75%   100% 
-0.048 -0.038 -0.035 -0.032 -0.020 
>     round( quantile( (brier_score_logistic_oversampled - brier_score_logistic_oversampled_0)/brier_score_logistic_oversampled_0),3)
0%    25%    50%    75%   100% 
-0.323 -0.258 -0.241 -0.216 -0.130 
>     round( quantile( (brier_score_randomForest - brier_score_randomForest_0)/brier_score_randomForest_0), 3)
0%    25%    50%    75%   100% 
-0.050 -0.037 -0.032 -0.026 -0.009 
>     round( quantile( (brier_score_randomForest_oversampled - brier_score_randomForest_oversampled_0)/brier_score_randomForest_oversampled_0), 3)
0%    25%    50%    75%   100% 
-0.306 -0.272 -0.255 -0.233 -0.152

What seems clear is that for the same predictor the relative change in the Brier score jumps from a median of around 0.035 in an imbalanced setting to a 0.241 in a balanced setting giving a roughly 7x higher gradient for a predictive model vs. a baseline. Additionally when you look at the absolute Brier scores, the baseline model in an unbalanced setting performs much better than the full model in the balanced setting:

>     round( quantile(brier_score_logistic_0), 5)
0%     25%     50%     75%    100% 
0.02050 0.02363 0.02450 0.02545 0.02753 
>     round( quantile(brier_score_logistic_oversampled), 5)
0%     25%     50%     75%    100% 
0.17576 0.18842 0.19294 0.19916 0.23089

Thus concluding that a smaller Brier is better per se will lead to wrong conclusions if say you are comparing datasets with different predictor or outcome prevalences.

Overall to me there seem to be two advanteges/problems:

Balancing the datasets seems to get you a higher gradient, which should be beneficial for training of gradient descent algorithms (xgboost, neural networks). In my experience without balancing the neural network might just learn to guess the class with the higher probability without learning any data features if the dataset is too unbalanced.
Comparability between different studies/patient populations/biomarkers may benefit from measures which are less sensitive to changes in prevalence such as AUC or C-index or maybe a stratified Brier. As the example shows that a strong imbalance diminishes the difference between a baseline model and a predictive model. This works goes to a similar direction: ieeexplore.ieee.org/document/6413859

Edit: To follow up on the discussion in the comments, which partially concerns the error due to sampling for a model trained on an imbalanced vs. a balanced dataset I used a second small modification to the script (full version 2 of the new script below). In this modification the datasets for the testing of the original predictive models is performed on one test set, while the "0" models are tested on a separate "test_set_new", which is generated using the same code. This represents either a new sample from the same population or a new "batch" or "minibatch" or subset of the data as used for training models with gradient descent. Now the "gradient" of the Brier from a non-predictive to a predictive model seems quite revealing:

>     round( quantile( (brier_score_logistic - brier_score_logistic_0)/brier_score_logistic_0), 3)
0%    25%    50%    75%   100% 
-0.221 -0.100 -0.052  0.019  0.131 
>     round( quantile( (brier_score_logistic_oversampled - brier_score_logistic_oversampled_0)/brier_score_logistic_oversampled_0),3)
0%    25%    50%    75%   100% 
-0.318 -0.258 -0.242 -0.215 -0.135 
>     
  >     round( quantile( (brier_score_randomForest - brier_score_randomForest_0)/brier_score_randomForest_0), 3)
0%    25%    50%    75%   100% 
-0.213 -0.092 -0.046  0.020  0.127 
>     round( quantile( (brier_score_randomForest_oversampled - brier_score_randomForest_oversampled_0)/brier_score_randomForest_oversampled_0), 3)
0%    25%    50%    75%   100% 
-0.304 -0.273 -0.255 -0.232 -0.155 
>     round( mean(brier_score_logistic>brier_score_logistic_0), 3)
[1] 0.31
>     round( mean(brier_score_randomForest>brier_score_randomForest_0), 3)
[1] 0.33

So now in 31-33% of simulations for imbalanced models the Brier score of "0" model is "better" (smaller) than the score of the predictive model, despite a sample size of 10,000! While for models trained on balanced data the gradient of the Brier is consistently in the right direction (predictive models lower than "0" models). This seems to me to be quite clearly due to the sampling variability in the imbalanced setting, where even small variations (individual observation) result in a much stronger variability in performance (as observed above the overall Brier is more strongly affected by prevalence than by actual predictors when trained on an imbalanced dataset). As discussed below I expect that this may strongly affect any sampling approaches during gradient descent training (minibatch, subsampling, etc.), while when using the exactly same dataset during each epoch the effect may be less prominent.

The modified version of OP's code:

library(randomForest)
library(beanplot)
nn_train <- nn_test <- 1e4
n_sims <- 1e2
true_coefficients <- c(-7, 5, rep(0, 9))
incidence_train <- rep(NA, n_sims)
model_logistic_coefficients <- 
  model_logistic_oversampled_coefficients <- 
  matrix(NA, nrow=n_sims, ncol=length(true_coefficients))
brier_score_logistic <- brier_score_logistic_oversampled <- 
  brier_score_logistic_0 <-
  brier_score_logistic_oversampled_0 <- 
  brier_score_randomForest <- 
  brier_score_randomForest_oversampled <- 
  brier_score_randomForest_0 <- 
  brier_score_randomForest_oversampled_0 <- 
  rep(NA, n_sims)
#pb <- winProgressBar(max=n_sims)
for ( ii in 1:n_sims ) {
  print(ii)#setWinProgressBar(pb,ii,paste(ii,"of",n_sims))
  set.seed(ii)
  while ( TRUE ) {    # make sure we even have the minority 
    # class
    predictors_train <- matrix(
      runif(nn_train(length(true_coefficients) - 1)), 
      nrow=nn_train)
    logit_train <- 
      cbind(1, predictors_train)%%true_coefficients
    probability_train <- 1/(1+exp(-logit_train))
    outcome_train <- factor(runif(nn_train) <= 
                              probability_train)
    if ( sum(incidence_train[ii] <- 
             sum(outcome_train==TRUE))>0 ) break
  }
  dataset_train <- data.frame(outcome=outcome_train, 
                              predictors_train)
index <- c(which(outcome_train==TRUE),

             sample(which(outcome_train==FALSE),

                    sum(outcome_train==TRUE)))
model_logistic <- glm(outcome~., dataset_train, 
                        family="binomial")
  model_logistic_0 <- glm(outcome~., dataset_train[,-2], 
                        family="binomial")
  model_logistic_oversampled <- glm(outcome~., 
                                    dataset_train[index, ], family="binomial")
  model_logistic_oversampled_0 <- glm(outcome~., 
                                    dataset_train[index, -2], family="binomial")
model_logistic_coefficients[ii, ] <- 
    coefficients(model_logistic)
  model_logistic_oversampled_coefficients[ii, ] <- 
    coefficients(model_logistic_oversampled)
model_randomForest <- randomForest(outcome~., dataset_train)
  model_randomForest_0 <- randomForest(outcome~., dataset_train[,-2])
model_randomForest_oversampled <- 
    randomForest(outcome~., dataset_train, subset=index)
  model_randomForest_oversampled_0 <- 
    randomForest(outcome~., dataset_train[,-2], subset=index)
predictors_test <- matrix(runif(nn_test * 
                                    (length(true_coefficients) - 1)), nrow=nn_test)
  logit_test <- cbind(1, predictors_test)%*%true_coefficients
  probability_test <- 1/(1+exp(-logit_test))
  outcome_test <- factor(runif(nn_test)<=probability_test)
  dataset_test <- data.frame(outcome=outcome_test, 
                             predictors_test)
prediction_logistic <- predict(model_logistic, dataset_test, 
                                 type="response")
  brier_score_logistic[ii] <- mean((prediction_logistic - 
                                      (outcome_test==TRUE))^2)
  prediction_logistic_0 <- predict(model_logistic_0, dataset_test[,-2], 
                                 type="response")
  brier_score_logistic_0[ii] <- mean((prediction_logistic_0 - 
                                      (outcome_test==TRUE))^2)
prediction_logistic_oversampled <-

    predict(model_logistic_oversampled, dataset_test, 
            type="response")
  brier_score_logistic_oversampled[ii] <- 
    mean((prediction_logistic_oversampled - 
            (outcome_test==TRUE))^2)
  prediction_logistic_oversampled_0 <-

    predict(model_logistic_oversampled_0, dataset_test[,-2], 
            type="response")
  brier_score_logistic_oversampled_0[ii] <- 
    mean((prediction_logistic_oversampled_0 - 
            (outcome_test==TRUE))^2)
prediction_randomForest <- predict(model_randomForest, 
                                     dataset_test, type="prob")
  brier_score_randomForest[ii] <-
    mean((prediction_randomForest[,2]-(outcome_test==TRUE))^2)
prediction_randomForest_0 <- predict(model_randomForest_0, 
                                     dataset_test[,-2], type="prob")
  brier_score_randomForest_0[ii] <-
    mean((prediction_randomForest_0[,2]-(outcome_test==TRUE))^2)
prediction_randomForest_oversampled <-

    predict(model_randomForest_oversampled, 
            dataset_test, type="prob")
  brier_score_randomForest_oversampled[ii] <- 
    mean((prediction_randomForest_oversampled[, 2] - 
            (outcome_test==TRUE))^2)
prediction_randomForest_oversampled_0 <-

    predict(model_randomForest_oversampled_0, 
            dataset_test, type="prob")
  brier_score_randomForest_oversampled_0[ii] <- 
    mean((prediction_randomForest_oversampled_0[, 2] - 
            (outcome_test==TRUE))^2)
}
#close(pb)
quantile( (brier_score_logistic - brier_score_logistic_0)/brier_score_logistic_0)
quantile( (brier_score_logistic_oversampled - brier_score_logistic_oversampled_0)/brier_score_logistic_oversampled_0)
quantile( (brier_score_randomForest - brier_score_randomForest_0)/brier_score_randomForest_0)
quantile( (brier_score_randomForest_oversampled - brier_score_randomForest_oversampled_0)/brier_score_randomForest_oversampled_0)

Version 2:

library(randomForest)
library(beanplot)
nn_train <- nn_test <- 1e4
n_sims <- 1e2
true_coefficients <- c(-7, 5, rep(0, 9))
incidence_train <- rep(NA, n_sims)
model_logistic_coefficients <- 
  model_logistic_oversampled_coefficients <- 
  matrix(NA, nrow=n_sims, ncol=length(true_coefficients))
brier_score_logistic <- brier_score_logistic_oversampled <- 
  brier_score_logistic_0 <-
  brier_score_logistic_oversampled_0 <- 
  brier_score_randomForest <- 
  brier_score_randomForest_oversampled <- 
  brier_score_randomForest_0 <- 
  brier_score_randomForest_oversampled_0 <- 
  rep(NA, n_sims)
#pb <- winProgressBar(max=n_sims)
for ( ii in 1:n_sims ) {
  print(ii)#setWinProgressBar(pb,ii,paste(ii,"of",n_sims))
  set.seed(ii)
  while ( TRUE ) {    # make sure we even have the minority 
    # class
    predictors_train <- matrix(
      runif(nn_train(length(true_coefficients) - 1)), 
      nrow=nn_train)
    logit_train <- 
      cbind(1, predictors_train)%%true_coefficients
    probability_train <- 1/(1+exp(-logit_train))
    outcome_train <- factor(runif(nn_train) <= 
                              probability_train)
    if ( sum(incidence_train[ii] <- 
             sum(outcome_train==TRUE))>0 ) break
  }
  dataset_train <- data.frame(outcome=outcome_train, 
                              predictors_train)
index <- c(which(outcome_train==TRUE),

             sample(which(outcome_train==FALSE),

                    sum(outcome_train==TRUE)))
model_logistic <- glm(outcome~., dataset_train, 
                        family="binomial")
  model_logistic_0 <- glm(outcome~., dataset_train[,-2], 
                        family="binomial")
  model_logistic_oversampled <- glm(outcome~., 
                                    dataset_train[index, ], family="binomial")
  model_logistic_oversampled_0 <- glm(outcome~., 
                                    dataset_train[index, -2], family="binomial")
model_logistic_coefficients[ii, ] <- 
    coefficients(model_logistic)
  model_logistic_oversampled_coefficients[ii, ] <- 
    coefficients(model_logistic_oversampled)
model_randomForest <- randomForest(outcome~., dataset_train)
  model_randomForest_0 <- randomForest(outcome~., dataset_train[,-2])
model_randomForest_oversampled <- 
    randomForest(outcome~., dataset_train, subset=index)
  model_randomForest_oversampled_0 <- 
    randomForest(outcome~., dataset_train[,-2], subset=index)
predictors_test <- matrix(runif(nn_test * 
                                    (length(true_coefficients) - 1)), nrow=nn_test)
  logit_test <- cbind(1, predictors_test)%*%true_coefficients
  probability_test <- 1/(1+exp(-logit_test))
  outcome_test <- factor(runif(nn_test)<=probability_test)
  dataset_test <- data.frame(outcome=outcome_test, 
                             predictors_test)
prediction_logistic <- predict(model_logistic, dataset_test, 
                                 type="response")
  brier_score_logistic[ii] <- mean((prediction_logistic - 
                                      (outcome_test==TRUE))^2)
prediction_logistic_oversampled <-

    predict(model_logistic_oversampled, dataset_test, 
            type="response")
  brier_score_logistic_oversampled[ii] <- 
    mean((prediction_logistic_oversampled - 
            (outcome_test==TRUE))^2)
prediction_randomForest <- predict(model_randomForest, 
                                     dataset_test, type="prob")
  brier_score_randomForest[ii] <-
    mean((prediction_randomForest[,2]-(outcome_test==TRUE))^2)
prediction_randomForest_oversampled <-

    predict(model_randomForest_oversampled, 
            dataset_test, type="prob")
  brier_score_randomForest_oversampled[ii] <- 
    mean((prediction_randomForest_oversampled[, 2] - 
            (outcome_test==TRUE))^2)
#sampling another testing dataset for "0" model
  predictors_test <- matrix(runif(nn_test * 
                                    (length(true_coefficients) - 1)), nrow=nn_test)
  logit_test <- cbind(1, predictors_test)%*%true_coefficients
  probability_test <- 1/(1+exp(-logit_test))
  outcome_test <- factor(runif(nn_test)<=probability_test)
  dataset_test_new <- data.frame(outcome=outcome_test, 
                             predictors_test)
prediction_logistic_0 <- predict(model_logistic_0, dataset_test_new[,-2], 
                                   type="response")
  brier_score_logistic_0[ii] <- mean((prediction_logistic_0 - 
                                        (outcome_test==TRUE))^2)
  prediction_logistic_oversampled_0 <-

    predict(model_logistic_oversampled_0, dataset_test_new[,-2], 
            type="response")
  brier_score_logistic_oversampled_0[ii] <- 
    mean((prediction_logistic_oversampled_0 - 
            (outcome_test==TRUE))^2)
prediction_randomForest_0 <- predict(model_randomForest_0, 
                                       dataset_test_new[,-2], type="prob")
  brier_score_randomForest_0[ii] <-
    mean((prediction_randomForest_0[,2]-(outcome_test==TRUE))^2)
prediction_randomForest_oversampled_0 <-

    predict(model_randomForest_oversampled_0, 
            dataset_test_new, type="prob")
  brier_score_randomForest_oversampled_0[ii] <- 
    mean((prediction_randomForest_oversampled_0[, 2] - 
            (outcome_test==TRUE))^2)
}
#close(pb)
round( quantile( (brier_score_logistic - brier_score_logistic_0)/brier_score_logistic_0), 3)
round( quantile( (brier_score_logistic_oversampled - brier_score_logistic_oversampled_0)/brier_score_logistic_oversampled_0),3)
round( quantile( (brier_score_randomForest - brier_score_randomForest_0)/brier_score_randomForest_0), 3)
round( quantile( (brier_score_randomForest_oversampled - brier_score_randomForest_oversampled_0)/brier_score_randomForest_oversampled_0), 3)

Hm. (1) Since we calculate Brier scores on different datasets by definition of oversampling, it is not surprising that relative differences in Brier scores differ between oversampled and non-oversampled datasets. I don't quite see how this is an advantage. (2) For the same reason, the Brier scores between the oversampled and the non-oversampled datasets are not comparable, so I don't quite see why it would be good that it is lower for the oversampled dataset. — Stephan Kolassa, Mar 20 '23 at 10:53
You write in the post: 'In each case, the predictive distributions derived from the full dataset are much better than those derived from an oversampled one.' What is that base on? I thought you mean either because the Brier is smaller or because of the smaller spread. To the first I counter and agree the absolute Brier doesn't matter. To the second: the reduced spread might actually be a weakness, because it diminishes the distinction between a better and worse model. When the dataset is exactly the same on each run it might not matter that much, but what about for example the minibatch method — StanW, Mar 20 '23 at 10:58
The predictive distribution is better in the sense that it is better calibrated, i.e., it gives a better probabilistic prediction of the true probability of a new instance to belong to the target class. The Brier score is simply a tool to arrive at well-calibrated probabilistic predictions. This is (another reason) why I am not so much interested in lowering Brier scores through oversampling - if the end result is miscalibrated, then the lower Brier scores are misleading. That said, I will try to digest your post more fully when I find the time. — Stephan Kolassa, Mar 20 '23 at 12:02
@StephanKolassa I do wonder if oversampling can help with the computational aspects (especially when it comes to deep learning), and then we can calibrate later. — Dave, Mar 20 '23 at 12:10
@Dave: that is indeed a possibility. It's just that the Brier score is so extremely well-behaved (it's quadratic - it won't get much politer) that I have trouble grasping whether a change in the gradient makes the numerics so much easier that it outweighs the later efforts we may need to recalibrate predictions. — Stephan Kolassa, Mar 20 '23 at 12:13
@Dave: agree. Just to mention class or observation weights as used for e.g. xgboost may work in a similar way by making individual class contributions to the total loss equal. — StanW, Mar 21 '23 at 19:32
"without balancing the neural network might just learn to guess the class with the higher probability without learning any data features if the dataset is too unbalanced." it really shouldn't do that with appropriate training parameters. This default classification can be implemented just using the output layer biases, which are close to the error signal, and so should be learned very quickly, but it should still be possible to reduce the training criterion further by modifying the inter-layer weights if there actually is a reliable relationship with the attributes. — Dikran Marsupial, Mar 29 '23 at 14:56
And if your features really are not predictive of the outcome, then the model is doing what it is supposed to by favoring the majority class most of the time. — Dave, Mar 29 '23 at 15:23
@Dikran Marsupial and Dave: As seen in the second box, the contribution of the imbalance to the absolute Brier is much larger than the contribution of the predictor (as the "0" model has a much lower Brier than the Model with predictor in the balanced setting). Sampling variation in the unbalanced minibatches will thus result in larger gradient than any predictor fitting. Thus I suggest that your comment only holds true if you are using the exact same imbalanced dataset for gradient computation in each epoch, however in SOTA neural network learning I've never seen that done. — StanW, Mar 29 '23 at 16:57
I have performed the simulation and edited the answer accordingly to support my statement. — StanW, Mar 29 '23 at 18:11
@StanW yes, of course the imbalance is going to have a bigger impact on the Brier score, but that doesn't mean that the rest of the neural network isn't going to extract what useful information there is in the attributes (even if it doesn't make a big difference to the Brier score). So if the NN is still finding a near optimal solution, the imbalance isn't a big issue (IMHO the main issue is that people don't think enough about the relevance of their performance metric to the application at hand). — Dikran Marsupial, Mar 29 '23 at 18:35
BTW, why use minibatches for problems as small as this, just compute the gradient - better to get the right answer slowly rather than the wrong answer quickly. By definition it isn't a SOTA neural network if it is outperformed by a conventional backprop MLP. Neural nets have a long history, but unfortunately much of it is lost in each hype-bust cycle, which is a pity. — Dikran Marsupial, Mar 29 '23 at 18:36
@Dikran Marsupial: I'd have to go back and re-read what the advantages of the batch-methods are, but it's not just neural networks, most ensemble approaches use bootstraps or subsamples of the data in each iteration (xgboost, random forest, stability selection etc.). And as the simulation clearly shows when training on imbalanced data the variation due to sampling may lead to the gradient pointing into the opposite direction (here 30% of iterations!). Basically the signal-to-noise ratio of the gradient is on the side of noise. So there may be a strong rationale for oversampling or class weight — StanW, Mar 29 '23 at 19:17
@StanW the batch idea is useful when dealing with very large amounts of data, but it can also speed up the initial stages of training (you don't need much data to give a good idea of where "downhill" is at the early stages). The stochastic aspect can also help with local minima, but it can also prevent you from converging to good minima as well. There is only a strong rationale for oversampling here if you are training the network in a manner that is unsuited to the problem (which is why I mentioned the learning parameters). If the SNR is too low, use larger and larger batches. — Dikran Marsupial, Mar 30 '23 at 06:36
@Dikran Marsupial: I think the OP‘s question is meant in general and not as a specific example. So you suggest in general to not use any sampling methods (batches, bootstrap, subsampling) during training when the data is unbalanced as opposed to using class weights or over/undersampling and recalibrating afterwards? Also larger batches do not seem to help if they are sampled (in the example above the "batch" is 10,000 observations!) — StanW, Mar 30 '23 at 10:00
@StanW no, of course don't recommend that, as I said batch training can be useful in reducing computatonal expense and in avoiding local minima early in training. However to use neural networks successfully you need to learn to diagnose when problems arise and adapt to them. As I say in my answer most of the time class imbalance is not a problem and doesn't require any action on those grounds. The real problem is they tend also to be cost sensitive learning problems but a lot of practitioners don't think about that and automatically reach for SMOTE etc. — Dikran Marsupial, Mar 30 '23 at 11:19
I used to think that resampling/re-weighting and then post processing the probability estimates might be worthwhile for some models, where there is imbalance, but I no longer think that is the case (at least where there are sufficient data). — Dikran Marsupial, Mar 30 '23 at 11:20
There has already been some good discussion of how we got into this mess. I think the root cause is a misuse of the word classification whereby some computer scientist made the leap that if Y is binary (i.e., has already been classified) that predictions should also be classified. This is almost never a good idea. Binary Y should lead to estimates of tendencies of Y (i.e., probabilities). This is no more true than in the rare event (imbalanced) case. — Frank Harrell, Sep 02 '23 at 12:08