There is indeed a bias with logistic regression and maximum likelihood estimation when the classes are not equally represented. Below is a demonstration with a small simulation (an example in the literature is here):
set.seed(1)
sim = function(n1 = 500, n2 = 20) {
  # data: class 0 centred at -1, class 1 centred at +1
  x1 = rnorm(n1, -1, 1)
  x2 = rnorm(n2, 1, 1)
  x = c(x1, x2)
  y = c(rep(0, n1), rep(1, n2))
  # model, plus correcting the intercept for the class imbalance / undersampling
  mod = glm(y ~ x, family = binomial())
  coefficients(mod) + c(log(n1/n2), 0)
}
sims = replicate(10000, sim())
layout(matrix(1:2))
hist(sims[1,], main = "intercept estimate should be 0")
hist(sims[2,], main = "slope estimate should be 2")
rowMeans(sims)
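As a side note on what the corrected coefficients target: with the data-generating process in this simulation, Bayes' rule gives the true log-odds
$$\log\frac{P(y=1\mid x)}{P(y=0\mid x)} = \log\frac{n_2}{n_1} + \frac{(x+1)^2-(x-1)^2}{2} = \log\frac{n_2}{n_1} + 2x$$
so the true intercept is $\log(n_2/n_1)$ and the true slope is 2. After adding $\log(n_1/n_2)$ to the fitted intercept, an unbiased estimator would centre the histograms at 0 and 2, and the simulation shows that the estimates deviate from those values.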

Possibly you could prove this mathematically. Intuitively, it is not surprising that maximum likelihood estimates, which are not designed to be unbiased, turn out to be biased. If the penalties for false positives and false negatives are different, then it might be good to add some bias.
In the case of a contingency table we will also get a bias
$$\begin{array}{c|cc}
 & x=0 & x=1 \\
\hline
y = 0 & 1-p_0 & 1-p_1 \\
y = 1 & p_0 & p_1
\end{array}$$
The bias does not occur in the estimates of $p_0$ and $p_1$ themselves; that is like estimating the parameters of two independent Bernoulli distributions, and those estimates are unbiased. However, the slope parameter of the curve that we draw through the points will be biased:
$$\hat{\beta}_{slope} = \log\left(\frac{\hat{p}_1}{1-\hat{p}_1}\right) - \log\left(\frac{\hat{p}_0}{1-\hat{p}_0}\right)$$
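Because the logit is a nonlinear transformation, the unbiasedness of $\hat{p}_0$ and $\hat{p}_1$ does not carry over to their log-odds. A minimal sketch of this (the cell probabilities and the per-cell sample size are illustrative choices, not values from the text above):

set.seed(1)
p0 = 0.3; p1 = 0.7   # illustrative true cell probabilities
n  = 50              # illustrative number of observations per x-value
sim_tab = replicate(10000, {
  p0_hat = mean(rbinom(n, 1, p0))
  p1_hat = mean(rbinom(n, 1, p1))
  c(p0_hat, p1_hat, qlogis(p1_hat) - qlogis(p0_hat))
})
rowMeans(sim_tab)          # p0_hat and p1_hat average close to 0.3 and 0.7
qlogis(p1) - qlogis(p0)    # true slope, compare with the third entry above

The averages of $\hat{p}_0$ and $\hat{p}_1$ match the true values, while the average slope estimate deviates slightly but systematically from the true slope; the deviation grows when the cell counts are small or the probabilities are close to 0 or 1.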
Also, it could be a form of regression attenuation (regression dilution): the bias of a regression slope towards zero when the x-values contain measurement error. When one of the categories is rare (odds far from 1), this attenuation will be stronger.
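For comparison, the classical attenuation effect in linear regression can be sketched like this (the slope, noise level and sample size are illustrative choices):

set.seed(1)
x_true = rnorm(10000)                    # latent predictor
y = 2 * x_true + rnorm(10000)            # true slope is 2
x_obs = x_true + rnorm(10000, sd = 1)    # observed predictor with measurement error
coef(lm(y ~ x_true))["x_true"]           # close to 2
coef(lm(y ~ x_obs))["x_obs"]             # attenuated towards 0, roughly 2 * 1/(1+1) = 1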
I imagine a stronger bias occurring when you have complex patterns and the model is better able to learn one class than the other.
Another type of bias that relates to this training difference is discussed here: Was Amazon's AI tool, more than human recruiters, biased against women?
Some recruiter tool might select the best men and women based on some criteria. But due to the underrepresentation of women, they might also be underrepresented among the candidates who score high on those criteria. Then you could have a situation as in the image below: even when, on average, women might more often be in the class of good candidates, it is mostly men that end up with a high class probability, because the model trained on men and became very good at separating them.
