
I was wondering if anyone could help me understand resampling for class imbalance. From what I have learned, class imbalance is essentially a small-data problem: the less prevalent class is not observed often enough to inform a model about how to separate it. If the number of observations were large enough that the minority class had sufficient density, I imagined that none of these resampling methods would help much. However, many blog posts and fellow Data Scientists insist on resampling.

If anything, I am concerned that resampling the minority class will bias the model too heavily toward the observations we happen to have. As an extreme example, imagine having just one minority-class point. That single point is unlikely to represent the whole region the minority class can occupy, but if you were to upsample it, say, 1000x, the model may end up far too confident in that specific area (and not confident enough elsewhere?).
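Just to illustrate that worry with a toy sketch (separate from the simulation below, numbers made up): one minority observation duplicated 1000x makes the fitted model extremely confident around that single point.

import numpy as np
from sklearn.linear_model import LogisticRegression

X_maj = np.random.normal(0, 1, size=(500, 1))  # majority class scattered around 0
X_min = np.repeat([[2.0]], 1000, axis=0)       # one observed minority point, duplicated 1000x
X_toy = np.vstack((X_maj, X_min))
y_toy = np.concatenate((np.zeros(500), np.ones(1000)))

clf = LogisticRegression().fit(X_toy, y_toy)
print(clf.predict_proba([[2.0]]))  # close to certainty for the minority class at that point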

So here is my Python code on resampling. I did it pretty roughly, so any comments that make it more faithful to the theoretical problem are appreciated. I didn't run the full gamut of resampling methods because I think they are all fairly similar in how they apply the theory. I am using the ROC AUC as a gauge of separation. Upsampling the minority class does not seem to help much, except in the case of very low noise volatility (noise_var == 1, for example), where the region occupied by the observed minority class seems reliable enough to support strong thresholds.

import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.utils import resample
import matplotlib.pyplot as plt

# training

n = 1000

x1 = np.random.gamma(1, size=n)
x2 = np.random.uniform(0, 100, n)
x3 = np.random.normal(0, 3, n)

noise_var = 100
y = 2 + .025*x1 - 5*x2 + 6*x3
y_with_noise = y + np.random.normal(0, noise_var, n)
print('Variance ratios:', np.var(y), np.var(y_with_noise), np.var(y_with_noise)/np.var(y))

p = np.exp(y)/(1 + np.exp(y))
obs = np.random.binomial(1, p, n)

X = np.column_stack((np.ones(n),x1,x2,x3))

model1 = LogisticRegressionCV().fit(X,obs)

print('Observed Training Targets Original:', sum(obs)/len(obs))

# train with upsampling

upsample = obs == 1
upsample_X = X[upsample]

newX = resample(upsample_X, n_samples=sum(obs!=1))

X = np.concatenate((X[obs != 1], newX))
obs = np.concatenate((obs[obs != 1], np.ones(len(newX))))

model2 = LogisticRegressionCV().fit(X,obs)

print('Observed Training Targets Resampled:', sum(obs)/len(obs))

# prediction

n = 1000000

x1 = np.random.gamma(1, size=n)
x2 = np.random.uniform(0, 100, n)
x3 = np.random.normal(0, 3, n)

y = 2 + .025*x1 - 5*x2 + 6*x3 + np.random.normal(0, noise_var, n)

p = np.exp(y)/(1 + np.exp(y))
obs = np.random.binomial(1, p, n)

print('Observed Holdout Targets:',sum(obs), sum(obs)/1e6)

X = np.column_stack((np.ones(n),x1,x2,x3))

prediction = model1.predict(X)
prediction_prob = model1.predict_proba(X)[:, 1]

fpr1, tpr1, _ = roc_curve(obs, prediction_prob)

print('ROC AUC plain:',roc_auc_score(obs, prediction_prob))

prediction = model2.predict(X)
prediction_prob = model2.predict_proba(X)[:, 1]

fpr2, tpr2, _ = roc_curve(obs, prediction_prob)

print('ROC AUC resampled:',roc_auc_score(obs, prediction_prob))

plt.plot(fpr1, tpr1, ':', label='original')
plt.plot(fpr2, tpr2, '--', label='resample')
plt.legend()
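One follow-up check I am considering (just a sketch, not run above): since upsampling mostly shifts the fitted intercept, I would expect the resampled model's average predicted probability to overshoot the true holdout prevalence even when the ROC AUC barely moves.

# calibration-in-the-large: mean predicted probability vs. observed prevalence
print('Observed holdout prevalence:', obs.mean())
print('Mean predicted prob (original): ', model1.predict_proba(X)[:, 1].mean())
print('Mean predicted prob (resampled):', model2.predict_proba(X)[:, 1].mean())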

  • I think this is a much better place to post than Data Science, but please delete that if you’re going to post here. – Dave Sep 18 '22 at 01:57
  • I will do so. I wasn't sure which would be better because it's a common DS idea I feel I find online. – dzheng1887 Sep 18 '22 at 01:59
  • As to why data scientists think oversampling is useful, see the discussion in the comments at the proposed duplicate. Short version: oversampling looks like a "solution" to a "problem" that actually comes from using inappropriate quality measures. – Stephan Kolassa Sep 18 '22 at 06:58
  • Unfortunately a lot of misinformation is spread via blogs where practitioners are giving recipes for data science problems (usually with python code), which are often applied without attempts to diagnose problems or properly check the solution actually worked in practice. Any data scientist with a method to diagnose when imbalance is actually a problem please answer my question here: https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance ... – Dikran Marsupial Sep 18 '22 at 08:32
  • ... or if they can give a concrete example where resampling improves accuracy, give it here: https://stats.stackexchange.com/questions/559294/are-there-imbalanced-learning-problems-where-re-balancing-re-weighting-demonstra . The lack of substantive answers to these questions (even when a modest bonus was on offer) speaks volumes. – Dikran Marsupial Sep 18 '22 at 08:34
  • I disagree about accuracy being an inappropriate performance metric. Sometimes it is the quantity of interest for your application. However even then it is not necessarily the best model comparison/selection criterion. Performance evaluation and model selection are not the same thing. – Dikran Marsupial Sep 18 '22 at 08:36
  • It might be better to leave the question at the Data Science SE but give a link to the question here. It may be beneficial to have some communication between the two communities on this issue! – Dikran Marsupial Sep 18 '22 at 08:47
  • @DikranMarsupial I’ve been planning (for a year) to post a question in there asking why data science sees class imbalance as a problem when statistics mostly does not. Hopefully I’ll post it some day. I am curious to read their responses. – Dave Sep 18 '22 at 14:33
  • Thank you for all these comments. I thought I was crazy. I get this question asked in interviews all the time too, and just felt like I was missing something. Yes, I am also looking for some first principles reasoning why resampling data helps this problem, because it is not clear to me. – dzheng1887 Sep 18 '22 at 15:47
  • @dzheng1887 resampling can be useful for some classifiers as it is a means of implementing cost-sensitive learning (unequal false-positive/false-negative costs), which is quite common in imbalanced learning tasks. The first thing to do is to ask what the misclassification costs should be for your application. Imbalance can be a problem if there isn't enough data, but mostly it is a cost-sensitive learning issue in disguise (see the weighting sketch after these comments). – Dikran Marsupial Sep 18 '22 at 17:06
  • @Dave do point them to my two questions here, it would be good to see if they have any answers from DS SE – Dikran Marsupial Sep 18 '22 at 17:07
  • @DikranMarsupial I do plan to post it someday, and I will link it in a comment to one of your questions when I do. – Dave Sep 18 '22 at 17:57
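Following up on the cost-sensitive learning point in the comments: as I understand it, the same tilt that upsampling provides can be obtained by weighting instead of duplicating rows. A rough sketch, assuming X_train and obs_train hold copies of the original (un-resampled) training arrays (names not used in the code above):

from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score

# up-weight the minority class instead of duplicating its rows
w = {0: 1.0, 1: sum(obs_train == 0) / sum(obs_train == 1)}
model3 = LogisticRegressionCV(class_weight=w).fit(X_train, obs_train)
print('ROC AUC weighted:', roc_auc_score(obs, model3.predict_proba(X)[:, 1]))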

0 Answers