Questions tagged [unbalanced-classes]

Data organized into discrete categories or classes may present problems for certain analyses if the number of observations ($n$) belonging to each class is not constant across classes. Classes with unequal $n$ are unbalanced.

This tag should be used for questions about datasets with subsamples of unequal size where imbalanced distributions across categorical factors are of concern.

Analyses with known, non-negligible sensitivity to unbalanced classes include (but are not limited to):

Reference

Howell, D. C. (2009). Unequal sample sizes do matter. University of Vermont. Retrieved from http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Unequal-ns/unequal-ns.html.

1053 questions
25 votes, 1 answer

Balanced accuracy vs F-1 score

I was wondering if anyone could explain the difference between balanced accuracy, which is b_acc = (sensitivity + specificity)/2, and the F1 score, which is f1 = 2*precision*recall/(precision + recall)?
dvreed77
  • 747
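
The two formulas quoted in this question can be compared directly on a confusion matrix. The sketch below is only an illustration: the function names and the counts (`tp`, `fn`, `fp`, `tn`) are hypothetical and do not come from the question. It assumes a binary problem with 10 positives and 990 negatives.

```python
def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def f1_score(tp, fp, tn, fn):
    """Harmonic mean of precision and recall; tn is accepted but never used."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion-matrix counts: 10 positives, 990 negatives.
tp, fn, fp, tn = 8, 2, 40, 950

b_acc = balanced_accuracy(tp, fp, tn, fn)
f1 = f1_score(tp, fp, tn, fn)
print(f"balanced accuracy = {b_acc:.3f}, F1 = {f1:.3f}")
```

With these made-up counts the two scores differ noticeably: balanced accuracy uses the true negatives through specificity, while F1 ignores them and is pulled down by the false positives through precision.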
12 votes, 0 answers

Are there Imbalanced learning problems where re-balancing/re-weighting demonstrably improves *accuracy*?

I have been looking into the imbalanced learning problem, where a classifier is often expected to be unduly biased in favour of the majority class. However, I am having difficulties identifying datasets where class imbalance is genuinely a problem…
Dikran Marsupial
  • 54,432
5 votes, 1 answer

When is dataset considered unbalanced?

I have a data set that is highly unbalanced: the target attribute is 93% False and 7% True. But I know that this is normal for my kind of data. I am afraid that if I take any steps (for example, I could use fewer False cases), I will skew the distribution…
HonzaB
  • 683
3 votes, 1 answer

Modelling with Unbalanced dataset

I am working with a fairly unbalanced dataset (event class < 5% - it's a binary classification problem). To deal with this imbalance, I am trying out various techniques such as Oversampling the minority class (as well as synthetically generating…
Dataminer
  • 375
2 votes, 1 answer

When should we avoid balancing data

Can someone point me to a resource (textbook, paper, blog, ..) that clearly explains when we should NOT balance data for classification/regression? I found…
2 votes, 0 answers

Unbalanced data and undersampling

When using undersampling to compensate for unbalanced data, what should you use for a testing dataset?
AngusE
  • 21
2 votes, 0 answers

Imbalanced learning - under sampling vs. over sampling vs. weight based classifiers

Does anyone know what the difference is (theoretically speaking) between undersampling, oversampling, and weight-based classifiers when dealing with highly imbalanced datasets (1:1000, 1:10000)? When is it recommended to use each one? Is there a…
YinnonM
  • 21
2 votes, 1 answer

How do Adasyn and SMOTE handle categorical data, specifically binary features?

SMOTE oversamples the minority class by creating synthetic data along the line connecting a minority class sample with each (or however many are chosen) of its K neighbors. In other words, xnewsample = xoldsample + lambda*(xneighbor - xoldsample).…
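
The interpolation step described in this excerpt can be sketched in a few lines. This is only the single formula from the question, not a full SMOTE implementation: there is no neighbor search, and the function name `interpolate` and the feature vectors are hypothetical. The second feature is binary in both points, to show what linear interpolation does to it.

```python
import random


def interpolate(x_old, x_neighbor, lam):
    """Return x_old + lam * (x_neighbor - x_old), computed feature by feature."""
    return [a + lam * (b - a) for a, b in zip(x_old, x_neighbor)]


# Hypothetical minority-class sample and one of its neighbors.
# The second feature is binary (0 or 1) in both points.
x_old = [1.0, 0.0, 2.5]
x_neighbor = [3.0, 1.0, 2.0]

# lambda is drawn uniformly from [0, 1); the seed is fixed only for repeatability.
rng = random.Random(0)
lam = rng.random()
x_new = interpolate(x_old, x_neighbor, lam)

# Midpoint of the segment, for a value that is easy to check by hand.
x_mid = interpolate(x_old, x_neighbor, 0.5)
print(x_new, x_mid)
```

At the midpoint the binary feature becomes 0.5, which is neither 0 nor 1. That is the difficulty the question raises for binary and categorical features.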
1 vote, 0 answers

Why class-balancing techniques are sometimes useful?

There are a lot of questions here regarding when to do class balancing, or what to expect of class balancing or whether unbalanced classes are an issue at all. Apparently the "consensus" among most of the top answers on these questions is that, for…
1 vote, 1 answer

Should I upsample both my training as my test set?

I have a highly unbalanced dataset (1000 vs 60), on which I want to use upsampling. The real-life distribution of the problem (predicting no-shows) is probably also highly imbalanced. My question is two-fold: 1) I know that I should keep the…
1 vote, 0 answers

Counter intuitive in AUPRC and Recall and Precision and F1 for imbalanced dataset

I would like to ask for a detailed explanation of how to compare several classifiers on an imbalanced dataset using the following metrics: area under the ROC curve (AUC), area under the precision-recall curve (AUPRC), recall, precision, and F1 score. As my data…
1 vote, 0 answers

Unbalanced distribution of multi-classes, how can I divide training/testing set

Experts in statistics, I am a new student in the machine learning field. I just started a job classifying a set of scientific abstracts into five classes. The text distribution is as follows: Class1: 200, Class2: 950, Class3: 150, Class4:…
W Lee
  • 11
1 vote, 0 answers

Is there any built-in MSMOTE library?

I am trying to deal with data imbalance within a small dataset. I just found an article discussing SMOTE and MSMOTE here. It seems that MSMOTE can overcome the shortcomings of SMOTE, so I really want to try it. The MSMOTE paper was published in 2009,…
Cherry Wu
  • 331
1 vote, 2 answers

Bias-Variance tradeoff for classifying unbalanced classes

I would like to use Bias-Variance trade-off to evaluate training set size in a classification problem. There are two classes which are not balanced (~70/30) and it seems that the common use of misclassification error is not good enough. Which…
Eitan
  • 131
1 vote, 2 answers

How to judge a partition is balanced or unbalanced?

Suppose we distributed $100$ coins to $10$ persons and the $i$-th person got ${x}_{i}$ coins, how to judge the distribution $X=\{{x}_{1}, {x}_{2}, ..., {x}_{n}\}$ (e.g., $X=\{5, 20, 15, 5, 10, 10, 10, 15, 5, 5\}$) is (almost) balanced or not? Is…
Lijie Xu
  • 123
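
One possible way to put a number on how balanced such a partition is (an assumption here, not something proposed in the question) is the normalized Shannon entropy of the shares $x_i / \sum_j x_j$. It equals 1 when every person gets the same amount and 0 when one person gets everything. The function name `normalized_entropy` is hypothetical; the example vector is the one from the question.

```python
import math


def normalized_entropy(counts):
    """Shannon entropy of the shares, divided by log(n) so the result lies in [0, 1]."""
    total = sum(counts)
    n = len(counts)
    # Shares equal to zero contribute nothing to the entropy, so they are skipped.
    shares = [c / total for c in counts if c > 0]
    entropy = sum(p * math.log(1 / p) for p in shares)
    return entropy / math.log(n)


# The distribution from the question: 100 coins over 10 persons.
X = [5, 20, 15, 5, 10, 10, 10, 15, 5, 5]
score = normalized_entropy(X)

# Two reference cases: a perfectly even split and an all-to-one split.
even = normalized_entropy([10] * 10)
extreme = normalized_entropy([100] + [0] * 9)
print(score, even, extreme)
```

Where to draw the line between "almost balanced" and "unbalanced" on this scale is a judgment call that the measure itself does not settle.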