
When tackling the problem of unequal group sizes, the intuitive approach is to increase the number of observations in the minority group. But sometimes the opposite can also be useful, i.e. decreasing the number of observations in the majority group. There are several methods for doing this, in particular Edited Nearest Neighbours (ENN). A simple definition can be found on imbalanced-learn.org.
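To make the question concrete, here is a minimal from-scratch sketch of the ENN rule as I understand it, not imbalanced-learn's implementation: a majority-class point is dropped when most of its k nearest neighbours belong to another class. The function name, the toy data, and the choice of k = 3 are all illustrative assumptions.

```python
# Sketch of Edited Nearest Neighbours (ENN) undersampling.
# Assumption: only majority-class points are edited; Euclidean distance; k = 3.

def enn_undersample(X, y, majority_label, k=3):
    """Drop majority-class points whose k nearest neighbours
    mostly belong to a different class."""
    def dist2(a, b):
        # squared Euclidean distance (monotone, so fine for ranking)
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)              # minority points are never removed
            continue
        # indices of the k nearest other points
        nearest = sorted((j for j in range(len(X)) if j != i),
                         key=lambda j: dist2(xi, X[j]))[:k]
        same = sum(1 for j in nearest if y[j] == yi)
        if same > k // 2:               # neighbourhood agrees -> keep
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: a class-0 cluster near (0, 0), a class-1 cluster
# near (5, 5), and one class-0 point sitting inside the class-1 cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 4), (4, 5), (4, 4)]
y = [0, 0, 0, 0, 0, 1, 1, 1]

Xr, yr = enn_undersample(X, y, majority_label=0, k=3)
# (5, 5) is dropped: its 3 nearest neighbours are all class 1.
```

So ENN removes exactly the majority points whose local neighbourhood contradicts their label, i.e. points on the "wrong side" of the boundary.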

I am struggling to understand why this would be done. How can a model generalize if we eliminate points near the decision boundary? This is easy to do on the training data, but when the model is applied to unseen test data, such points will probably occur again and hurt performance. Could you elaborate with an example of where this is useful, compared with random oversampling, which naively seems far more attractive, especially when we care about sample size? Thanks!
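For contrast, the random oversampling I have in mind simply duplicates minority rows until the classes are balanced. A minimal sketch (function name and data are illustrative, with a fixed seed for reproducibility):

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until both classes
    have the same number of observations."""
    rng = random.Random(seed)
    minority = [i for i, yi in enumerate(y) if yi == minority_label]
    # how many extra minority copies are needed to reach balance
    deficit = (len(y) - len(minority)) - len(minority)
    extra = [rng.choice(minority) for _ in range(deficit)]
    Xr = list(X) + [X[i] for i in extra]
    yr = list(y) + [y[i] for i in extra]
    return Xr, yr

# Hypothetical 1-D data: 7 majority vs 3 minority observations.
X = list(range(10))
y = [0] * 7 + [1] * 3

Xr, yr = random_oversample(X, y, minority_label=1)
# Both classes now have 7 observations each.
```

Unlike ENN, this keeps every original point, including the noisy ones near the boundary, which is exactly the trade-off my question is about.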

SusanW
  • 171

0 Answers