
When tackling the problem of unequal group sizes, the intuitive approach is to increase the number of observations in the minority group. But sometimes the opposite can also be useful, i.e. decreasing the number of observations in the majority group. There are several methods for doing this, in particular Edited Nearest Neighbours (ENN). A simple definition can be found on imbalanced-learn.org.
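To make the question concrete, here is a minimal from-scratch sketch of the ENN rule as I understand it, not imbalanced-learn's implementation: a majority-class point is dropped when most of its k nearest neighbours belong to another class. The function name, the toy data, and the choice of k = 3 are all illustrative assumptions.

```python
# Sketch of Edited Nearest Neighbours (ENN) undersampling.
# Assumption: only majority-class points are edited; Euclidean distance; k = 3.

def enn_undersample(X, y, majority_label, k=3):
    """Drop majority-class points whose k nearest neighbours
    mostly belong to a different class."""
    def dist2(a, b):
        # squared Euclidean distance (monotone, so fine for ranking)
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)              # minority points are never removed
            continue
        # indices of the k nearest other points
        nearest = sorted((j for j in range(len(X)) if j != i),
                         key=lambda j: dist2(xi, X[j]))[:k]
        same = sum(1 for j in nearest if y[j] == yi)
        if same > k // 2:               # neighbourhood agrees -> keep
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: a class-0 cluster near (0, 0), a class-1 cluster
# near (5, 5), and one class-0 point sitting inside the class-1 cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 4), (4, 5), (4, 4)]
y = [0, 0, 0, 0, 0, 1, 1, 1]

Xr, yr = enn_undersample(X, y, majority_label=0, k=3)
# (5, 5) is dropped: its 3 nearest neighbours are all class 1.
```

So ENN removes exactly the majority points whose local neighbourhood contradicts their label, i.e. points on the "wrong side" of the boundary.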

I am struggling to understand why this would be done. How can a model generalize if we eliminate points near the decision boundary? This is easy to do on the training data, but when the model is applied to unseen test data, such points will probably occur again and hurt performance. Could you elaborate with an example of where this is useful, compared with random oversampling, which naively seems far more attractive, especially when we care about sample size? Thanks!
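For contrast, the random oversampling I have in mind simply duplicates minority rows until the classes are balanced. A minimal sketch (function name and data are illustrative, with a fixed seed for reproducibility):

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until both classes
    have the same number of observations."""
    rng = random.Random(seed)
    minority = [i for i, yi in enumerate(y) if yi == minority_label]
    # how many extra minority copies are needed to reach balance
    deficit = (len(y) - len(minority)) - len(minority)
    extra = [rng.choice(minority) for _ in range(deficit)]
    Xr = list(X) + [X[i] for i in extra]
    yr = list(y) + [y[i] for i in extra]
    return Xr, yr

# Hypothetical 1-D data: 7 majority vs 3 minority observations.
X = list(range(10))
y = [0] * 7 + [1] * 3

Xr, yr = random_oversample(X, y, minority_label=1)
# Both classes now have 7 observations each.
```

Unlike ENN, this keeps every original point, including the noisy ones near the boundary, which is exactly the trade-off my question is about.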

SusanW
  • 171

0 Answers