Let's say that I am building a classifier on imbalanced data. A sample of the data set looks like:

Person   Time1    Time2    Time3    Injury
A        3        2        3        0
A        3        3.4      1.2      0
A        2        2.1      2.1      1
B        0        2        2        0

etc. I want to use Person, Time1, Time2, and Time3 as features to classify Injury (this is just an example I'm making up). Now let's say that in my target Injury I have value counts of:

Label    Count
0        9000
1        50

I want to use SMOTE to both under-sample the majority class and over-sample the minority class. This is easy enough if I'm only using the numerical variables, but what do I do in this case where I have a grouping variable?

It is theoretically fine to have multiple positive Injury cases for any given Person. But how do I set up the SMOTE algorithm so that, when it finds the k nearest neighbours of a data point and generates synthetic points between that point and its neighbours, each synthetic point retains the Person label of the original data point?

user1566200
  • 1,047
  • 1
    I strongly suspect one has to customise the original SMOTE algorithm to do this. Notice that distances are not well defined between categorical variables, so the concept of nearest neighbours is murky. You could start building custom distance metrics (say, some hybrid of Mahalanobis and Hamming, but that is a horrible exercise). My first approach would be not to use resampling to balance the dataset, but to focus on proper metrics that adequately penalise the misclassification of minority-class examples, as well as on classifiers that do not make strong parametric assumptions. (Good basic question.) – usεr11852 Jul 08 '17 at 10:58
  • 3
    Are you sure you need to use SMOTE at all? It’s really hard to recommend the best way to do something when there’s a strong argument that you shouldn’t be doing it in the first place. – Dave Oct 22 '22 at 05:51

1 Answer

It's very late, but SMOTENC is the method designed for oversampling data that mixes categorical and numerical variables.

imblearn.over_sampling.SMOTENC

Mehdi
  • 210