Classification in imbalanced datasets: how to measure performance on the test set?

Question

I am using re-sampling methods to address the imbalance between classes for my binary classification problem.

I am not sure how to measure the performance of my model on the test set:

- Should I re-sample the test set to have an idea how my model is performing on the test set compared to the training set?
- Or should I measure the performance on the original test set and, to compare it with the training set, also measure the performance on the original training set?

I use resampling in order to give a higher weight to positive samples, which belong to the class that is the most important for me (misclassifying a positive sample is more costly from a business point of view than misclassifying a negative sample).
I would think that my second bullet point is the best methodological approach, but my first bullet point still makes some sense IMO (from a performance point of view, not a business point of view). – Tanguy, Aug 28 '17 at 08:43
Why not fit a probabilistic model? I.e. one that gives probabilities of class membership? – Matthew Drury, Aug 28 '17 at 13:06
In this case, I am using RF as they are performing pretty well. I believe RF cannot be qualified as a probabilistic model in classification, but I guess the probability of a class can be approximated by the proportion of votes for each class. Could you please show me how this relates to the re-sampling methodology, in order to correctly measure the model's performance? – Tanguy, Aug 28 '17 at 15:51
I can't really help you with resampling, because I don't understand it very well. I've been vocal about my suspicion that it is possibly snake oil. On RF, it is a probabilistic model, as each tree is probabilistic by averaging the labels in each terminal node. – Matthew Drury, Aug 28 '17 at 16:58
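
The two ways of reading a probability out of a random forest mentioned in these comments can be sketched as follows. This is only an illustration: the leaf counts are hypothetical numbers, and the two helper functions are made-up names rather than any library's API.

```python
# Hypothetical forest of five trees. For one sample, each tuple holds the
# (negative, positive) training labels counted in the leaf that tree sends it to.
leaf_counts = [(8, 2), (3, 2), (1, 4), (9, 1), (2, 6)]

def vote_proportion(leaf_counts):
    """Share of trees whose leaf majority is the positive class (hard votes)."""
    votes = [1 if pos > neg else 0 for neg, pos in leaf_counts]
    return sum(votes) / len(votes)

def averaged_leaf_probability(leaf_counts):
    """Mean over trees of the positive label fraction inside each leaf."""
    fractions = [pos / (neg + pos) for neg, pos in leaf_counts]
    return sum(fractions) / len(fractions)

print(round(vote_proportion(leaf_counts), 2))            # 0.4
print(round(averaged_leaf_probability(leaf_counts), 2))  # 0.45
```

Both numbers are estimates of class membership probability; the second keeps the information about how pure each leaf is, which the hard vote discards.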

score 1 · Answer 1 · answered Aug 27 '17 at 10:54

The most realistic option would be to resample your training set however you see fit to train a model that can deal with the class imbalance and to leave the test set untouched to test if that model then can actually deal with imbalance. If you take the imbalance out of the test set, you will not see if your model can deal with it or not.
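
As a minimal sketch of that idea, assuming plain random oversampling and a toy label vector with 5% positives (both are illustrative choices, and `stratified_split` and `oversample_minority` are hypothetical helpers), the split happens first and only the training indices are resampled:

```python
import random

def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

def oversample_minority(indices, labels, rng):
    """Duplicate random minority indices until both classes have equal counts."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return indices + extra

def positive_rate(indices, labels):
    """Fraction of the given indices that carry the positive label."""
    return sum(labels[i] for i in indices) / len(indices)

rng = random.Random(0)
labels = [1] * 50 + [0] * 950                    # toy data, 5% positives
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, rng=rng)
train_resampled = oversample_minority(train_idx, labels, rng)

print(positive_rate(train_idx, labels))          # 0.05 (original training set)
print(positive_rate(train_resampled, labels))    # 0.5  (what the model is fit on)
print(positive_rate(test_idx, labels))           # 0.05 (test set left untouched)
```

The model would be fit on `train_resampled` and scored on `test_idx`, so the reported performance reflects the class ratio the model will actually face.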

If you need to do a lot of model tuning and model selection, I would also run cross-validation (CV) within the training set: do the resampling each time within the training folds only, and compare your models on the untouched held-out fold. Once you have chosen a model and its parameters, you may retrain it on a resampled training set that includes all the folds and, without changing the parameters of the model at this point, obtain a performance measure from the test set.
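
The procedure above might look roughly like this. Everything concrete here is an assumption made for illustration: the one-feature Gaussian toy data, the k-nearest-neighbour classifier, random oversampling as the resampling method, and balanced accuracy as the comparison metric.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training rows nearest to x (one feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return 1 if 2 * sum(y for _, y in nearest) > k else 0

def oversample(rows, rng):
    """Randomly duplicate minority rows until both classes are the same size."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    small, large = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    return rows + [rng.choice(small) for _ in range(len(large) - len(small))]

def balanced_accuracy(rows, preds):
    """Mean of the recall on positives and the recall on negatives."""
    tp = sum(1 for (_, y), p in zip(rows, preds) if y == 1 and p == 1)
    tn = sum(1 for (_, y), p in zip(rows, preds) if y == 0 and p == 0)
    n_pos = sum(y for _, y in rows)
    return 0.5 * (tp / n_pos + tn / (len(rows) - n_pos))

def evaluate(train, held_out, k, rng):
    """Resample the training rows only, then score on the untouched rows."""
    resampled = oversample(train, rng)
    preds = [knn_predict(resampled, x, k) for x, _ in held_out]
    return balanced_accuracy(held_out, preds)

def cv_score(rows, k, n_folds, rng):
    """Average held-out score over folds that each contain both classes."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    folds = [pos[i::n_folds] + neg[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(evaluate(train, held_out, k, rng))
    return sum(scores) / n_folds

def make_rows(n_pos, n_neg, rng):
    """Toy data: positives centred at 2.0, negatives at 0.0 (an assumption)."""
    return ([(rng.gauss(2.0, 1.0), 1) for _ in range(n_pos)]
            + [(rng.gauss(0.0, 1.0), 0) for _ in range(n_neg)])

rng = random.Random(0)
train_rows = make_rows(20, 380, rng)
test_rows = make_rows(10, 190, rng)

# Model choice uses only the training set.
cv_scores = {k: cv_score(train_rows, k, n_folds=5, rng=rng) for k in (1, 5, 15)}
best_k = max(cv_scores, key=cv_scores.get)

# One final fit on the resampled full training set, scored once on the test set.
test_score = evaluate(train_rows, test_rows, best_k, rng)
print(cv_scores, best_k, test_score)
```

The held-out fold and the test set never pass through `oversample`, so every score is computed at the original class ratio.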

Don't use accuracy to measure performance at any point in time since all your out of sample predictions will be made on imbalanced sets.
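
A small worked example of why accuracy misleads here, using made-up labels and predictions (1,000 cases with 5 positives) rather than output from any real model:

```python
def metrics(y_true, y_pred):
    """Accuracy, recall and precision for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Undefined ratios are reported as 0.0 by convention in this sketch.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }

y_true = [1] * 5 + [0] * 995

# Predictor A never flags anything.
always_negative = [0] * 1000

# Predictor B finds 4 of the 5 positives at the price of 45 false alarms.
flagging = [1, 1, 1, 1, 0] + [1] * 45 + [0] * 950

print(metrics(y_true, always_negative))  # accuracy 0.995, recall 0.0, precision 0.0
print(metrics(y_true, flagging))         # accuracy 0.954, recall 0.8, precision 4/49
```

Accuracy ranks the useless predictor A above predictor B, while recall and precision on the positive class show what each one actually does for the class that matters.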

David Ernst


I am still unconvinced that resampling solves any actual problem. If you have a strong understanding of this technique, please offer an answer to my questions here: https://stats.stackexchange.com/questions/285231/what-problem-does-oversampling-undersampling-and-smote-solve, https://stats.stackexchange.com/questions/247871/what-is-the-root-cause-of-the-class-imbalance-problem – Matthew Drury Aug 29 '17 at 13:32
I was mostly answering his questions within the framework of resampling. Personally, I prefer changing the cutoff and, if that doesn't help, perhaps feature engineering. I didn't dismiss his premise, though, since for example the book Applied Predictive Modeling presents resampling as one way to deal with imbalance. Within the resampling framework, the most important point was to not resample your test set, or you will never know how you perform. – David Ernst Aug 29 '17 at 13:36
I know this is commonly presented in the literature, and in many textbooks, yet I cannot get a good, straight answer about exactly what it means to "deal with unbalanced data". I've routinely fit gradient boosted models (minimizing log loss) to data with a 200/1 class ratio with no issues. Why is imbalance a thing that must be "dealt with"? – Matthew Drury Aug 29 '17 at 13:43
Because many algorithms will implicitly optimize accuracy which is inadequate with strong class and cost imbalance. If you have found an algorithm that doesn't do that for the data-sets you had, all the better. – David Ernst Aug 29 '17 at 13:59
Which algorithms are you thinking of? Most classifiers in common use optimize log-loss (trees, random forest, gradient boosting, neural networks, regression, additive models, etc). Of the well known algorithms, only SVM falls into that class, and this is probably one large reason it has fallen out of favor in the current day. To put a blunt point on it, if you are ever optimizing accuracy, you have a pretty large conceptual problem. – Matthew Drury Aug 29 '17 at 14:02
Nearest-neighbor-based methods and graph-based label propagation methods will wipe out your minority class, as will decision trees if the class is too rare. There are precautions you could take to bias them towards seeing the minority class, but the algorithm will not know how important the rare records are. – David Ernst Aug 30 '17 at 07:16
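
The cutoff-based alternative raised in the comments above can be sketched as follows. It assumes the model outputs reasonably calibrated probabilities and that correct decisions cost nothing; under those assumptions the expected-cost rule is to predict positive when the probability exceeds `cost_fp / (cost_fp + cost_fn)`. The costs, labels and probabilities are invented for the example.

```python
def cost_threshold(cost_fp, cost_fn):
    """Cutoff that minimizes expected cost, assuming calibrated probabilities."""
    return cost_fp / (cost_fp + cost_fn)

def total_cost(y_true, probs, threshold, cost_fp, cost_fn):
    """Sum the misclassification costs when predicting 1 for p >= threshold."""
    cost = 0.0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 0:
            cost += cost_fp          # false alarm
        elif pred == 0 and y == 1:
            cost += cost_fn          # missed positive
    return cost

# Hypothetical business costs: a missed positive is 9 times worse than a false alarm.
cost_fp, cost_fn = 1.0, 9.0

# Hypothetical predictions on an untouched, imbalanced test set.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
probs = [0.02, 0.05, 0.08, 0.12, 0.15, 0.20, 0.30, 0.25, 0.40, 0.70]

threshold = cost_threshold(cost_fp, cost_fn)
print(threshold)                                            # 0.1
print(total_cost(y_true, probs, 0.5, cost_fp, cost_fn))     # 9.0 (one missed positive)
print(total_cost(y_true, probs, threshold, cost_fp, cost_fn))  # 5.0 (five false alarms)
```

Here the model and the data stay as they are; only the decision rule changes to reflect that positives are more costly to miss.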
