
I am working on a project with the goal of predicting Cerebral strokes from brain arteries data (speed of blood, resistance etc. of one artery and of the neighboring ones).

I have a dataset with labelled data, but it is highly imbalanced: patients with stroke are a small minority, so the models I have tried (random forests and some boosting methods) always predict 'non-stroke'.

I am looking for the most effective ways of dealing with this. These are the options I am considering:

  1. Tackle the class imbalance directly: SMOTE or stratified sampling? What are the differences between the two?

  2. [BONUS] Tackle it through a loss function that penalises false negatives more than false positives. Is there an efficient way to do that?

duecci
    You don't want to classify, you want to predict risk. Resampling the minority is going to ruin your risk estimates because the prevalence is different than what is observed in the population. See my answer here (https://stats.stackexchange.com/questions/558942/why-is-it-that-if-you-undersample-or-oversample-you-have-to-calibrate-your-outpu/558950#558950) for more. – Demetri Pananos Mar 31 '22 at 22:41
    My usual class imbalance links (which now includes Demetri’s answer): https://stats.stackexchange.com/questions/357466 https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/ https://stats.stackexchange.com/questions/558942/ https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Apr 01 '22 at 01:22

1 Answer


Use a probabilistic classifier (e.g. logistic regression, kernel logistic regression or Gaussian process classifiers) that outputs the probability of a stroke rather than a hard 0/1 classification. That way you can factor in unequal misclassification costs without refitting the model (it is just a change of threshold).
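A minimal sketch of this idea, assuming scikit-learn and a synthetic stand-in for the artery data (the dataset, feature count, and class ratio here are hypothetical, not from the question):

```python
# Sketch: a probabilistic classifier whose operating point can be changed
# after fitting -- no refitting needed, just a different threshold on the
# estimated probabilities.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the artery data: ~5% positive (stroke) class.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_stroke = model.predict_proba(X_te)[:, 1]  # estimated P(stroke | x)

# Lowering the threshold trades false positives for false negatives;
# the fitted model itself is untouched.
for threshold in (0.5, 0.2, 0.05):
    n_flagged = int((p_stroke >= threshold).sum())
    print(f"threshold={threshold}: {n_flagged} patients flagged")
```

Lowering the threshold flags more patients as at risk, which is how unequal misclassification costs enter the decision without touching the fitted model.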

In general, classifiers don't have an undue bias against the minority class, and if all patterns are being assigned to the majority class, that is the optimal decision for equal false-positive and false-negative costs (see my related question), so you don't need to do anything about the imbalance. The reason for balancing etc. is actually because the misclassification costs are not equal, but this has nothing to do with the imbalance per se, it is just that imbalanced learning problems tend to be ones with obviously unequal costs. For example, I suspect it is a far worse error to classify a patient as not being at risk of stroke when they are (they may go home untreated, have a stroke and die) than to classify them as at risk of stroke when they are not (they will just be subjected to unnecessary further testing). We should be considering misclassification costs anyway, regardless of imbalance.
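To make the cost argument concrete: with cost $c_{FN}$ for a missed stroke and $c_{FP}$ for a false alarm, the expected-cost-minimising rule is to flag a patient whenever $P(\text{stroke}\mid x) > c_{FP}/(c_{FP}+c_{FN})$. A small sketch (the 20:1 cost ratio is an illustrative assumption, not a clinical recommendation):

```python
# Sketch: converting unequal misclassification costs into a probability
# threshold.  Flag a patient whenever
#     P(stroke | x) * c_fn  >  (1 - P(stroke | x)) * c_fp,
# i.e. whenever P(stroke | x) exceeds c_fp / (c_fp + c_fn).
def cost_threshold(c_fn: float, c_fp: float) -> float:
    """Probability threshold minimising expected misclassification cost."""
    return c_fp / (c_fp + c_fn)

# Hypothetical costs: missing a stroke is 20x worse than a false alarm.
t = cost_threshold(c_fn=20.0, c_fp=1.0)
print(f"flag patients with P(stroke) >= {t:.4f}")  # 1/21, about 0.0476
```

Note how heavily unequal costs push the threshold far below 0.5, which is why a probabilistic model plus thresholding often addresses "imbalance" on its own.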

If you have a very small dataset, the classifier may have an undue bias against the minority class (see my answer to this related question), in which case, in principle, it would be worth resampling or re-weighting to compensate for this bias. However, as the data are so scarce, it is difficult to judge how much compensation is required, so collecting more data to get rid of the bias is likely to be the only reliable solution.
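If re-weighting is attempted, it can be done without resampling at all. A sketch using scikit-learn's `class_weight` parameter on synthetic data (the dataset and weights are illustrative assumptions):

```python
# Sketch: re-weighting instead of resampling.  class_weight="balanced"
# scales each class's loss contribution inversely to its frequency; an
# explicit dict lets you tune how much compensation is applied.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced data: ~5% positive class.
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.95, 0.05], random_state=1)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(class_weight="balanced",
                              max_iter=1000).fit(X, y)

# Re-weighting shifts probability mass toward the minority class, so at
# the default 0.5 threshold the weighted model flags more patients.
print(int(plain.predict(X).sum()), "flagged without weighting")
print(int(weighted.predict(X).sum()), "flagged with balanced weighting")
```

As the answer notes, choosing the right amount of compensation from very scarce data is the hard part; `"balanced"` is a convention, not a principled estimate of it.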

I disagree with some others on this issue, but it is possible for discrete classifiers to give better 0/1 classifications than probabilistic methods. This is because they focus on the contour of the probability of class membership that actually matters and are less likely to be distracted by modelling features of the probability of class membership that don't affect the 0/1 classification. For a synthetic example, see my question here. In that case, you may want to try discrete classifiers; some of those cannot accommodate unequal misclassification costs, so resampling the data may be required. Note, however, that I asked a question (with a small bounty) requesting practical examples where this worked, and it went unanswered.

Lastly, I would advise against using SMOTE with modern classifiers that can accommodate unequal misclassification costs (such as the SVM, where you can have different $C$ values for the positive and negative classes) and that have a means of dealing with overfitting (again, for the SVM, the $C$ parameters and also the kernel parameters). The SVM has a lot of theory behind it, whereas the regularisation SMOTE implements by blurring the training examples is only heuristic, and is really only likely to be beneficial with a very basic classifier, like a decision tree.
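A sketch of the per-class $C$ idea in scikit-learn, where `class_weight` multiplies $C$ for each class (the 10:1 weight and the synthetic data are illustrative assumptions):

```python
# Sketch: an SVM with a larger effective C for the positive class, as a
# cost-sensitive alternative to SMOTE-style resampling.  In scikit-learn,
# class_weight multiplies C per class.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical imbalanced data: ~10% positive class.
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

base = SVC(C=1.0, kernel="rbf").fit(X, y)
# Effective C for the stroke class is C * 10, so false negatives are
# penalised ten times as heavily during fitting.
weighted = SVC(C=1.0, kernel="rbf",
               class_weight={0: 1.0, 1: 10.0}).fit(X, y)

print(int(base.predict(X).sum()), "flagged without weighting")
print(int(weighted.predict(X).sum()), "flagged with 10:1 cost weighting")
```

The cost asymmetry is expressed directly in the fitting objective, with no synthetic examples generated.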

Dikran Marsupial
    Minor note: these are not probabilistic classifiers; they are probability estimators. And I've never seen anyone who studied statistics for a few years want to use SMOTE. – Frank Harrell Apr 02 '22 at 14:41
  • @FrankHarrell Can you point out some references where I can learn more about the disadvantages of using SMOTE? I'd like to read more about why its use is discouraged. – stateMachine May 11 '23 at 04:16
    @stateMachine I don't know of references on that topic (although I am performing some experiments along those lines at the moment), however the issue is that SMOTE is basically a hack with no theoretical underpinnings or justification. It does work for primitive classifier systems (such as those mentioned in the SMOTE paper) that have no means of implementing cost-sensitive learning or avoiding over-fitting, but these days almost all classifier systems have both of those things, with the advantage of better theoretical justification. Not sure why SMOTE is still used nearly as much as it is. – Dikran Marsupial May 11 '23 at 07:35
    Here is a definitive reference: https://academic.oup.com/jamia/article/29/9/1525/6605096 – Frank Harrell May 11 '23 at 11:17
  • @DikranMarsupial Thanks a lot for your insights. – stateMachine May 13 '23 at 03:30
  • @stateMachine BTW I am an engineer by background, so I can appreciate a judicious hack every now and again in the right circumstances. "heuristic" might be a more diplomatic term ;o) – Dikran Marsupial May 13 '23 at 10:14