To start, let's be clear that applying SMOTE to the entire dataset before the test set is split off is bad practice: as you say, it causes data leakage and generally produces overly optimistic estimates of performance. See e.g. SMOTE data balance - before or during Cross-Validation
But that's not what you've done here: the test set was split off before SMOTE was applied, so the test-set score should still be an unbiased estimate of the model's future performance. In that sense, based just on what you've described, this is fine.
Applying SMOTE across the whole training set before using cross-validation for model selection did leak information into the validation folds, but at the end of the day the selected model performs well on the untouched, separate test set. Every candidate model suffered the same leakage, so the likely effect was to select a model that performs well in general but also happened to benefit most from the leakage. It could well be that there's a better model that benefits less from the leakage, but that should have shown up when you reran the experiment with the more proper approach; a sketch of that setup is below.
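For reference, here's a minimal sketch of the "more proper" approach, with SMOTE kept inside the cross-validation folds via an imblearn pipeline. The data, classifier, and scoring metric are placeholders, not your actual setup:

```python
# Minimal sketch: SMOTE inside the CV folds (placeholder data and model).
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Because SMOTE sits inside the pipeline, it is refit on each fold's
# training portion only; the validation folds are never resampled.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc")
print(scores.mean())
```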
All that said, I suspect the improved performance is just noise. If you repeated the experiment with different splits, random states, etc., I expect you'd find the net improvement is near zero (and maybe negative). (If you have time to run that sort of experiment, I'd love to see it posted as an answer here.)
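If you do want to try it, a rough sketch along these lines would work: repeat the split over several random states, select a model both ways (SMOTE applied to the whole training set before CV vs. SMOTE inside the CV folds), and compare the two selected models on the untouched test set. Again, the data, classifier, and parameter grid here are just placeholders:

```python
# Rough sketch of the repeated experiment (placeholder data, model, and grid).
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
diffs = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )

    # Leaky selection: oversample the whole training set, then cross-validate.
    X_res, y_res = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)
    leaky = GridSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1, 10]},
        scoring="roc_auc", cv=5,
    ).fit(X_res, y_res)

    # Proper selection: SMOTE is refit inside each training fold.
    proper = GridSearchCV(
        Pipeline([("smote", SMOTE(random_state=seed)),
                  ("clf", LogisticRegression(max_iter=1000))]),
        {"clf__C": [0.01, 0.1, 1, 10]},
        scoring="roc_auc", cv=5,
    ).fit(X_tr, y_tr)

    # Difference in test-set AUC between the two selected models.
    diffs.append(
        roc_auc_score(y_te, leaky.predict_proba(X_te)[:, 1])
        - roc_auc_score(y_te, proper.predict_proba(X_te)[:, 1])
    )

print(np.mean(diffs), np.std(diffs))  # expectation: mean difference near zero
```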