
I'm using random forest via the caret package to classify a binary outcome with a 1:10 class ratio, so I need to balance the dataset.

I know two ways:

  1. Apply SMOTE as a stand-alone function to the whole dataset, then pass the balanced data to training.

  2. Set sampling = "smote" in caret's trainControl, so balancing happens inside the training/CV loop.

As far as I understand, the first approach should be better, since it uses the whole dataset to synthesize new samples (I know SMOTE uses only the 5 nearest neighbors by default, but it still has more data points to choose from), while the second method only uses the data points available in each CV partition.

However, are there any benefits to balancing inside the CV?

Riddle-Master

2 Answers


The second method should be preferred, for exactly the reason you gave to justify the first: the first method uses the whole dataset to synthesize new samples. Cross-validation excludes points from training precisely to give an accurate estimate of the error rate on new data. If you apply SMOTE first, information from the excluded points leaks into the training data and taints the CV estimate.
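To see the leakage concretely, here is a minimal, dependency-free Python sketch (caret is R, and the interpolation below is only SMOTE-*like*, not the real algorithm; all names and data are illustrative). Oversampling before the split lets synthetic points, built from the whole dataset, land in the training set while their "parents" land in the test set:

```python
import random

random.seed(0)

# Toy 1-D minority class (illustrative values only).
minority = [1.0, 1.2, 1.4, 5.0, 5.1]

def smote_like(points, n_new):
    """SMOTE-style interpolation: new points on segments between pairs."""
    synthetic = []
    for _ in range(n_new):
        a, b = random.sample(points, 2)
        gap = random.random()              # position along the segment
        synthetic.append(a + gap * (b - a))
    return synthetic

# WRONG order: oversample first, split into train/test afterwards.
augmented = minority + smote_like(minority, 5)
random.shuffle(augmented)
test = augmented[:3]
train = augmented[3:]

# Every synthetic point was built from *all* real points, so a training
# point can sit arbitrarily close to (or be interpolated from) a point
# that later lands in the test split -- that proximity is the leak.
closest = min(abs(tr - te) for tr in train for te in test)
print(f"closest train/test distance: {closest:.3f}")
```

Splitting first and only oversampling the training portion removes this path entirely, because no synthetic point can have a test-set parent.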

G5W

Method 1 should not be used, as it leaks information from the test partition into the training set in each fold of the cross-validation. A synthetic example may lie between a real training pattern and a real test pattern, or between two real test patterns; in the worst case, a synthetic example generated very close to a real test pattern ends up in the training set.

The way to look at it is that cross-validation is a method of evaluating the performance of a procedure for fitting a model, rather than of the model itself. So the whole procedure must be carried out independently, in full, within each fold of the cross-validation. If SMOTE is part of the model-fitting procedure, it must therefore be applied separately in each fold.
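The per-fold ordering can be sketched in a few lines of dependency-free Python (the `oversample()` helper is a hypothetical stand-in that duplicates minority points rather than running real SMOTE; the point being illustrated is the ordering, not the sampler):

```python
import random

random.seed(1)

# Toy imbalanced dataset: label 1 is the rare class (illustrative data).
data = ([(random.gauss(0, 1), 0) for _ in range(90)]
        + [(random.gauss(3, 1), 1) for _ in range(10)])
random.shuffle(data)

def oversample(train_fold):
    """Stand-in for SMOTE: duplicate minority points until balanced.
    (Real SMOTE interpolates between neighbours; duplication keeps this
    sketch dependency-free while preserving the ordering being shown.)"""
    majority = [p for p in train_fold if p[1] == 0]
    minority = [p for p in train_fold if p[1] == 1]
    balanced = majority + minority
    while sum(p[1] for p in balanced) < len(majority):
        balanced.append(random.choice(minority))
    return balanced

k = 5
fold_size = len(data) // k
for i in range(k):
    # 1. Split FIRST: the test fold stays untouched real data.
    test_fold = data[i * fold_size:(i + 1) * fold_size]
    train_fold = data[:i * fold_size] + data[(i + 1) * fold_size:]
    # 2. Balance INSIDE the fold, on the training portion only.
    train_fold = oversample(train_fold)
    # 3. ...fit the model on train_fold, evaluate on test_fold...
    n_pos = sum(lbl for _, lbl in train_fold)
    assert n_pos == len(train_fold) - n_pos  # training fold is balanced
```

This is what `sampling = "smote"` inside caret's `trainControl` arranges for you automatically: the resampling is repeated from scratch inside every fold.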

Dikran Marsupial