I'm currently running a nested cross-validation on a Random Forest to determine optimal hyperparameters (inner loop) and to estimate outer-loop generalisation. My dataset has a class imbalance problem, with only a few (8%) minority cases, so the classifier tends to pick features that favour the majority class. One solution would be to reduce the class imbalance using ROSE or SMOTE.
My question is: how do you ensure that you do not overfit (i.e. leak information into the held-out folds) during nested cross-validation when performing an oversampling manoeuvre? I'm using the mlr package in R to carry out my nested cross-validation, and at the moment I improve the class imbalance before running the nested cross-validation. Does anybody have experience with the mlr package, and if so, how did you handle class imbalance in nested cross-validation code in mlr?
library(mlr)

lrn.rf_original50 <- makeLearner("classif.randomForest", predict.type = "prob")

# Grid search for hyperparameters using stratified 10-fold cross-validation in the inner loop
ps_originalrf50 <- makeParamSet(
  makeIntegerParam("mtry", lower = 1, upper = 20),
  makeIntegerParam("ntree", lower = 1, upper = 2000))
ctrl_originalrf50 <- makeTuneControlGrid()
inner_originalrf50 <- makeResampleDesc("CV", iters = 10, stratify = TRUE)
wrapper_inner_originalrf50 <- makeTuneWrapper(lrn.rf_original50,
  resampling = inner_originalrf50, par.set = ps_originalrf50,
  control = ctrl_originalrf50, show.info = TRUE, measures = list(auc))

# Outer-loop model performance using stratified 10-fold cross-validation
outer_originalrf50 <- makeResampleDesc("CV", iters = 10, stratify = TRUE)
original_modelrf50 <- resample(wrapper_inner_originalrf50, monthdata_taskrf0,
  resampling = outer_originalrf50, extract = getTuneResult,
  show.info = TRUE, measures = list(auc))
The ROSE oversampling was conducted before running this code, so the oversampled data enters the outer loop as monthdata_taskrf0. How would I change this to ensure the oversampling happens only in the inner loop (i.e. on the training folds only)?
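For what it's worth, here is what I was thinking of trying: wrapping the base learner with mlr's makeSMOTEWrapper before the tuning wrapper, so that the oversampling is applied at train time within each resampling iteration, and the held-out folds are never touched. This is only a sketch: the sw.rate value is an arbitrary guess for an 8% minority class, and monthdata_task is a hypothetical task built on the original, non-oversampled data.

# Wrap the base learner so SMOTE runs inside train() of every
# resampling iteration: only training folds get oversampled,
# never the validation/test folds.
# sw.rate = 8 is an illustrative upsampling factor, not a tuned value.
lrn.rf_smote50 <- makeSMOTEWrapper(lrn.rf_original50, sw.rate = 8, sw.nn = 5)

wrapper_inner_smote50 <- makeTuneWrapper(lrn.rf_smote50,
  resampling = inner_originalrf50, par.set = ps_originalrf50,
  control = ctrl_originalrf50, show.info = TRUE, measures = list(auc))

# monthdata_task (hypothetical name) would be built on the ORIGINAL
# data, with no ROSE/SMOTE step beforehand.
smote_modelrf50 <- resample(wrapper_inner_smote50, monthdata_task,
  resampling = outer_originalrf50, extract = getTuneResult,
  show.info = TRUE, measures = list(auc))

Is this the correct way to keep the oversampling out of the outer test folds, or does the SMOTE wrapper need to sit somewhere else in the chain?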