I'm currently running a nested cross-validation on a Random Forest to determine optimal hyperparameters (inner loop) and to estimate outer-loop generalisation. My dataset has a class imbalance problem, with only a few (8%) minority cases, so the classifier tends to pick features that favour the majority class. One solution would be to reduce the class imbalance using ROSE or SMOTE.
My question is: how do you ensure that you do not overfit (i.e. leak information into the held-out folds) during nested cross-validation when performing an oversampling manoeuvre? I'm using the mlr package in R to carry out my nested cross-validation, and at the moment I improve the class imbalance before running the nested cross-validation. Does anybody have experience with the mlr package, and if so, how did you handle class imbalance in nested cross-validation code in mlr?
library(mlr)

lrn.rf_original50 <- makeLearner("classif.randomForest", predict.type = "prob")

# Grid search for hyperparameters using stratified 10-fold cross-validation in the inner loop
ps_originalrf50 <- makeParamSet(
  makeIntegerParam("mtry", lower = 1, upper = 20),
  makeIntegerParam("ntree", lower = 1, upper = 2000))
ctrl_originalrf50 <- makeTuneControlGrid()
inner_originalrf50 <- makeResampleDesc("CV", iters = 10, stratify = TRUE)
wrapper_inner_originalrf50 <- makeTuneWrapper(lrn.rf_original50,
  resampling = inner_originalrf50, par.set = ps_originalrf50,
  control = ctrl_originalrf50, show.info = TRUE, measures = list(auc))

# Outer-loop model performance using stratified 10-fold cross-validation
outer_originalrf50 <- makeResampleDesc("CV", iters = 10, stratify = TRUE)
original_modelrf50 <- resample(wrapper_inner_originalrf50, monthdata_taskrf0,
  resampling = outer_originalrf50, extract = getTuneResult,
  show.info = TRUE, measures = list(auc))
The ROSE oversampling was conducted before running this code, so the oversampled data enters the outer loop as monthdata_taskrf0. How would I change this to ensure the oversampling happens only in the inner loop (i.e. on the training folds only)?
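For what it's worth, here is what I was thinking of trying: wrapping the base learner with mlr's makeSMOTEWrapper before the tuning wrapper, so that the oversampling is applied at train time within each resampling iteration, and the held-out folds are never touched. This is only a sketch: the sw.rate value is an arbitrary guess for an 8% minority class, and monthdata_task is a hypothetical task built on the original, non-oversampled data.

# Wrap the base learner so SMOTE runs inside train() of every
# resampling iteration: only training folds get oversampled,
# never the validation/test folds.
# sw.rate = 8 is an illustrative upsampling factor, not a tuned value.
lrn.rf_smote50 <- makeSMOTEWrapper(lrn.rf_original50, sw.rate = 8, sw.nn = 5)

wrapper_inner_smote50 <- makeTuneWrapper(lrn.rf_smote50,
  resampling = inner_originalrf50, par.set = ps_originalrf50,
  control = ctrl_originalrf50, show.info = TRUE, measures = list(auc))

# monthdata_task (hypothetical name) would be built on the ORIGINAL
# data, with no ROSE/SMOTE step beforehand.
smote_modelrf50 <- resample(wrapper_inner_smote50, monthdata_task,
  resampling = outer_originalrf50, extract = getTuneResult,
  show.info = TRUE, measures = list(auc))

Is this the correct way to keep the oversampling out of the outer test folds, or does the SMOTE wrapper need to sit somewhere else in the chain?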