
I am running an ensemble random forest (ERF) model, a newer method published in 2020. The method uses a double bootstrapping step to balance imbalanced training data, then grows multiple random forests and averages performance metrics across the forests (i.e., the ensemble). The ensembling is also supposed to take the place of cross-validation across folds. My current training:test split ratio is 90:10 (I have also tried 80:20), and I have tuned the number of trees (ntree = 1000) and the number of variables randomly sampled at each node split (mtry = 3).

The model runs smoothly but continues to overfit. I know this because I compared AUC, RMSE, and accuracy between the ensemble means and the test set, and the ensemble consistently performs much better (e.g., ensemble AUC = 0.99 vs. test AUC = 0.72). I have plotted learning curves, but they change very little regardless of how high or low I set ntree or mtry.

My data have 10 predictors (2 factors, the rest continuous) and a severely imbalanced binary response, although the response should be balanced by the second bootstrapping step. I have removed or converted outliers and tried subsetting the predictors in different combinations, but there is no substantial change in the gap between ensemble and test performance. Two pairs of predictors are weakly correlated, but removing them also makes no substantial difference.

There does not appear to be a direct way to limit tree depth in this package, which is how I would have tried to deal with overfitting in a basic random forest model (sketched below). Any suggestions on how to deal with the overfitting would be appreciated.
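For reference, here is a minimal sketch (not the ERF package) of the kind of tree-complexity control I mean, using the plain randomForest package. dat and response are placeholder names for my data, and the nodesize/maxnodes values are arbitrary starting points rather than tuned settings.

library(randomForest)

# A basic random forest where tree complexity is limited directly;
# dat and response are placeholder names for illustration only.
rf_shallow <- randomForest(response ~ .,
                           data     = dat,
                           ntree    = 1000,
                           mtry     = 3,
                           nodesize = 20,  # larger minimum terminal-node size => shallower trees
                           maxnodes = 16)  # hard cap on terminal nodes per tree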

Here is the GitHub repository for the R package: https://github.com/zsiders/EnsembleRandomForests?tab=readme-ov-file

Here is example code from the package's author(s): https://zsiders.github.io/EnsembleRandomForests/

Below is my code for retrieving the threshold-dependent performance, which has not been altered from the package's example code:

# Same procedure as above but on each test set:
# pick the threshold that maximizes sensitivity + specificity - 1 (the TSS),
# then pull each threshold-dependent metric at that threshold
test.perf <- sapply(ens_rf_ex$roc_test, function(x){
  id <- which.max((x$sens@y.values[[1]] + x$spec@y.values[[1]]) - 1)
  sapply(x, function(y) ifelse(class(y)[1] == 'performance',
                               y@y.values[[1]][id],
                               y))
})
# Average test performance
test.perf.df <- data.frame(Metric    = round(rowMeans(test.perf), 3),
                           Name      = c('True Positive Rate',
                                         'True Negative Rate',
                                         'Area Under the Curve',
                                         'Correlation Coefficient',
                                         'Accuracy',
                                         'Error',
                                         'False Positive Rate',
                                         'Sensitivity',
                                         'Specificity',
                                         'True Skill Statistic',
                                         'Root Mean Squared Error'),
                           lower.bnd = c(0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0),
                           upper.bnd = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
                           perfect   = c(1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0))
knitr::kable(test.perf.df, caption = "Mean Test Performance")
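For completeness, this is roughly how I line up the two summaries to see the fit gap. ens.perf.df is a hypothetical name here for the ensemble-level table, assumed to be built the same way as test.perf.df above but from the ensemble ROC objects rather than roc_test.

# Hypothetical comparison of ensemble vs. test performance;
# ens.perf.df is assumed to have the same structure as test.perf.df above
gap.df <- data.frame(Name     = test.perf.df$Name,
                     Ensemble = ens.perf.df$Metric,
                     Test     = test.perf.df$Metric,
                     Gap      = round(ens.perf.df$Metric - test.perf.df$Metric, 3))
knitr::kable(gap.df, caption = "Ensemble vs. Test Performance")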
