
I am running a random forest in R to classify data with three classes, each of which has around 20 samples. I partition the data into training and test sets in an 80:20 ratio using the caret package. Because the sample size is small, I used a for loop from 1 to 100 to build 100 models and see how accuracy changes with each partitioning. I obtained an accuracy of 1 for all 100 models, and all test data can be classified 100% correctly in two steps. The only change across models is in the top n features ranked by Gini index. I was collecting the top n features of all 100 models and taking the median of the importance scores for the features common across them. I wanted to ask for your suggestions, as an accuracy of 1 is an indicator of over-fitting.

Update: adding an example script to illustrate the RF model I built:

```r
library(caret)         # createDataPartition()
library(randomForest)  # randomForest()
library(magrittr)      # %>%

# simple accuracy helper
calc_acc <- function(actual, predicted) mean(actual == predicted)

model.list <- list()
acc.list <- numeric(100)

for (i in 1:100) {
  # partition (80:20, stratified by class)
  set.seed(i)
  df.training.samples <- df$type %>%
    createDataPartition(p = 0.8, list = FALSE)
  df.train.data <- data.frame(df[df.training.samples, ], check.names = TRUE)
  df.test.data  <- data.frame(df[-df.training.samples, ], check.names = TRUE)

  # build model
  set.seed(i)
  df.tree <- randomForest(type ~ .,
                          data = df.train.data,
                          importance = TRUE,
                          ntree = 1000,
                          maxnodes = 3)

  # add model to list
  model.list[[i]] <- df.tree

  # performance on test data
  df.actual <- df.test.data$type
  df.predicted <- predict(df.tree, df.test.data)
  acc.list[i] <- calc_acc(actual = df.actual, predicted = df.predicted)
}
```
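The importance aggregation described in the question (median of Gini scores across models) is not part of this script; a minimal sketch of how it might look, assuming `model.list` has been filled by the loop above with `randomForest` fits:

```r
library(randomForest)  # importance()

# one column of MeanDecreaseGini per model, one row per feature
imp.mat <- sapply(model.list, function(m) importance(m)[, "MeanDecreaseGini"])

# per-feature median across the 100 models
imp.median <- apply(imp.mat, 1, median)

# top n features by median Gini importance (n = 10 here, as an example)
head(sort(imp.median, decreasing = TRUE), 10)
```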

1 Answer


A perfect score is all but screaming at you that overfitting has occurred. To check for overfitting, it is common to test our models on data that was not used for model training, which you have done, and you still achieved that perfect score. Because your performance is too strong to seem reasonable, my suspicion is that you have leaked data from your training data into the test data, allowing the model to cheat and see the test data before it is supposed to. (This could be considered analogous to a student stealing the exam ahead of time to look at the questions and then scoring high on the test. The high score might sound like mastery of the material, but it comes from cheating, not from knowing the material.)
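A common way such a leak happens is running a preprocessing or feature-selection step on the full dataset before splitting, so the test rows influence what the model sees. A minimal base-R sketch of the pattern (the variable names are illustrative, not taken from the question):

```r
set.seed(1)
x <- matrix(rnorm(60 * 5), nrow = 60)
train.idx <- 1:48

# LEAKY: scaling statistics computed on ALL rows, including future test rows
x.scaled <- scale(x)
train.leaky <- x.scaled[train.idx, ]  # test rows influenced these values

# CORRECT: fit the preprocessing on the training rows only,
# then apply the same transformation to the held-out test rows
mu  <- colMeans(x[train.idx, ])
sdv <- apply(x[train.idx, ], 2, sd)
train.ok <- sweep(sweep(x[train.idx, ], 2, mu), 2, sdv, "/")
test.ok  <- sweep(sweep(x[-train.idx, ], 2, mu), 2, sdv, "/")
```

In the posted loop the split itself happens inside the loop, so a leak, if there is one, would be upstream of it, e.g. feature selection or normalization performed on `df` before partitioning.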

Dave
  • Note that accuracy is a surprisingly problematic measure of model performance (1) (2). However, I have a suspicion that you have a data leak and can deal with the performance metric later. – Dave May 29 '23 at 17:19
  • Dear @Dave, thank you for your response. I updated my question by adding the example code that I used for my dataset, in case you think any of the steps I used might contribute to a data leak. I will be checking other documentation to find any possible source. – eraysahin May 29 '23 at 20:32