
My RandomForest model trains to 100% accuracy on the training set (70,000 rows) and 99.9% accuracy on the test set (30,000 rows).

Can a model still be "overfit" if it is hitting 99.9% on the hidden test set on Kaggle (i.e., 30,000 rows withheld on Kaggle by the instructor)?

Running predict(randomForest.fit, newdata=test, type='prob') gives probabilities that are all very close to 1 or 0, like 0.99 or 0.1.
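
For reference, a minimal sketch of that setup, assuming the randomForest and pROC packages (train, test, and the binary factor outcome y are placeholder names):

    library(randomForest)
    library(pROC)

    fit <- randomForest(y ~ ., data = train)

    # type = 'prob' returns an n x 2 matrix of class probabilities;
    # type = 'response' returns hard class labels instead
    p <- predict(fit, newdata = test, type = "prob")[, 2]

    # ROC curve against the true test labels
    plot(roc(test$y, p))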

Plotting this as an ROC curve seems meaningless - it is a straight vertical line to the upper-left (perfect classifier) point, then straight to the right:

  .______
  |
  |
  |

Thanks in advance.

puf
  • How's your precision-recall curve? – jsaporta Nov 27 '20 at 02:13
  • A decent hack is to back off on the model size: reduce the max depth and require more samples per leaf until you see an error curve that you can operate against. Then you can set thresholds and parameters, and undo those hobbles. (See the sketch after these comments.) – EngrStudent Nov 27 '20 at 03:09
  • What's your definition of overfitting? Does this observation satisfy that definition? Why or why not? – Sycorax Nov 27 '20 at 04:40
  • @Jason Well, using the predict(type="response",..) output (YES/NO), the confusion matrix shows a single misclassified point out of 70,000, for a random forest of depth 5. Even if I set predict(type="prob",..), the probabilities are so skewed towards 0 or 1 that they all generate the same confusion matrix (i.e., the one with a single misclassified point). – puf Nov 27 '20 at 07:54
  • @EngrStudent Yes, that is a good idea; without other options, it may be worth a try. Thanks! – puf Nov 27 '20 at 07:57
  • @Sycorax Not sure -- I'm still just learning. I only have a single misclassified point out of 70,000 for a random forest of depth 5. Does that seem strangely accurate or not? I don't know. It seems overfit by intuition. – puf Nov 27 '20 at 07:58
  • If something seems too good to be true it's worth triple checking your code for bugs and data for leaks. – Tim May 30 '23 at 07:38
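
A rough sketch of EngrStudent's hobbling idea, assuming R's randomForest package (which has no direct depth argument; maxnodes and nodesize are its knobs for limiting tree size, and train and y are placeholder names):

    library(randomForest)

    # Deliberately small trees: few terminal nodes and a large minimum
    # leaf size, so the OOB error curve is visible enough to act on
    hobbled <- randomForest(y ~ ., data = train,
                            maxnodes = 8,     # cap on terminal nodes per tree
                            nodesize = 100)   # minimum samples per leaf

    # OOB error versus number of trees; inspect it, set thresholds,
    # then relax the constraints again
    plot(hobbled)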

2 Answers


Can a model still be "overfit" if it is hitting 99.9% on the hidden test set on Kaggle (i.e., 30,000 rows withheld on Kaggle by the instructor)?

Consider a situation with a strong class imbalance, where $99.95\%$ of the cases belong to one class. In that situation, your accuracy, after going through all kinds of trouble to learn and implement fancy machine learning methods, is worse than what some jerk would get by predicting the majority category every time. Your model performance turns out to be quite poor, despite what appears to be a sky-high accuracy score, so pointing out, "Look at how good my holdout performance is! No overfitting here!" does not work.
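
To make the arithmetic concrete, a toy example with made-up counts matching the $99.95\%$ prevalence above:

    # Majority-class baseline under 99.95% prevalence (hypothetical counts)
    n       <- 100000
    n_major <- 99950
    n_major / n         # baseline accuracy: 0.9995

    # A model at 99.9% accuracy makes twice the baseline's errors
    n * (1 - 0.999)     # model errors: 100
    n - n_major         # baseline errors:  50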

It might be that you could achieve $99.97\%$ training accuracy and $99.96\%$ holdout accuracy with a simpler model, which would indicate overfitting according to a pretty standard definition where in-sample performance is improved at the expense of out-of-sample performance.

Despite the flaws of accuracy as a performance metric, I agree that $99.9\%$ on holdout data at least sounds impressive (though, depending on the prevalence, it might be poor). If you have reason to believe that $99.9\%$ accuracy really is good performance for this task, you might not care that you could achieve $99.92\%$ with a simpler model that has a worse in-sample score, despite the overfitting present in your model. (Whether you should be interested in the accuracy score at all is a separate matter.)

Dave
  • You notice the flaws in accuracy and then seem to accept 99.9% as an “impressive” result. It isn't unless you know the base rate. – Tim May 30 '23 at 07:36

Adding to Dave's answer: you should also check that your train and test data were truly separate, with no look-ahead bias or leakage.
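
A minimal sketch of one such check, assuming train and test are data frames with identical columns (rows that appear verbatim in both sets are a red flag for leakage):

    # Inner join on all shared columns finds rows present in both sets
    overlap <- merge(train, test)
    nrow(overlap)            # should be 0 for a truly separate split

    # Exact duplicates within each set can also inflate apparent accuracy
    sum(duplicated(train))
    sum(duplicated(test))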