How to know if GBM or XGBOOST are overfitting?

Question

With a random forest, I can check oob_score_ to see how much my model is overfitting. How can I check this for boosted-tree algorithms?

Currently, using GBM, I'm getting an accuracy of 95% on the training set and 87% on the test set. These numbers are after tuning on my validation set. How do I judge if this model is good enough or not?

Also, I ran 10-fold CV, and the variance of the score on the test set is around 1%.

UPDATE: I'm using SMOTE to balance my classes

"How do I judge if this model is good enough or not?" Good enough for what? What's the error tolerance on your problem? What are the costs for each kind of error? — Sycorax, May 01 '18 at 18:39
Ideally, the training score and the test score should be as close as possible. Sometimes this is not feasible, though.
Another note. You have to measure your test set accuracy on real data (i.e. without SMOTEing them). — Stergios, May 02 '18 at 08:02

score 1 · Answer 1 · answered May 01 '18 at 18:26


Without knowing the exact details of your cross-validation, and assuming you have a true 'test' set that was never used during any step of the training, I would say your model is about as good as it can get. A 95% accuracy is quite good for an internal (training) score, and 87% on the external test dataset isn't completely off base. There may be some overfitting, but as I said, it is difficult to say for certain. Keep in mind that your 'final' test accuracy is almost always going to be lower than your training accuracy (unless you have underfit).
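One way to look at the train/test gap more directly is to track accuracy on both sets after every boosting iteration. The following is only a minimal sketch using scikit-learn's GradientBoostingClassifier and its staged_predict method; the synthetic data from make_classification and all hyperparameter values are assumptions for illustration, not the asker's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption); replace with your own X and y.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Illustrative hyperparameters, not tuned values.
model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
model.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after each boosting stage,
# so each list below holds one accuracy value per stage.
train_acc = [accuracy_score(y_train, p) for p in model.staged_predict(X_train)]
val_acc = [accuracy_score(y_val, p) for p in model.staged_predict(X_val)]

# Stage with the best held-out accuracy, and the final train/validation gap.
best_iter = max(range(len(val_acc)), key=val_acc.__getitem__)
gap = train_acc[-1] - val_acc[-1]

print(f"best validation accuracy at stage {best_iter + 1}: {val_acc[best_iter]:.3f}")
print(f"final train accuracy: {train_acc[-1]:.3f}, validation: {val_acc[-1]:.3f}")
print(f"final gap: {gap:.3f}")
```

If the training accuracy keeps rising while the validation accuracy flattens or falls after some stage, the later trees are mostly fitting noise, and fewer estimators (or stronger regularization) may be worth trying. How large a gap is acceptable still depends on the error tolerance of the problem.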


cdeterman


Thanks, but what kind of details about the cross-validation are you looking for? I ran cross_val_score on my validation set; the model has only seen the training set, not the test/validation set. Also, I'm using SMOTE to balance my classes, not sure if that matters here. – Jaskaran Singh Puri May 02 '18 at 04:57

@JaskaranSinghPuri Is your test/validation data taken from the SMOTE-balanced data, as noted by Stergios? That final test needs to be on the 'real' data that is imbalanced. – cdeterman May 02 '18 at 13:27
When I wrote this question, I was applying SMOTE first and then splitting. After fixing that, I've seen a big drop in my accuracy. Now SMOTE is applied only to the training data, and the test set is the original data! – Jaskaran Singh Puri May 03 '18 at 06:27
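The order of operations described in these comments (split first, then balance only the training part) can be sketched as follows. Note the assumptions: this uses plain random oversampling with the standard library as a stand-in for SMOTE, the labels are a toy 90/10 example, and the helpers stratified_split and oversample_minority are hypothetical names written for this illustration.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        idxs = idxs[:]
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_fraction))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


def oversample_minority(indices, labels, rng):
    """Duplicate random rows of smaller classes until all classes match the largest."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    return balanced


rng = random.Random(0)
labels = [0] * 90 + [1] * 10  # toy imbalanced labels (assumption)

# 1. Split the original, imbalanced data first.
train_idx, test_idx = stratified_split(labels, 0.2, rng)

# 2. Balance the training part only; the test part is left untouched.
train_balanced = oversample_minority(train_idx, labels, rng)

train_counts = Counter(labels[i] for i in train_balanced)
test_counts = Counter(labels[i] for i in test_idx)
print("train (balanced):", dict(train_counts))
print("test (original):", dict(test_counts))
```

Because the test indices are fixed before any resampling, no duplicated or synthetic row derived from a test example can end up in the training data, and the test accuracy is measured on the real class distribution. With SMOTE the same ordering applies; only the resampling step differs.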
