I am currently working on a project where my cross-validation error rate is very low, but my test set error rate is high. This might indicate that my model is overfitting, but then why doesn't the overfitting show up in cross-validation, only on the test set?
More specifically, I have about 2 million samples with 100 variables (n >> p). I randomly split the dataset 80/20 into a train set and a test set. I then fit a model (XGBoost) using 5-fold cross-validation on the train set, and the estimated error rate is pretty low. Next, using the same parameter settings, I fit the model on the entire train set. Surprisingly, when I evaluate that model on the test set, the error rate is significantly higher than the CV error rate. WHY?
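Roughly, the workflow looks like the following sketch (toy data and placeholder parameters, not my actual project code):

```
# Sketch of the workflow: 80/20 split, 5-fold CV on the train set,
# refit on the full train set, then score once on the test set.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import log_loss
from xgboost import XGBClassifier

X = np.random.rand(10000, 100)            # placeholder for the real 2M x 100 data
y = np.random.randint(0, 5, size=10000)   # placeholder multiclass target

# random 80/20 split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(objective="multi:softprob", n_estimators=200, max_depth=6)

# 5-fold CV estimate of multinomial logloss on the train set
cv_loss = -cross_val_score(model, X_train, y_train, cv=5, scoring="neg_log_loss")
print("CV logloss: %.6f (+/- %.6f)" % (cv_loss.mean(), cv_loss.std()))

# refit with the same parameters on the entire train set, then evaluate on the test set
model.fit(X_train, y_train)
print("test logloss: %.6f" % log_loss(y_test, model.predict_proba(X_test)))
```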
+++ 1. Edit about the error rate +++
The error rate is actually multinomial logloss. The CV logloss is 1.320044 (+/- 0.002126) and the test set logloss is 1.437881. The two numbers might look close at first glance, but they are not: the range of performance in this project runs from about 1.55 down to about 1.30, so a gap of roughly 0.12 is substantial on that scale.
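For reference, by multinomial logloss I mean the average negative log-probability assigned to the true class; a toy example (made-up numbers, nothing to do with my data):

```
# Toy illustration of multinomial logloss: mean of -log(probability of the true class).
import numpy as np
from sklearn.metrics import log_loss

y_true = [0, 2, 1]                      # true class labels
probs = np.array([[0.7, 0.2, 0.1],      # predicted class probabilities per sample
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

manual = -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))
print(manual, log_loss(y_true, probs))  # both are about 0.52
```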
The 5-fold cross-validation procedure works as follows (see the sketch after this list):
- Divide the train set into 5 folds.
- Iteratively fit a model on 4 folds and evaluate its performance on the remaining fold.
- Average the performance over the five iterations.
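Done by hand (reusing the placeholder X_train and y_train from the sketch above), it would look something like this, printing each fold's logloss so the fold-to-fold spread is visible:

```
# Manual 5-fold CV with per-fold logloss, to see how much the folds vary.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss
from xgboost import XGBClassifier

fold_losses = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(X_train):
    m = XGBClassifier(objective="multi:softprob", n_estimators=200, max_depth=6)
    m.fit(X_train[train_idx], y_train[train_idx])
    proba = m.predict_proba(X_train[val_idx])
    fold_losses.append(log_loss(y_train[val_idx], proba, labels=np.unique(y_train)))

print("per-fold logloss:", np.round(fold_losses, 6))
print("mean: %.6f, std: %.6f" % (np.mean(fold_losses), np.std(fold_losses)))
```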
I mean, if my parameter settings make the model overfit, then I should already see that in this cross-validation procedure, right? But I don't see it until I use the test set. Under what circumstances on earth could this happen?
Thanks!
+++ 2. Added +++
The only reason I can think of for why the CV error rate would differ from the test set error rate is this:
"Cross-validation will not perform well on outside data if the data you do have is not representative of the data you'll be trying to predict!" -- here
But I randomly split the 2-million-sample data set 80/20, so I believe the train set and the test set should come from the same distribution of the variables.
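As a sanity check (again on the placeholder arrays from the sketch above), one could compare each variable between the two splits, e.g. with a two-sample Kolmogorov-Smirnov test; with millions of samples even tiny differences come out "significant", so the KS statistic itself is more telling than the p-value:

```
# Compare each variable's distribution in the train vs. test split.
import numpy as np
from scipy.stats import ks_2samp

ks_stats = np.array([ks_2samp(X_train[:, j], X_test[:, j]).statistic
                     for j in range(X_train.shape[1])])
print("largest KS statistic across all variables: %.4f" % ks_stats.max())
```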
+++ 3. Edit about data leakage +++
In the comments, @Karolis Koncevičius and @darXider raised an interesting guess: data leakage. I think this might be the devil here. What exactly is data leakage? How can I detect it, and how can I avoid it? I'll do more research about it.
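From what I gather so far, the Pipeline()/FeatureUnion suggestion in the comments amounts to keeping every fitted preprocessing step inside the cross-validation loop, so that nothing is ever fit on the data a fold is later evaluated on. A minimal sketch of that idea (the StandardScaler is just an illustrative stand-in for whatever preprocessing a project uses):

```
# Wrap preprocessing and model in one Pipeline, then cross-validate the whole
# pipeline: the scaler is refit inside each fold, so the held-out fold never
# leaks into the preprocessing.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

pipe = Pipeline([
    ("scale", StandardScaler()),                          # any fitted transform goes here
    ("model", XGBClassifier(objective="multi:softprob")),
])

cv_loss = -cross_val_score(pipe, X_train, y_train, cv=5, scoring="neg_log_loss")
print("leak-free CV logloss: %.6f" % cv_loss.mean())
```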
THANKS!
Comments:
– Karolis Koncevičius: "And as a first guess - you can try checking error rates in each fold of cross validation separately. Maybe the procedure has variable accuracy and your final classifier falls on the bad end of it. But with 2 million samples that should hardly be the case..." (Mar 01 '17)
– darXider: "... Pipeline()-ing and FeatureUnion-ing (if you are using Python, that is)." (Mar 01 '17)