
I have trained eight models using 10-fold cross-validation and evaluated them with the resampling technique described here. The results show that the SVM with a sigmoid kernel (SVM-s) and the random forest (RF) are the best models, while the penalised regressions have the lowest performance:

[Plot: cross-validation resampling results for the eight models]

However, when I predicted the target variable on the test set, the ROC curves showed a very different ranking of model performance. The SVM turned out to have the lowest AUC, and the penalised regressions outperformed naive Bayes, the neural network and CART.

Test-set AUC:
  • Lasso, RF, elastic net, ridge: 0.75
  • Naive Bayes, neural network: 0.70
  • Classification tree, SVM: 0.65

Are there any good explanations for this almost opposite ranking of model performance?

Nina

1 Answer


Yes, this is a classic example of "overfitting." At a high level, this means that your model is "learning" the training set so well that it's "memorized" features that don't work as well in the testing set. In fact, powerful classifiers like deep neural networks will often get 100% accuracy on the training set, and nowhere near that figure on the validation set.

Ideally, when selecting a model, we want our model to 1) learn during training and 2) generalize to unseen data (validation set). We'd like to see a model that has good training performance, and a small gap between train and validation performance. For your metric (AUC-ROC), I'd calculate the area between curves as discussed here to quantify how much your models are overfitting so you can make a better decision.
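If it helps, here's a minimal sketch of that idea in scikit-learn terms (my own stand-in, not necessarily the exact method in the linked answer; `clf`, `X_train`, `y_train`, `X_test`, `y_test` are placeholders for your fitted model and data splits):

```python
# Sketch: quantify the train/test ROC gap as the area between the two ROC curves.
# `clf` is any fitted binary classifier with predict_proba; the data names are placeholders.
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

def roc_gap(clf, X_train, y_train, X_test, y_test):
    s_train = clf.predict_proba(X_train)[:, 1]   # scores for the positive class
    s_test = clf.predict_proba(X_test)[:, 1]

    fpr_tr, tpr_tr, _ = roc_curve(y_train, s_train)
    fpr_te, tpr_te, _ = roc_curve(y_test, s_test)

    # Interpolate both curves onto a common FPR grid so they can be compared pointwise,
    # then integrate the absolute vertical gap between them.
    grid = np.linspace(0, 1, 1000)
    gap = auc(grid, np.abs(np.interp(grid, fpr_tr, tpr_tr) -
                           np.interp(grid, fpr_te, tpr_te)))

    return {"train_auc": roc_auc_score(y_train, s_train),
            "test_auc": roc_auc_score(y_test, s_test),
            "area_between_curves": gap}   # large gap => strong overfitting signal
```

A model with a large area between its training and test ROC curves is behaving very differently on unseen data, which is exactly the symptom you're describing.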

I've left some (hopefully) intuitive explanations of this effect below for your reference.

Some Intuition

Intuition with model accuracy. I'm going to talk about this in the context of accuracy. I think the image below, from Wikipedia, demonstrates the general idea quite well: suppose you're doing a binary classification task (separating blue and red data points), and the image shows the training set. The green classifier has 100% training accuracy: it draws a (very twisty) boundary between the two classes.

Green classifier overfits. However, we might intuit instead that the separation between red and blue data points is better explained by the black boundary. Based on this intuition, it seems like the green classifier "overcomplicated" the problem: there are going to be some regions, which may appear in our test set, where the simpler explanation (the black boundary) works better! The green classifier is very sensitive to small changes in the data.

[Image from Wikipedia: an overfitted, twisty green decision boundary vs. a smoother black boundary separating two classes]
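If you want to reproduce this picture numerically, here's a small sketch on synthetic data (my own illustrative example, not from the original figure; the exact numbers will vary): an RBF-SVM with a huge gamma plays the role of the twisty green boundary, and a linear SVM plays the black one.

```python
# Synthetic illustration: an over-flexible RBF-SVM (the "green" boundary) vs. a
# constrained linear SVM (the "black" boundary). gamma=500 is deliberately extreme.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

twisty = SVC(kernel="rbf", gamma=500).fit(X_tr, y_tr)   # essentially memorises the training points
simple = SVC(kernel="linear").fit(X_tr, y_tr)           # smoother, more constrained boundary

for name, model in [("twisty (green)", twisty), ("simple (black)", simple)]:
    print(f"{name}: train acc = {model.score(X_tr, y_tr):.2f}, "
          f"test acc = {model.score(X_te, y_te):.2f}")
```

The twisty model scores near 100% on the training points but falls off on held-out data; the simple model scores worse on the training set but holds up much better on the test split.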

Bias-Variance Tradeoff

That "almost-opposite" effect that you notice here is the bias-variance tradeoff, which can be derived by decomposing the mean squared error of a model. The bias is the error resulting from inherent limitations/assumptions in the model; variance is the sensitivity of the model to small data changes.

Intuition: why is there a tradeoff? A "simple" model may not be able to capture the underlying differences between data classes (high bias), but isn't very sensitive to small changes in the data (low variance), while a complex model might be able to separate the data classes well (low bias) but runs the risk of memorizing noise in the training data, which makes it brittle (high variance).
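One way to see the tradeoff concretely is to sweep a single complexity knob and watch the train/test gap open up. Here's a sketch on synthetic data (the dataset and depth values are my own choices), using decision-tree depth as the knob:

```python
# Sketch: sweep decision-tree depth -- training AUC keeps climbing while test AUC
# eventually turns around once the tree starts memorising noise (flip_y adds label noise).
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in [1, 2, 4, 8, 16, None]:   # None = grow until pure (most complex)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, tree.predict_proba(X_tr)[:, 1])
    auc_te = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
    print(f"max_depth={depth}: train AUC={auc_tr:.2f}, test AUC={auc_te:.2f}")
```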

Some of the models you've posted are overfitting, demonstrating low bias and high variance. Regularized regression methods like ridge or LASSO are specifically designed to combat overfitting, so I'm not too surprised by your results.
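As a rough scikit-learn stand-in for those penalised regressions (the data and C values here are just assumptions for illustration), you can see the penalty doing its job when there are many noisy features and few samples:

```python
# Sketch: with many noisy features and few samples, a nearly unpenalised logistic
# regression overfits badly, while ridge- and lasso-style penalties keep the test
# AUC much closer to the training AUC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=100, n_informative=5,
                           flip_y=0.05, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

models = {
    "almost no penalty": LogisticRegression(C=1e6, max_iter=5000),
    "ridge-like (L2)":   LogisticRegression(penalty="l2", C=0.1, max_iter=5000),
    "lasso-like (L1)":   LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model).fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, pipe.predict_proba(X_tr)[:, 1])
    auc_te = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
    print(f"{name}: train AUC={auc_tr:.2f}, test AUC={auc_te:.2f}")
```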

  • Thank you very much for the reply, amazing answer and very helpful indeed. :) – Nina Jul 30 '21 at 20:11
  • Probably a quite newbie question: what should one do with overfitting that happened to certain models (e.g. the SVM sigmoid)? A quick search online suggests (1) increase sample size for the test set, (2) k-fold cross validation or (3) resampling. I've tried all three approaches but they did not seem to solve the problem. Any suggestions? :) – Nina Jul 30 '21 at 20:28
  • No, these are fine questions. Think of it this way: overfitting is a thing that can happen given some model and some training set. So doing stuff to the test set is a little orthogonal: 1) and 2) will only help you catch overfitting more reliably. To combat it, you need to either 1) do something to your model or 2) do something to the training set. – chang_trenton Jul 30 '21 at 20:36
  • I don't quite understand what 3) is -- "resampling" could mean a lot of things, so if you have a link to the specific method (or even the code) I might be able to help more. – chang_trenton Jul 30 '21 at 20:36
  • But back to changing the model: as a heuristic, you can think of more "complicated" models as ones that have more parameters (this doesn't always hold true!) -- for example, I'd expect neural networks to overfit pretty badly here. Alternately, "simpler" models are ones that make some assumptions about your data or constrain the model somehow -- i.e. a small parameter norm (as in ridge) or sparsity (as in LASSO). For a random forest, a simpler model might be shallower, or have fewer trees (see the sketch after these comments). SVMs I haven't played with much recently and I'd have to review the math. – chang_trenton Jul 30 '21 at 20:39
  • And lastly, for changing the training set -- you can make it larger, either by manually collecting more data, or, if viable, generating synthetic data using SMOTE, data augmentation, or related techniques (depending on your task). – chang_trenton Jul 30 '21 at 20:39
  • At a glance, this answer seems to discuss SVMs and overfitting at a much deeper level than I can. – chang_trenton Jul 30 '21 at 20:41
  • Wow, thank you so much for the great answers! And the link :). Yes, I have oversampled the training set with SMOTE, and resampling is basically the same as k-fold CV (there should have been only two points there). It's also a good idea to test simpler models! I'll try it. There seems to be a paradox here: we tune the hyperparameters to improve the models' accuracy and get a better fit; on the other hand, it leads to overfitting... – Nina Jul 30 '21 at 21:04
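A minimal sketch of the "simpler random forest" heuristic from the comments above (scikit-learn names; the specific depth and tree counts are arbitrary examples, and `X_train`/`y_train` are placeholders):

```python
# A deliberately constrained random forest: shallower trees, fewer of them, and a
# minimum leaf size all limit how much noise each tree can memorise.
from sklearn.ensemble import RandomForestClassifier

simpler_rf = RandomForestClassifier(
    n_estimators=100,      # modest number of trees
    max_depth=5,           # shallow trees -> smoother decision boundaries
    min_samples_leaf=10,   # each leaf must cover several samples
    random_state=0,
)
# simpler_rf.fit(X_train, y_train)   # X_train / y_train are placeholders for your data
```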