
I hear countless times at my job that the gap between train and test implies that there's overfitting. Some people go as far as saying that the goal of model selection is to reduce the gap between train and test. The same people also say that a model with a lower train and test gap but worse test performance is a model that generalizes better.

To me, this seems like a profound misunderstanding of what overfitting and generalization mean, and of the bias-variance trade-off.

I took this graph from The Elements of Statistical Learning. We see in this image that, as we vary complexity, the best model is one where the gap between train and test is relatively high, at least with respect to all the models to its left. If we did what my coworkers suggest, which is to minimize the gap between train and test, we'd almost always be selecting significantly underfit models.

Am I crazy? Are my coworkers right? Please, I need resolution.

[Figure: prediction error vs. model complexity, from The Elements of Statistical Learning]

Nick Corona
  • See also https://datascience.stackexchange.com/q/66350/55122 – Ben Reiniger Mar 12 '21 at 01:37
  • Seems there's a lot of disagreement on that thread. I would choose the first model, provided you know the validation set is large enough to make the 1% difference statistically significant. I also wouldn't call the first model overfit, because to me, and also in academia, overfitting means adding complexity for no reason, i.e., generalization doesn't improve; in the case of model A it clearly did improve, even if it took a lot more complexity to get there. – Nick Corona Mar 12 '21 at 01:43

1 Answer


I hear countless times at my job that the gap between train and test implies that there's overfitting.

This is true, but that does not mean the model is a bad one.

Some people go as far as saying that the goal of model selection is to reduce the gap between train and test. The same people also say that a model with a lower train and test gap but worse test performance is a model that generalizes better.

I don't understand the rationale here. Why would I use a model that has a larger generalization error than an alternative model? The advice not to overfit is good, but that does not mean all overfitting is bad. The idea that there is some distinct line in the sand beyond which we can say we have overfit is dubious. There are degrees of overfitting, and you have to determine whether the amount you have is acceptable.

Even very simple models overfit. Here is an example of that happening:

```r
library(rms)

x = rnorm(100)
y = 2*x + 1 + rnorm(100, 0, 3)
model = ols(y ~ x, x = T, y = T)
validate(model)
```

```
          index.orig training    test optimism index.corrected  n
R-square      0.2493   0.2640  0.2365   0.0274          0.2219 40
MSE          10.4684  10.0754 10.6462  -0.5707         11.0391 40
g             2.1228   2.1498  2.1228   0.0270          2.0958 40
Intercept     0.0000   0.0000  0.0238  -0.0238          0.0238 40
Slope         1.0000   1.0000  0.9942   0.0058          0.9942 40
```

This is possibly the best scenario you can find yourself in: I've got the likelihood and the functional relationship correct. And what do we see? The optimism in every metric is non-zero. Has the model overfit? Yes, even though it is the "right" model. Does this mean the model is bad? It depends; overfitting is a spectrum, not a dichotomy.

Anyway, let me offer a definitive answer. All models overfit to a degree. That I get a different model when I train on a different dataset means that any given model will have some performance degradation when used on new data. The size of that degradation is a function of sample size and model complexity, but a model can have performance degradation and still be a good model (cue the Box quote). It is the job of the analyst to estimate the degradation and decide whether the model is still sufficiently good at its job to be used. If your coworkers are purposefully choosing models simply because they have 0 degradation between train and test, even when models with superior generalization error are available, quite frankly I would stop listening to what they have to say.
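Estimating that degradation is what `validate` is doing above: Efron's optimism bootstrap. It can be sketched in a few lines (a Python stand-in, not the rms implementation; the helper names are mine, and I use more resamples than the 40 shown in the `n` column above for stability):

```python
# Sketch of the optimism bootstrap behind rms::validate:
# refit on a bootstrap resample, score that refit on both the resample
# ("training") and the original data ("test"), and average the difference.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 2 * x + 1 + rng.normal(0, 3, size=n)  # same data-generating process as the R example

def fit_ols(x, y):
    # intercept + slope via least squares
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def mse(beta, x, y):
    return float(np.mean((y - (beta[0] + beta[1] * x)) ** 2))

apparent = mse(fit_ols(x, y), x, y)  # index.orig: error on the data the model saw

B = 200
optimism = 0.0
for _ in range(B):
    idx = rng.integers(0, n, n)
    b = fit_ols(x[idx], y[idx])
    # training error minus test error for this resampled fit
    optimism += (mse(b, x[idx], y[idx]) - mse(b, x, y)) / B

corrected = apparent - optimism  # index.corrected: honest estimate on new data
print(apparent, optimism, corrected)
```

For MSE the optimism comes out negative (training looks better than test), so the corrected estimate is worse than the apparent one, just as in the `validate` output: a non-zero optimism for even a correctly specified linear model.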

  • Tell me if you agree with what I'm about to say. I think there's a distinction between overfitting and memorization. I think the two concepts are slightly different. Memorization is what your model is doing to the training data. Some of what it "learns" is correct and some of it is noise. Overfitting is when you've gone too far with memorizing the training data and now generalization worsens. Put another way: it's when you add complexity to a model and it either worsens its generalization or keeps it the same. – Nick Corona Mar 12 '21 at 00:50
  • I couldn't fit it in the first comment, but thank you for your detailed answer. It does add clarity. – Nick Corona Mar 12 '21 at 00:56
  • @NickCorona I'm not really interested in exact definitions of overfitting; the important point is that a model which is overfit will fail to generalize. "Fail to generalize" admits degrees of severity, and my point is that every model will fit its training data too well by design. That doesn't make it a bad model. – Demetri Pananos Mar 12 '21 at 00:56
  • I understand. It's just that I'm wondering if an imprecise use of language is why people might fall into the trap of confusing the gap between train and test with generalization itself. I'm not too interested, either, but I've become interested in precisely defining it because I'm trying to persuade other people at work to improve their work quality. – Nick Corona Mar 12 '21 at 01:00
  • @NickCorona There is a tradeoff I like to talk about called The Precision-Usefulness trade off. It states that exactly precise statements (like statistical ones) are useless because they are far removed from what we want to use them for. We can make those statements more useful by making them less precise. I leave it to you to find the optimal balance of precision and usefulness for you and your coworkers ;) – Demetri Pananos Mar 12 '21 at 01:02
  • Fair point, but in this case I think being more precise is useful. I also accept that overfitting is a spectrum. But I would classify overfit models as ones to the right of the minimum in the graph I linked above, of which there are many. If you're straddling the local minimum, then I would just say there's uncertainty as to whether you're overfitting or not. The value of seeing it this way is that you don't use a word with a negative connotation to describe a model that generalizes well. Or, in your case, a model that is on the mark. – Nick Corona Mar 12 '21 at 01:57