
I stumbled upon the paper "Reconciling modern machine learning practice and the bias-variance trade-off" and do not completely understand how the authors justify the double descent risk curve (see below) described in it.

[Figure: the double descent risk curve from the paper]

In the introduction they say:

By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that have smaller norm and are thus "simpler". Thus increasing function class capacity improves performance of classifiers.

From this I can understand why the test risk eventually decreases as the function class capacity grows.

What I don't understand with this justification, however, is why the test risk first increases up to the interpolation point and only then decreases again. And why is it exactly at the interpolation point that the number of data points $n$ equals the number of function parameters $N$?

I would be happy if someone could help me out here.

Gilfoyle

1 Answer


The main point about Belkin's Double Descent is that, at the interpolation threshold, i.e. the smallest model capacity at which you can fit the training data exactly, the set of interpolating solutions is very constrained. The model has to "stretch" to reach interpolation with its limited capacity.

When you increase capacity further than that, the space of interpolating solutions opens up, actually allowing optimization to reach lower-norm interpolating solutions. These tend to generalize better, and that's why you get the second descent on test data.
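
If it helps to see this mechanism in numbers, here is a minimal sketch (not code from the paper; the feature map, sample sizes, noise level, and target function are arbitrary illustrative choices): minimum-norm least squares on random Fourier features, where the model capacity is the number of random features N and the interpolation threshold sits at N = n_train.

```python
# Minimal sketch (not code from the paper): double descent with random Fourier
# features and a minimum-norm least-squares fit. The target function, sample
# sizes, noise level, and feature counts below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)  # the "true" function we pretend not to know

n_train, n_test, noise = 40, 500, 0.1
x_train = rng.uniform(-1, 1, n_train)
y_train = target(x_train) + noise * rng.standard_normal(n_train)
x_test = rng.uniform(-1, 1, n_test)
y_test = target(x_test)

def rff(x, w, b):
    """Random Fourier features: phi_j(x) = cos(w_j * x + b_j)."""
    return np.cos(np.outer(x, w) + b)

for N in [5, 10, 20, 40, 80, 160, 640]:  # model capacity = number of features
    w = rng.normal(0.0, 10.0, N)
    b = rng.uniform(0.0, 2 * np.pi, N)
    Phi_train, Phi_test = rff(x_train, w, b), rff(x_test, w, b)
    # pinv gives the ordinary least-squares solution for N < n_train and the
    # *minimum-norm* interpolating solution once N >= n_train.
    coef = np.linalg.pinv(Phi_train) @ y_train
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"N = {N:4d}  ||coef|| = {np.linalg.norm(coef):10.2f}  test MSE = {test_mse:.3f}")
```

On a typical run the printed coefficient norm and the test MSE both peak near N = n_train and fall again for much larger N, though the exact numbers depend on the random seed and the settings above.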

Firebug
  • What is worth highlighting is that this is just a guessxplanation, a hypothesis that cannot really be proven or disproven. This is how the authors try to explain the phenomenon. – Tim Jun 23 '21 at 12:13
  • @Tim It's a formalized hypothesis. They gave multiple experiments corroborating the finding. – Firebug Jun 23 '21 at 12:16
  • @Firebug Thank you for answering. However, your answer raises some more questions. By "number of solutions", do you mean predictors that can fit the training data exactly? If so, why is the number of solutions so constrained at exactly this point? Why is it that at this point the training data can be fitted exactly? And why can lower-norm interpolating solutions be found beyond the interpolation threshold? – Gilfoyle Jun 23 '21 at 18:39
  • @Tim Can you think of promising approaches to prove or disprove the hypothesis? Wouldn't it be possible to test it on very large networks to support it if it is true? – Gilfoyle Jun 23 '21 at 18:42
  • @Samuel this is what the paper you refer to & similar ones try to achieve. – Tim Jun 23 '21 at 18:56
  • @Samuel the "number of solutions" is the number of parameter sets, not predictors, that achieve a given loss. So the number of solutions at the interpolation threshold is the number of networks that interpolate the training data. This set is much smaller than the set after this threshold is surpassed. – Firebug Jun 24 '21 at 10:57
  • @Samuel "Why is it that at this point the training data can be fitted exactly?" Because that's the definition of the interpolation threshold: the point of least capacity at which the training data can be interpolated. At higher capacity, the network parameters have more leeway in how they achieve this interpolation, leading to lower-norm solutions. – Firebug Jun 24 '21 at 10:58
  • @Firebug If we are at the interpolation threshold, where the training loss is zero, how do we move to the better solutions if we are already in a minimum? Assuming a neural net is trained with gradient descent, if we are in a minimum the gradient will be zero and we won't get any update. So how do we pass the interpolation threshold? – ado sar Jul 02 '23 at 23:45
  • @adosar weight decay – Firebug Jul 03 '23 at 07:11
  • @Firebug I am really confused since there are sources reporting that this phenomenon also happens without regularization. – ado sar Jul 03 '23 at 11:52
  • @adosar I looked at your link and could not find a source, could you kindly point me to one? Also, keep in mind double descent is not only about epochs; it also occurs with other forms of "model capacity", where weight decay wouldn't necessarily play a role. – Firebug Jul 03 '23 at 14:07
  • @Firebug Check the first sentence of the link. – ado sar Jul 03 '23 at 14:11
  • @adosar I unfortunately wasted my time reading through one of the papers already, and it didn't pertain to weight decay at all. It would be nicer if you could point out which one specifically shows double descent in maximum epochs without weight decay. – Firebug Jul 03 '23 at 14:13
  • @adosar I think the confusion is about the x axis there? Commonly we see plots with epochs plotted against training loss. In this case the x axis is not epochs, it's the number of parameters in the model and each model has been trained until it stops improving. In order to reach interpolation, you need to have enough parameters in the model to do so. Models with fewer parameters cannot reach 0 training loss no matter how long they are trained. – ttbek Jul 26 '23 at 08:27
  • @ttbek Thanks for pointing that out! So in the overparameterized regime there are better (smoother, better-generalizing) candidates for approximating the true underlying function, and as such we can get a better model? At least this is how I understand it. – ado sar Jul 26 '23 at 11:29
  • @adosar That's roughly how I think about it. If we think about interpolating polynomials (just for analogy's sake), they tend to have bad edge effects in between points when they have exactly the number of parameters needed to interpolate a given set of points; a small numerical sketch follows after these comments. That sort of behavior is mitigated by Chebyshev polynomials, but not totally eliminated. On the other hand, spline interpolation (polyharmonic splines) doesn't tend to have these issues. – ttbek Jul 26 '23 at 12:07
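
To make the polynomial analogy in the last comment concrete, here is a minimal numerical sketch (the Runge function, the node count, and the node placement are arbitrary choices for illustration): it interpolates the same points with a degree n-1 polynomial and with a cubic spline and compares their worst-case errors.

```python
# Minimal sketch of the polynomial analogy (arbitrary illustrative choices):
# interpolate the classic Runge function at n equally spaced nodes with a
# degree n-1 polynomial and with a cubic spline, then compare worst-case errors.
import numpy as np
from scipy.interpolate import CubicSpline

def runge(x):
    return 1.0 / (1.0 + 25.0 * x ** 2)

n = 11
x_nodes = np.linspace(-1, 1, n)
y_nodes = runge(x_nodes)
x_dense = np.linspace(-1, 1, 1001)

# Degree n-1 polynomial: exactly enough parameters to interpolate the n nodes.
poly_coef = np.polyfit(x_nodes, y_nodes, deg=n - 1)
poly_err = np.max(np.abs(np.polyval(poly_coef, x_dense) - runge(x_dense)))

# Cubic spline through the same nodes: piecewise, low-degree, much tamer.
spline = CubicSpline(x_nodes, y_nodes)
spline_err = np.max(np.abs(spline(x_dense) - runge(x_dense)))

print(f"max error, degree-{n - 1} interpolating polynomial: {poly_err:.3f}")
print(f"max error, cubic spline through the same nodes:     {spline_err:.3f}")
```

On a typical run the polynomial's worst-case error is far larger than the spline's, with the damage concentrated near the edges of the interval.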