6

Setup

Let's assume we have data $(x_{i},y_{i})_{i=1}^N$ and that there is a true functional relation $y_{i}=g(x_{i})$, where $g$ is an arbitrarily complex function (like $g(x)=\sin(x^{2}\cdot \ln(\tanh(x)))+x e^x+\dots$).

Now I want to fit a feed-forward NN $f_{\theta}$ with width $W$ and depth $D$ to the data to approximate $g$ as well as possible, i.e. we minimize $$l(x,y)=\frac{1}{N}\sum_{i=1}^N(f_{\theta}(x_{i})-g(x_{i}))^2$$ with gradient descent.
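For concreteness, here is a minimal sketch of this setup in PyTorch; the choice of $g$, the width $W=32$, depth $D=2$, activation, learning rate, and number of steps are placeholder choices, not part of the question:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Noiseless data: y_i = g(x_i); g, N, and the x-grid are placeholder choices
N = 200
x = torch.linspace(-2.0, 2.0, N).unsqueeze(1)
y = torch.sin(x ** 2)

# Feed-forward network f_theta with D hidden layers of width W (placeholder sizes)
W, D = 32, 2
layers = [nn.Linear(1, W), nn.Tanh()]
for _ in range(D - 1):
    layers += [nn.Linear(W, W), nn.Tanh()]
layers += [nn.Linear(W, 1)]
f_theta = nn.Sequential(*layers)

# Plain (full-batch) gradient descent on the mean squared error l(x, y)
opt = torch.optim.SGD(f_theta.parameters(), lr=1e-2)
for step in range(5000):
    opt.zero_grad()
    loss = ((f_theta(x) - y) ** 2).mean()
    loss.backward()
    opt.step()

print(loss.item())  # final training loss
```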

What I don't mean

First I want to clarify what I don't mean. For noisy data I understand that more data is better, basically due to the law of large numbers. I also know that in statistical learning there are bounds for empirical risk minimization, i.e. if the data is i.i.d., the true risk is bounded in terms of the empirical risk [1]: $$L_{p}(f_{\theta}) \leq \hat{L}_{p}(f_{\theta})+2 \mathcal{R}_{m}(\mathcal{F})+\mathcal{O}\left(\sqrt{\frac{\ln (1 / \delta)}{m}}\right)$$

Question

However, now we have data without uncertainty. Let's assume we only train on the data up to $1 \ll n \ll N$. Is there a general theorem showing that training on $(x_{i},y_{i})_{i=1}^n$ will lead to a worse fit, i.e. that we will not be able to find the same ideal parameters as with $(x_{i},y_{i})_{i=1}^N$, or is only the convergence rate slower?

My intuition

My intuition is that, given $(x_{i},y_{i})_{i=1}^n$ and $(x_{i},y_{i})_{i=1}^N$ contain the same "information" (I know this is a fuzzy term), using $(x_{i},y_{i})_{i=1}^N$ should not improve the fit. It should even make generalisation worse, since we overfit.

NicAG
  • 171
  • I think you should define precisely what you mean that a sample of size $n \ll N$ has the "same information". It doesn't jive with any statistical definition of information that I know of. – AdamO Dec 15 '22 at 06:22
  • @AdamO One way to go about it would be to say that, doing a Taylor expansion with both datasets, the relative error between these Taylor expansions is smaller than some constant $\delta$. – NicAG Dec 15 '22 at 10:47
  • This will depend on what model exactly you are fitting. If the data is precise, you need as many observations as are required to determine the model (in many cases that's the number of degrees of freedom of the model). As long as you don't have enough data, having more will be better, but once you have enough, there's no improvement possible anymore. – Christian Hennig Dec 15 '22 at 13:14
  • Since you assume no noise, this is basically a question about interpolation. Many interesting things are known regarding the interpolation properties of neural networks, see, e.g., here, here, and here. (I wasn't able to quickly find a precise answer to the question, otherwise, I would have posted it.) I could also imagine that some general results from interpolation theory would apply here as well? – Eike P. Mar 13 '23 at 23:06
  • Theorems 3 and 4 in this paper might answer your question? (I only skimmed the paper.) – Eike P. Mar 13 '23 at 23:16

3 Answers

6

You are right, it is not only about the size of the dataset. As the two other answers point out, having more data (vs very little) is desirable, since even in a noiseless scenario it may help you to get a more precise estimate. On the other hand, as you argue, it is also about the quality of the data.

There's a great lecture by Xiao-Li Meng titled "Statistical paradises and paradoxes in Big Data"; it was recorded and is available on YouTube. He argues that it is not only about quantity, but also quality. Having a lot of very poor quality or biased data is not necessarily great. As he shows, what you gain from quantity grows very slowly. That's one of the reasons why we need all those poor-quality, non-randomly sampled datasets scraped from the internet to be so big. On the other hand, as he argues, having a large dataset can lead to overly optimistic error estimates, which could give us a false sense of the estimates being precise. So in fact, there may be scenarios where "too much" data is not great.

So it is not (only) about quantity, but also about whether the data is relevant and unbiased, how noisy it is, whether it was randomly sampled, etc., i.e. about the quality of the data. If you have good quality data, more is better, though at some point you hit diminishing returns, where getting more data is simply not worth it.

Tim
  • 138,066
5

Intuitively, having more data will tell the neural network where to turn, by how much, and in what direction (up/down, left/right, combinations of these, their extensions in high-dimensional spaces, etc.).

Imagine your true function to be a parabola. However, you only have two data points. You have no way to capture the curvature. You cannot figure out if the parabola opens up or down. You cannot figure out how wide the parabola is. When you add a third point, you can start to figure out some of this. However, that assumes you know the shape to be a parabola. If you do not know the function, how can you distinguish that from something like an absolute value function that uses straight lines?
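For illustration, here is a small numpy sketch of that ambiguity; the true function $x^2$ and the competing candidates $2 - x^2$ and $|x|$ are my own arbitrary choices:

```python
import numpy as np

# Candidate functions that all pass exactly through the two samples (-1, 1) and (1, 1)
candidates = {
    "x**2 (true parabola, opens up)": lambda x: x ** 2,
    "2 - x**2 (opens down)":          lambda x: 2 - x ** 2,
    "|x| (straight lines)":           lambda x: np.abs(x),
}

x_two   = np.array([-1.0, 1.0])        # two noiseless samples: indistinguishable
x_three = np.array([-1.0, 0.0, 1.0])   # adding x = 0 rules out the downward parabola

for name, f in candidates.items():
    print(f"{name:32s}  on 2 points: {f(x_two)}   on 3 points: {f(x_three)}")
# On two points every candidate gives [1. 1.]; the third point separates
# 2 - x**2 (value 2 at x = 0) from x**2 and |x| (both 0 at x = 0),
# while x**2 and |x| still cannot be told apart.
```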

By having more data, you provide more opportunities to penalize the network for turning incorrectly, even if the fit on fewer points is perfect.

This sounds like resolution, and I suspect that there is a way to tie this notion to the Nyquist rate and the Nyquist–Shannon sampling theorem in signal processing (or some generalization to higher-dimensional spaces) to bring full mathematical rigor.
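As a toy illustration of that sampling-theorem intuition (the sampling rate and frequencies below are arbitrary choices of mine), two different sinusoids can agree on every sample when sampled below the Nyquist rate, so no fit to those samples could tell them apart:

```python
import numpy as np

fs = 8.0                      # samples per unit time
t = np.arange(16) / fs        # sample locations

g1 = np.sin(2 * np.pi * 1.0 * t)          # 1 Hz sine (below the Nyquist frequency fs/2)
g2 = np.sin(2 * np.pi * (1.0 + fs) * t)   # 9 Hz sine (above fs/2): aliases onto 1 Hz

print(np.allclose(g1, g2))    # True -- the two functions agree on every sample
```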

Dave
  • 62,186
  • The Nyquist rate and the sampling theorem possibly don't apply here because that function $g(x)$ probably has a Fourier transform without a finite bound (i.e. $g$ is not band-limited). – Sextus Empiricus Mar 14 '23 at 07:12
  • Another crucial difference from the usual sampling-theory setting is that here, the sampling will not be uniform. – Eike P. Mar 14 '23 at 12:27
5

My intuition is that, given $(x_{i},y_{i})_{i=1}^n$ and $(x_{i},y_{i})_{i=1}^N$ contain the same "information" (I know this is a fuzzy term), using $(x_{i},y_{i})_{i=1}^N$ should not improve the fit. It should even make generalisation worse, since we overfit.

At first I was skeptical, thinking 'how can you overfit if there is no noise?'. But with a misspecified model this is possible. Here is a situation where more data can lead to "overfitting":

[Figure: example of the effect described below]

It is not overfitting due to fitting random noise, but overfitting due to the model being misspecified: the bias of the interpolation increases as the fitted curve becomes less smooth (this is a case with polynomials, but a similar effect may occur when fitting with some NN, or the NN might even contain those polynomials as output).
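For a concrete toy version of this effect (my own choice of example, using the classic Runge function rather than the exact setup in the figure): interpolating $n$ noiseless, equispaced samples of a non-polynomial function with a polynomial of degree $n-1$ gives a fit that gets worse as $n$ grows.

```python
import numpy as np
from numpy.polynomial import Polynomial

g = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)    # true function; not a polynomial
x_dense = np.linspace(-1.0, 1.0, 2001)       # dense grid to measure the fit

for n in (5, 10, 15, 20):
    x = np.linspace(-1.0, 1.0, n)            # n noiseless, equispaced samples
    p = Polynomial.fit(x, g(x), deg=n - 1)   # interpolating polynomial of degree n - 1
    err = np.max(np.abs(p(x_dense) - g(x_dense)))
    print(f"n = {n:2d}   max |f - g| on [-1, 1] = {err:.2f}")
# The training error is (numerically) zero for every n, yet the maximum error on
# the dense grid grows with n: the interpolant oscillates near the endpoints
# (Runge's phenomenon), i.e. more noiseless data makes the misspecified fit worse.
```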

It is not unimaginable that this situation could be remedied by regularization and cross-validation. In that case one might expect that more data should reduce the bias.

  • Not really a complete answer (I don't know of a general theorem), but an extended comment on the intuition. – Sextus Empiricus Mar 14 '23 at 09:13
  • In which sense would such a model be "misspecified"? I know misspecification as meaning that the true relationship cannot be correctly described by the model. But the situation you describe can arise even if this is not the case. It's just that there is an infinite amount of possible function trajectories to choose from. This seems to be more about a lack of useful inductive biases (you mention regularization) and less about misspecification to me. – Eike P. Mar 14 '23 at 12:26
  • @EikeP. it is misspecified in the sense that a polynomial is fitted to a function that is itself not a polynomial. – Sextus Empiricus Mar 14 '23 at 13:20
  • ... but a polynomial can describe it with arbitrary precision (as per the various universal approximation theorems). – Eike P. Mar 14 '23 at 13:50
  • @EikeP. you mean that the Taylor series, a polynomial, would correctly describe it if you just add enough terms? But that requires one to add a very large number of terms. For a finite-sized problem, it can be that there is some point where increasing the number of terms will make the result worse rather than better, because it might lead to this "overfitting" when more high-frequency components are added. This would not occur when we fit the model with the 'actual true parametric function', where we only need to estimate a few parameters. – Sextus Empiricus Mar 14 '23 at 13:59
  • Also, that sine function might be problematic. I can imagine that there are functions similar to $\sin(1/x)$, which has oscillations of ever-increasing frequency, for which a Taylor series diverges rather than converges. – Sextus Empiricus Mar 14 '23 at 14:03
  • That's exactly my point. If we already know the right functional form of our data, we only need to estimate the parameters. In a no-noise setting, there should be no difference in the estimation whether we use $n$ or $N$ data points (given they are enough to determine the parameters). However, for a model which does not have the right functional form, more data should generally improve the fit. But I think there is also a point where adding more data will not improve the fit. – NicAG Mar 15 '23 at 04:09
  • I would be really interested in a more general mathematical approach showing the relation between the number of data points I have and how well the model will approximate the true function. – NicAG Mar 15 '23 at 04:10