Setup
Let's assume we have data $(x_{i},y_{i})_{i=1}^N$ that follow a true functional relation $y_{i}=g(x_{i})$, where $g$ is some arbitrarily complex but fixed function (like $g(x)=\sin(x^{2}\cdot \ln(\tanh(x)))+x e^x+\dots$).
Now I want to fit a feed-forward NN $f_{\theta}$ with width $W$ and depth $D$ to the data to approximate $g$ as well as possible, i.e. we minimize $$l(\theta)=\frac{1}{N}\sum_{i=1}^N(f_{\theta}(x_{i})-g(x_{i}))^2$$ with gradient descent.
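A minimal sketch of this setup, assuming PyTorch, a simple stand-in for $g$ (the $g$ above is only schematic), and concrete but arbitrary choices $W=64$, $D=3$, with full-batch gradient descent on the loss $l(\theta)$ above:

```python
import torch
import torch.nn as nn

# Noise-free data from a fixed target g (illustrative stand-in for the schematic g above).
def g(x):
    return torch.sin(3 * x) * torch.exp(-x ** 2)

N = 1000
x = torch.linspace(-2.0, 2.0, N).unsqueeze(1)   # inputs x_1, ..., x_N
y = g(x)                                        # exact targets y_i = g(x_i), no noise

# Feed-forward network f_theta with width W and depth D.
W, D = 64, 3
layers = [nn.Linear(1, W), nn.Tanh()]
for _ in range(D - 1):
    layers += [nn.Linear(W, W), nn.Tanh()]
layers += [nn.Linear(W, 1)]
f_theta = nn.Sequential(*layers)

# Full-batch gradient descent on l(theta) = (1/N) * sum_i (f_theta(x_i) - g(x_i))^2.
opt = torch.optim.SGD(f_theta.parameters(), lr=1e-2)
for step in range(10_000):
    opt.zero_grad()
    loss = ((f_theta(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
```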
What I don't mean
First I want to clarify what I don't mean. For noisy data I understand that more data is better, basically due to the law of large numbers. I also know that in statistical learning there are bounds for empirical risk minimization: if the data are i.i.d., the population risk is bounded in terms of the empirical risk [1]: $$L_{p}(f_{\theta}) \leq \hat{L}_{p}(f_{\theta})+2 \mathcal{R}_{m}(\mathcal{F})+\mathcal{O}\left(\sqrt{\frac{\ln (1 / \delta)}{m}}\right)$$
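For reference, here is the standard reading of the symbols in this bound (stated as an assumption, since they are not defined above): $$\hat{L}_{p}(f_{\theta})=\frac{1}{m}\sum_{i=1}^{m}\ell\bigl(f_{\theta}(x_{i}),y_{i}\bigr), \qquad L_{p}(f_{\theta})=\mathbb{E}_{(x,y)\sim p}\bigl[\ell(f_{\theta}(x),y)\bigr],$$ where $m$ is the number of i.i.d. samples, $\mathcal{R}_{m}(\mathcal{F})$ is the Rademacher complexity of the hypothesis class $\mathcal{F}$, and the inequality holds with probability at least $1-\delta$.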
Question
However, now we have data without uncertainty. Let's assume we only train on the data up to some $1 \ll n \ll N$. Is there a general theorem showing that training on $(x_{i},y_{i})_{i=1}^n$ leads to a worse fit, i.e. that we will not be able to find the same ideal parameters as with $(x_{i},y_{i})_{i=1}^N$, or is it just that convergence is slower?
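This is not a theorem, but the comparison can at least be probed empirically. A hedged sketch (reusing the illustrative $g$, width, and depth from the setup sketch, repeated here so the block is self-contained): fit once on $n$ of the $N$ noise-free points and once on all $N$, then compare both fits against $g$ on a dense held-out grid.

```python
import torch
import torch.nn as nn

def g(x):
    return torch.sin(3 * x) * torch.exp(-x ** 2)  # illustrative noise-free target

def fit(x, y, width=64, depth=3, steps=10_000, lr=1e-2):
    """Fit a feed-forward net to (x, y) with full-batch gradient descent."""
    layers = [nn.Linear(1, width), nn.Tanh()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.Tanh()]
    layers += [nn.Linear(width, 1)]
    net = nn.Sequential(*layers)
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return net

N, n = 1000, 50
x_full = torch.linspace(-2.0, 2.0, N).unsqueeze(1)
y_full = g(x_full)

# Take n of the N points, spread evenly so both sets cover the same interval.
idx = torch.linspace(0, N - 1, n).long()
f_small = fit(x_full[idx], y_full[idx])   # trained on (x_i, y_i)_{i=1}^n
f_full  = fit(x_full, y_full)             # trained on (x_i, y_i)_{i=1}^N

# Compare both fits to g on a dense held-out grid (a proxy for the quality of the fit).
x_test = torch.linspace(-2.0, 2.0, 5000).unsqueeze(1)
with torch.no_grad():
    err_small = ((f_small(x_test) - g(x_test)) ** 2).mean().item()
    err_full  = ((f_full(x_test)  - g(x_test)) ** 2).mean().item()
print(f"test MSE, n={n}: {err_small:.3e}   N={N}: {err_full:.3e}")
```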
My intuition
My intuition is that, if $(x_{i},y_{i})_{i=1}^n$ and $(x_{i},y_{i})_{i=1}^N$ carry the same "information" (I know this is a fuzzy term), then using $(x_{i},y_{i})_{i=1}^N$ should not improve the fit. It should even make generalisation worse, since we overfit.
