After reading matus's beautiful answer in this thread, which explains (among other things) Cybenko's proof of the Universal Approximation Theorem for neural networks, I wonder: if we use a piecewise constant function to approximate the target function, won't this have horrendous implications for generalization?
In practice, the only information we have about the target function comes from the training data, which forms a discrete set. The machine learning task, as the Universal Approximation Theorem perhaps suggests, could then be tackled by taking 'the simplest' piecewise constant function that perfectly fits the training set and representing it as a neural net with one hidden layer, sigmoid transfer functions, and an identity output node.
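To make the construction concrete, here is a minimal 1-D sketch of what I have in mind (the toy data, the steepness `k`, and all names are illustrative assumptions of mine, not anything taken from Cybenko's proof): the jumps of the piecewise constant fit sit at the midpoints between consecutive training inputs, and each jump is realized by one steep sigmoid unit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D training set (inputs sorted, targets arbitrary).
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.0, 1.0, 0.5, 2.0])

# The piecewise constant fit jumps at the midpoints between consecutive inputs.
mid = (x_train[:-1] + x_train[1:]) / 2.0

# One hidden layer of steep sigmoids (one unit per jump), identity output node.
k = 50.0                          # large slope -> each sigmoid is nearly a step
w_hidden = np.full_like(mid, k)   # hidden weights
b_hidden = -k * mid               # hidden biases put each step at a midpoint
w_out = np.diff(y_train)          # output weights = jump heights
b_out = y_train[0]                # output bias = value of the leftmost plateau

def net(x):
    """One-hidden-layer sigmoid network realizing the piecewise constant fit."""
    h = sigmoid(np.outer(x, w_hidden) + b_hidden)   # hidden activations
    return h @ w_out + b_out                        # identity output node

print(net(x_train))   # ~[0.0, 1.0, 0.5, 2.0]: fits the training set exactly
```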
But one can intuitively tell that such a trained model would perform terribly on test data! Given an unseen case C lying between training cases A1, A2, ..., Ak, instead of assigning some middle ground between the outputs for A1, ..., Ak, it will assign exactly the output for Ap, where Ap is whichever of the Ai is nearest to C.
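This is what I mean by "exactly the output for Ap": in the hard-step limit of the sketch above, the network collapses to a 1-nearest-neighbour predictor (again just an illustrative toy on the same made-up data):

```python
import numpy as np

# Same toy data as above; with perfectly steep sigmoids the network
# reduces to this piecewise constant function.
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.0, 1.0, 0.5, 2.0])

def piecewise_constant(x):
    """Hard-step limit of the sigmoid network: constant between midpoints."""
    mid = (x_train[:-1] + x_train[1:]) / 2.0
    return y_train[np.searchsorted(mid, x)]

def one_nearest_neighbour(x):
    """Output of the training point closest to each query."""
    return y_train[np.argmin(np.abs(x_train - x[:, None]), axis=1)]

x_test = np.array([0.3, 1.4, 2.6])       # unseen cases between training cases
print(piecewise_constant(x_test))        # [0. 1. 2.]
print(one_nearest_neighbour(x_test))     # identical: no 'middle ground' at all
```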
So isn't this theorem of much less value if it cannot enrich the intuition of a machine learning engineer wishing to construct the optimal neural network for a given task?