After reading matus's beautiful answer in this thread, which explains (among other things) Cybenko's proof of the Universal Approximation Theorem for neural networks, I wonder: if we use a piecewise constant function to approximate the target function, won't this have horrendous implications for generalization?
In practice, the only information we have about the target function comes from the training data, which forms a discrete set. The machine learning task, as the Universal Approximation Theorem perhaps suggests, could then be tackled by taking 'the simplest' piecewise constant function that perfectly fits the training set and representing it as a neural net with one hidden layer, sigmoid transfer functions, and an identity output node.
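To make the construction concrete, here is a minimal 1-D sketch of what I have in mind (the toy data, the steepness `k`, and all names are illustrative assumptions of mine, not anything taken from Cybenko's proof): the jumps of the piecewise constant fit sit at the midpoints between consecutive training inputs, and each jump is realized by one steep sigmoid unit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D training set (inputs sorted, targets arbitrary).
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.0, 1.0, 0.5, 2.0])

# The piecewise constant fit jumps at the midpoints between consecutive inputs.
mid = (x_train[:-1] + x_train[1:]) / 2.0

# One hidden layer of steep sigmoids (one unit per jump), identity output node.
k = 50.0                          # large slope -> each sigmoid is nearly a step
w_hidden = np.full_like(mid, k)   # hidden weights
b_hidden = -k * mid               # hidden biases put each step at a midpoint
w_out = np.diff(y_train)          # output weights = jump heights
b_out = y_train[0]                # output bias = value of the leftmost plateau

def net(x):
    """One-hidden-layer sigmoid network realizing the piecewise constant fit."""
    h = sigmoid(np.outer(x, w_hidden) + b_hidden)   # hidden activations
    return h @ w_out + b_out                        # identity output node

print(net(x_train))   # ~[0.0, 1.0, 0.5, 2.0]: fits the training set exactly
```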
But one can intuitively tell that such a trained model would perform terribly on test data! Given an unseen case C lying between training cases A1, A2, ..., Ak, instead of assigning some middle ground between the outputs for A1, ..., Ak, it will assign exactly the output for Ap, where Ap is whichever of the Ai is nearest to C.
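This is what I mean by "exactly the output for Ap": in the hard-step limit of the sketch above, the network collapses to a 1-nearest-neighbour predictor (again just an illustrative toy on the same made-up data):

```python
import numpy as np

# Same toy data as above; with perfectly steep sigmoids the network
# reduces to this piecewise constant function.
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.0, 1.0, 0.5, 2.0])

def piecewise_constant(x):
    """Hard-step limit of the sigmoid network: constant between midpoints."""
    mid = (x_train[:-1] + x_train[1:]) / 2.0
    return y_train[np.searchsorted(mid, x)]

def one_nearest_neighbour(x):
    """Output of the training point closest to each query."""
    return y_train[np.argmin(np.abs(x_train - x[:, None]), axis=1)]

x_test = np.array([0.3, 1.4, 2.6])       # unseen cases between training cases
print(piecewise_constant(x_test))        # [0. 1. 2.]
print(one_nearest_neighbour(x_test))     # identical: no 'middle ground' at all
```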
So isn't this theorem of much less value if it cannot enrich the intuition of a machine learning engineer wishing to construct the optimal neural network for a given task?