I use a neural network which consists of:
- input layer (2 neurons)
- hidden layer (2 neurons, 2 biases)
- output layer (1 neuron, 1 bias)
The weights and biases are randomly initialized from the range [-1, 1].
I use a learning rate of 1 (with 0.01, 0.1, 0.2, 0.5, 0.7 or 2 the NN needs more iterations to converge), sigmoid as the activation function, and stochastic gradient descent as the learning algorithm.
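Roughly, my setup is equivalent to this NumPy sketch (simplified; the variable names, per-sample update order and stopping check are just for illustration, my actual code may differ in details):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng()

# 2-2-1 network: weights and biases drawn uniformly from [-1, 1]
W1 = rng.uniform(-1, 1, size=(2, 2))   # input -> hidden
b1 = rng.uniform(-1, 1, size=2)        # 2 hidden biases
W2 = rng.uniform(-1, 1, size=(2, 1))   # hidden -> output
b2 = rng.uniform(-1, 1, size=1)        # 1 output bias

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

lr = 1.0
for iteration in range(1_000_000):      # iteration cap just for the sketch
    # stochastic gradient descent: update on one randomly chosen pattern
    i = rng.integers(len(X))
    x, t = X[i], y[i]

    # forward pass
    h = sigmoid(x @ W1 + b1)            # hidden activations
    o = sigmoid(h @ W2 + b2)            # output activation

    # backward pass (MSE loss, sigmoid derivative o * (1 - o))
    delta_o = (o - t) * o * (1 - o)             # output delta, shape (1,)
    delta_h = (delta_o @ W2.T) * h * (1 - h)    # hidden deltas, shape (2,)

    # parameter updates
    W2 -= lr * np.outer(h, delta_o)
    b2 -= lr * delta_o
    W1 -= lr * np.outer(x, delta_h)
    b1 -= lr * delta_h

    # stop once the MSE over all four XOR patterns drops below the threshold
    O = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    if np.mean((y - O) ** 2) < 0.001:
        break
```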
When the MSE drops below 0.001, the outputs for XOR look like this:
[0, 0] -> 0.031
[0, 1] -> 0.971
[1, 0] -> 0.971
[1, 1] -> 0.030
And if the MSE is less than 0.0001, the output is:
[0, 0] -> 0.009
[0, 1] -> 0.991
[1, 0] -> 0.991
[1, 1] -> 0.008
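(By MSE I mean the squared error averaged over all four XOR patterns; a quick check with the first set of numbers above, just to show what I'm measuring:)

```python
import numpy as np

targets = np.array([0, 1, 1, 0], dtype=float)
outputs = np.array([0.031, 0.971, 0.971, 0.030])   # outputs reported at MSE < 0.001

mse = np.mean((targets - outputs) ** 2)
print(mse)   # ~0.00089, i.e. just under the 0.001 threshold
```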
So when I train the NN until MSE < 0.001, most of the time it takes ~10,000 iterations. Less often, roughly 1 in 10 runs, it takes ~40,000 iterations, sometimes even ~100,000 or ~1,000,000, and sometimes it can't reach this error at all (I abort a run when it hasn't reached it within 1 billion iterations).
When I train it until MSE < 0.0001, the usual number of iterations is ~67,000. Less often, roughly 1 in 20 runs, it takes hundreds of thousands or millions of iterations, and sometimes it never reaches this error either.
Thus, my questions are:
- Is MSE < 0.001 small enough (not only for XOR, but also for other problems, like handwritten digit recognition)? Or would, say, 0.1 already be enough?
- Aren't these iteration counts too high? In other words, how many iterations should it typically take?
- Is it normal that the network sometimes can't reach these small errors at all, or that reaching e.g. MSE < 0.001 takes hundreds of thousands or millions of iterations? Should I restart the NN (re-initialize the weights) when it doesn't converge?

