I use a neural network which consists of:
- input layer (2 neurons)
- hidden layer (2 neurons, 2 biases)
- output layer (1 neuron, 1 bias)
The weights and biases are randomly initialized from the range [-1, 1].
I use a learning rate of 1 (with 0.01, 0.1, 0.2, 0.5, 0.7 or 2 the NN needs more iterations to converge), sigmoid as the activation function, and stochastic gradient descent as the learning algorithm.
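Roughly, my setup is equivalent to this NumPy sketch (simplified; the variable names, per-sample update order and stopping check are just for illustration, my actual code may differ in details):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng()

# 2-2-1 network: weights and biases drawn uniformly from [-1, 1]
W1 = rng.uniform(-1, 1, size=(2, 2))   # input -> hidden
b1 = rng.uniform(-1, 1, size=2)        # 2 hidden biases
W2 = rng.uniform(-1, 1, size=(2, 1))   # hidden -> output
b2 = rng.uniform(-1, 1, size=1)        # 1 output bias

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

lr = 1.0
for iteration in range(1_000_000):      # iteration cap just for the sketch
    # stochastic gradient descent: update on one randomly chosen pattern
    i = rng.integers(len(X))
    x, t = X[i], y[i]

    # forward pass
    h = sigmoid(x @ W1 + b1)            # hidden activations
    o = sigmoid(h @ W2 + b2)            # output activation

    # backward pass (MSE loss, sigmoid derivative o * (1 - o))
    delta_o = (o - t) * o * (1 - o)             # output delta, shape (1,)
    delta_h = (delta_o @ W2.T) * h * (1 - h)    # hidden deltas, shape (2,)

    # parameter updates
    W2 -= lr * np.outer(h, delta_o)
    b2 -= lr * delta_o
    W1 -= lr * np.outer(x, delta_h)
    b1 -= lr * delta_h

    # stop once the MSE over all four XOR patterns drops below the threshold
    O = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    if np.mean((y - O) ** 2) < 0.001:
        break
```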
When the MSE drops below 0.001, the outputs for XOR look like this:
[0, 0] -> 0.031
[0, 1] -> 0.971
[1, 0] -> 0.971
[1, 1] -> 0.030
And if the MSE is less than 0.0001, the output is:
[0, 0] -> 0.009
[0, 1] -> 0.991
[1, 0] -> 0.991
[1, 1] -> 0.008
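(By MSE I mean the squared error averaged over all four XOR patterns; a quick check with the first set of numbers above, just to show what I'm measuring:)

```python
import numpy as np

targets = np.array([0, 1, 1, 0], dtype=float)
outputs = np.array([0.031, 0.971, 0.971, 0.030])   # outputs reported at MSE < 0.001

mse = np.mean((targets - outputs) ** 2)
print(mse)   # ~0.00089, i.e. just under the 0.001 threshold
```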
So when I train the NN until MSE < 0.001, most of the time it takes ~10,000 iterations. Less often, roughly 1 in 10 runs, it takes ~40,000 iterations, sometimes even ~100,000 or ~1,000,000, and sometimes it can't reach this error at all (I abort a run when it hasn't reached it within 1 billion iterations).
When I train it until MSE < 0.0001, the usual number of iterations is ~67,000. Less often, roughly 1 in 20 runs, it takes hundreds of thousands or millions of iterations, and sometimes it never reaches this error either.
Thus, my questions are:
- Is MSE < 0.001 small enough (not only for XOR, but also for other problems, like handwritten digit recognition)? Or would, say, 0.1 already be enough?
- Aren't these iteration counts too high? In other words, how many iterations should it typically take?
- Is it normal that the network sometimes can't reach these small errors at all, or that reaching e.g. MSE < 0.001 takes hundreds of thousands or millions of iterations? Should I restart the NN (re-initialize the weights) when it doesn't converge?

