I want to generate sequences using LSTMs, much like Karpathy's char-rnn; my setup is nearly identical. During training the network decreases the error substantially over time, until it plateaus, presumably in some local minimum of the error function. My first question is: what are possible reasons for this behaviour, and what could be done to reduce the training error further?
But more importantly, my second question is about sampling. I work with discrete values and predict a softmax distribution over the output. For sampling, I draw values according to this distribution. When I initialize the network with some batches, i.e. feed it the inputs of the batches and take its output, the output is very close to the original input, which is good. But when I "let the network loose", i.e. feed it its own output as the next input to generate longer sequences, in all my experiments the network converges to some fixed state, meaning after a while it almost always outputs the same value. Why is that happening? My inputs never contain such stationary sequences. What can be done to get rid of this behaviour?
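To make the sampling setup concrete, here is a simplified sketch of the free-running loop I mean (NumPy only, not my actual code; `step_fn` is a hypothetical stand-in for one forward pass of the trained LSTM, and `temperature=1.0` corresponds to sampling straight from the predicted softmax, which is what I currently do):

```python
import numpy as np

def sample_from_logits(logits, temperature=1.0):
    # Softmax over the logits; temperature 1.0 = sample directly from
    # the predicted distribution (lower values would sharpen it).
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

def generate(step_fn, seed_ids, length, temperature=1.0):
    # step_fn(token_id, state) -> (logits, new_state) is assumed to wrap
    # one forward step of the trained LSTM (hypothetical interface).
    state = None
    # Warm up the hidden state on the seed sequence.
    for t in seed_ids:
        logits, state = step_fn(t, state)
    out = list(seed_ids)
    for _ in range(length):
        # Sample the next symbol and feed it back in as the next input.
        token = sample_from_logits(logits, temperature)
        out.append(token)
        logits, state = step_fn(token, state)
    return out
```

The warm-up phase corresponds to "initializing the network with some batches"; the loop afterwards is where the generated sequence collapses to a single repeated value.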
For both questions I have some thoughts of my own, but I would really like your opinion: do you think there is something wrong with my implementation, or is this a known issue with LSTMs? I could not find anything about it on the net.