I want to generate sequences using LSTMs, much like Karpathy's char-rnn; my setup is nearly identical. During training the network decreases the error substantially over time, until it plateaus, presumably in some local minimum of the error function. My first question is: what are possible reasons for this behaviour, and what could be done to reduce the training error further?
But more importantly, my second question is about sampling. I work with discrete values and predict a softmax distribution over the output. For sampling, I draw values according to this distribution. When I initialize the network with some batches, i.e. feed it the inputs of the batches and take its output, the output is very close to the original input, which is good. But when I "let the network loose", i.e. feed it its own output as the next input to generate longer sequences, in all my experiments the network converges to some fixed state, meaning after a while it almost always outputs the same value. Why is that happening? My inputs never contain such stationary sequences. What can be done to get rid of this behaviour?
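To make the sampling setup concrete, here is a simplified sketch of the free-running loop I mean (NumPy only, not my actual code; `step_fn` is a hypothetical stand-in for one forward pass of the trained LSTM, and `temperature=1.0` corresponds to sampling straight from the predicted softmax, which is what I currently do):

```python
import numpy as np

def sample_from_logits(logits, temperature=1.0):
    # Softmax over the logits; temperature 1.0 = sample directly from
    # the predicted distribution (lower values would sharpen it).
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

def generate(step_fn, seed_ids, length, temperature=1.0):
    # step_fn(token_id, state) -> (logits, new_state) is assumed to wrap
    # one forward step of the trained LSTM (hypothetical interface).
    state = None
    # Warm up the hidden state on the seed sequence.
    for t in seed_ids:
        logits, state = step_fn(t, state)
    out = list(seed_ids)
    for _ in range(length):
        # Sample the next symbol and feed it back in as the next input.
        token = sample_from_logits(logits, temperature)
        out.append(token)
        logits, state = step_fn(token, state)
    return out
```

The warm-up phase corresponds to "initializing the network with some batches"; the loop afterwards is where the generated sequence collapses to a single repeated value.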
For both questions I have some thoughts of my own, but I would really like your opinion: do you think there is something wrong with my implementation, or is this a known issue with LSTMs? I could not find anything about it on the net.