There are two common strategies for handling an RNN's hidden state across sequences.
- You have one long, contiguous sequence (for example, a language model trained on the text of War and Peace); because the novel's words have a specific order, you train on consecutive chunks, and the final hidden state of the previous chunk is used as the initial hidden state of the next one.
The way most people do this is to traverse the sequences in order, without shuffling. Suppose you use a mini-batch size of 2. You can cut the book in half, so the first sample always comes from the first half of War and Peace and the second sample always comes from the second half. Rather than drawing samples at random, the text is read strictly in order: the first sample of the first mini-batch contains the opening words of the text, and the second sample of the first mini-batch contains the first words after the midpoint. (A code sketch of both strategies follows this list.)
Purely abstractly, I suppose you could do something more complicated where you shuffle the data but compute the correct initial hidden state for each position in the sequence (e.g. by running the network over all the text up to that point, or by saving and restoring states), but this sounds expensive.
- You have lots of distinct sequences (such as individual tweets); here it can make sense to start each sequence with a hidden state of all zeros. Some people prefer to train a "baseline" initial state instead (Sam Weisenthal's suggestion). I read an article advocating this when your data consists of many short sequences, but I can't find the article now.
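Here is a minimal PyTorch sketch of both strategies. The toy corpus, vocabulary size, chunk length, and hyperparameters are made-up placeholders for illustration, not anything from the discussion above.

```python
# Minimal PyTorch sketch of both strategies (placeholder data and sizes).
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, EMBED, HIDDEN = 100, 32, 64
BATCH, CHUNK = 2, 35                     # batch of 2, as in the War and Peace example

embed = nn.Embedding(VOCAB, EMBED)
rnn = nn.GRU(EMBED, HIDDEN, batch_first=True)
head = nn.Linear(HIDDEN, VOCAB)
params = list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.1)

# --- Strategy 1: one long contiguous sequence, state carried between batches ---
# Fake "book": row 0 is always the first half of the token stream, row 1 the
# second half, and the chunks are visited strictly in order (no shuffling).
corpus = torch.randint(0, VOCAB, (BATCH, 10 * CHUNK + 1))

h = torch.zeros(1, BATCH, HIDDEN)        # zeros only for the very first chunk
for start in range(0, corpus.size(1) - 1, CHUNK):
    x = corpus[:, start:start + CHUNK]            # inputs
    y = corpus[:, start + 1:start + CHUNK + 1]    # next-token targets
    out, h = rnn(embed(x), h)
    loss = nn.functional.cross_entropy(head(out).reshape(-1, VOCAB), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()                       # keep the value, cut the graph (truncated BPTT)

# --- Strategy 2: many short, unrelated sequences (e.g. tweets) ---
tweets = torch.randint(0, VOCAB, (BATCH, 20))
out, _ = rnn(embed(tweets))              # no initial state passed -> all zeros

# Or learn a shared "baseline" initial state instead of zeros:
h0 = nn.Parameter(torch.zeros(1, 1, HIDDEN))
out, _ = rnn(embed(tweets), h0.expand(1, tweets.size(0), HIDDEN).contiguous())
```

The `h.detach()` call is the usual truncated-backprop trick: the carried-over state keeps its value but stops gradients from flowing back into earlier chunks.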
Which strategy is appropriate depends on the problem, and on the specific choices you make about how to represent that problem.
From a software-development perspective, an ideal implementation would expose both options to users. This can be tricky, and different libraries (PyTorch, TensorFlow, Keras) achieve it in different ways.
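As one illustration of the difference: PyTorch's recurrent modules take an optional initial-state argument (pass nothing for zeros, the detached previous state for the stateful case, or a learned parameter for a trained baseline), whereas Keras bakes the stateful case into the layer via a `stateful=True` flag. Below is a sketch assuming the tf.keras API; shapes, hyperparameters, and the random data are placeholders.

```python
# tf.keras sketch: stateful=True makes the LSTM keep its hidden state between
# successive batches instead of resetting to zeros each time.
import numpy as np
import tensorflow as tf

BATCH, TIMESTEPS, FEATURES = 2, 35, 32

model = tf.keras.Sequential([
    tf.keras.Input(shape=(TIMESTEPS, FEATURES), batch_size=BATCH),
    tf.keras.layers.LSTM(64, stateful=True),   # state carries over across batches
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

# Consecutive chunks of the same long sequence must be fed in order, so that
# the carried-over state lines up with the right sample: hence shuffle=False.
x = np.random.randn(10 * BATCH, TIMESTEPS, FEATURES).astype("float32")
y = np.random.randn(10 * BATCH, 1).astype("float32")
model.fit(x, y, batch_size=BATCH, epochs=1, shuffle=False)

model.reset_states()   # back to all zeros, e.g. before starting a new book/epoch
```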