I'm training 5 stacked Bi-LSTMs on an NLP task. The network fits well with sequences of length 30 and converges to around 0.97 AUROC. However, when I increase the sequence length to 50, this happens:
I'm not using masking (it slows training down), and some time steps are padded entries; the proportion of padding will have increased when I went from length 30 to 50.
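For reference, this is roughly what adding masking would look like in my case (a simplified sketch assuming Keras/TensorFlow, all-zero padding vectors, and placeholder layer sizes rather than my actual configuration):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # Mark all-zero time steps as padding so the recurrent layers skip them.
    # mask_value=0.0 and the input shape (50 steps, 128 features) are assumptions.
    layers.Masking(mask_value=0.0, input_shape=(50, 128)),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    # ... further stacked Bi-LSTM layers ...
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation="sigmoid"),
])
```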
I've tried a couple of different learning rates, from 0.001 down to 0.00005. The model is at the limit of my hardware, so I'm looking for hints to fix this without resorting to extensive hyperparameter tuning. The loss function is binary cross-entropy with a single binary output, and the optimiser is Adam.
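The training configuration is essentially this (again assuming Keras; the learning rate shown is one of the values I tried):

```python
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

# `model` as in the sketch above; single sigmoid output trained with
# binary cross-entropy and Adam, monitoring AUROC.
model.compile(
    optimizer=Adam(learning_rate=1e-3),  # also tried values down to 5e-5
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC()],
)
```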
Any ideas what I should look for?
