
I'm training 5 stacked Bi-LSTMs on an NLP task. The network fits well with a time series of length 30, and converges to around 0.97 AUROC. However, when I increase the length of the time series to 50, this happens:

Loss rapidly shrinks in both training and validation, before bouncing back up

I'm not using masking (it slows training down), and some time steps are padded entries; the proportion of padded steps will have increased when I increased the length of the time series from 30 to 50.

I've tried a couple of different learning rates, from 0.001 down to 0.00005. This model is at the limit of my hardware, so I'm looking for some hints to fix it without resorting to extensive hyperparameter tuning. The loss function is binary cross-entropy with a single binary output, and the optimiser is Adam.
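
The question doesn't name a framework, but for concreteness here is a minimal Keras/TensorFlow sketch of the setup as described (not the asker's actual code); the layer widths, feature dimension and sequence length are placeholder values:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, N_FEATURES = 50, 128                     # hypothetical dimensions

model = models.Sequential()
model.add(layers.Input(shape=(SEQ_LEN, N_FEATURES)))
# model.add(layers.Masking(mask_value=0.0))       # the masking step skipped above
for _ in range(4):                                # first four Bi-LSTMs return full sequences
    model.add(layers.Bidirectional(layers.LSTM(64, return_sequences=True)))
model.add(layers.Bidirectional(layers.LSTM(64)))  # fifth layer returns only the final state
model.add(layers.Dense(1, activation="sigmoid"))  # single binary output

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # tried 1e-3 down to 5e-5
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(curve="ROC")],             # AUROC, as reported above
)
```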

Any ideas what I should look for?

  • Does this behavior occur if you use masking? Does it occur if you use vanilla SGD without momentum? I'm wondering if the network tends to predict the padding character for everything, giving the early decrease in loss, but then Adam's momentum takes over and the optimizer can't change direction quickly enough, causing it to assign high probability to the padding character alone. If it also occurs with vanilla SGD, it could be that the gradients have vanished, so it can't correct itself. – Sycorax Feb 16 '21 at 15:14
  • Both the 30-length and 50-length datasets have padding. The 50-length will certainly have more padded time steps. I don't know if this is a red herring or not, though! Your idea is an interesting one. Each epoch has 12,000 batches, so although the reductions in loss look very steep, they are happening over roughly 4×12,000 parameter updates before the loss starts to climb back up. I can run an experiment with SGD and see if it is momentum (a sketch of that check follows these comments), thanks! – Noel Kennedy Feb 16 '21 at 15:28
  • Sorry, I meant to type “masking” in my first sentence. (I haven’t had my coffee yet.) – Sycorax Feb 16 '21 at 15:31
  • Thanks for your help! – Noel Kennedy Feb 22 '21 at 11:25
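
A minimal sketch of the control experiment suggested in the comments, reusing `model` from the sketch above: recompile with plain SGD (momentum set to 0) and retrain, so that any bounce in the loss can no longer be attributed to Adam's accumulated momentum. The learning rate is a placeholder, and `train_generator`/`val_generator` are invented names.

```python
import tensorflow as tf

# Same architecture, but a momentum-free optimiser for the comparison run.
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.0),
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(curve="ROC")],
)
# history = model.fit(train_generator, validation_data=val_generator, epochs=...)
```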

1 Answer


It turns out this was only coincidentally related to the increase in the length of the time series. I was using a generator to iterate over the training set, and when I increased the length of the time series I tweaked the generator to use less RAM and introduced a bug: the generator was progressively scrambling the training examples. In the first few epochs, the training samples were not so scrambled that the model couldn't find a pattern to fit, so the AUROC increased. Because the scrambling was random and cumulative, any pattern in the data progressively turned into noise; the model's fit always lagged behind the amount of scrambling in the current epoch, so the loss increased each time.
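
The answer doesn't show the generator itself, so the following is only a hypothetical illustration of the kind of bug described: an in-place, RAM-saving shuffle that decouples a further slice of inputs from their labels on every pass, so the damage accumulates epoch by epoch. `buggy_generator`, `batch_size` and `shuffle_fraction` are invented names, not the asker's code.

```python
import numpy as np

def buggy_generator(X, y, batch_size, shuffle_fraction=0.1):
    """Yield mini-batches forever, re-shuffling a small slice in place each pass.

    Bug: X and y are shuffled with *independent* permutations, so roughly
    `shuffle_fraction` of the (X, y) pairs become mismatched on every epoch,
    and because the shuffle is in place the corruption accumulates.
    """
    n = len(X)
    k = int(n * shuffle_fraction)
    while True:                                      # one iteration per epoch
        idx = np.random.choice(n, size=k, replace=False)
        X[idx] = X[np.random.permutation(idx)]       # in-place partial shuffle of the inputs
        y[idx] = y[np.random.permutation(idx)]       # labels get their own, different order
        for start in range(0, n, batch_size):
            yield X[start:start + batch_size], y[start:start + batch_size]
```

With a generator like this, the first few epochs still contain mostly intact (input, label) pairs, so the model improves; as more pairs are decoupled, the signal turns into noise and the loss climbs back up, matching the curve described in the question.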