I am training an LSTM model for question answering: given an explanation/context and a question, it should predict the correct answer out of 4 options.
My model architecture is as follows (if not relevant, please ignore): I pass the (encoded) explanation and the question each through the same LSTM to get a vector representation of each, and add these representations together to get a combined representation for the explanation and question. I pass the answers through a separate LSTM to get an answer representation (50 units) of the same length. Each training example uses two answers, one correct and one wrong. From these I compute two cosine similarities against the combined representation, one for the correct answer and one for the wrong answer, and define my loss as a hinge loss: the correct answer's representation should have a high similarity with the question/explanation representation, while the wrong answer's should have a low one, so I try to maximize the difference between the two similarities by minimizing this loss.
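For reference, here is a minimal sketch of the setup I described, assuming PyTorch (I have omitted the actual framework details); `QAModel`, `hidden_size`, `margin` and the other names are just illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QAModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_size=50, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Shared encoder for both the explanation and the question
        self.context_lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        # Separate encoder for the answer candidates
        self.answer_lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def encode(self, lstm, tokens):
        # Use the final hidden state as the sequence representation
        _, (h, _) = lstm(self.dropout(self.embed(tokens)))
        return h[-1]

    def forward(self, explanation, question, correct, wrong):
        # Combined representation: sum of explanation and question encodings
        combined = (self.encode(self.context_lstm, explanation)
                    + self.encode(self.context_lstm, question))
        pos = F.cosine_similarity(combined, self.encode(self.answer_lstm, correct))
        neg = F.cosine_similarity(combined, self.encode(self.answer_lstm, wrong))
        return pos, neg

# Hinge loss: push the correct answer's similarity above the wrong one's by a margin
def hinge_loss(pos, neg, margin=0.5):
    return torch.clamp(margin - pos + neg, min=0).mean()
```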
The problem I find is that for the various hyperparameters I try (e.g. number of hidden units, LSTM vs. GRU), the training loss decreases, but the validation loss stays quite high. I use dropout with a rate of 0.5.
My dataset contains a little over 1,000 examples. Any advice on what to do, or on what might be wrong?
