I am training an LSTM model for question answering: given an explanation/context and a question, it should predict the correct answer out of 4 options.
My model architecture is as follows (if not relevant, please ignore): I pass the (encoded) explanation and the question each through the same LSTM to get a vector representation of each, and add these representations together to get a combined representation for the explanation and question. I pass the answers through a separate LSTM to get an answer representation (50 units) of the same length. Each training example uses two answers, one correct and one wrong. From these I compute two cosine similarities against the combined representation, one for the correct answer and one for the wrong answer, and define my loss as a hinge loss: the correct answer's representation should have a high similarity with the question/explanation representation, while the wrong answer's should have a low one, so I try to maximize the difference between the two similarities by minimizing this loss.
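For reference, here is a minimal sketch of the setup I described, assuming PyTorch (I have omitted the actual framework details); `QAModel`, `hidden_size`, `margin` and the other names are just illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QAModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_size=50, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Shared encoder for both the explanation and the question
        self.context_lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        # Separate encoder for the answer candidates
        self.answer_lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def encode(self, lstm, tokens):
        # Use the final hidden state as the sequence representation
        _, (h, _) = lstm(self.dropout(self.embed(tokens)))
        return h[-1]

    def forward(self, explanation, question, correct, wrong):
        # Combined representation: sum of explanation and question encodings
        combined = (self.encode(self.context_lstm, explanation)
                    + self.encode(self.context_lstm, question))
        pos = F.cosine_similarity(combined, self.encode(self.answer_lstm, correct))
        neg = F.cosine_similarity(combined, self.encode(self.answer_lstm, wrong))
        return pos, neg

# Hinge loss: push the correct answer's similarity above the wrong one's by a margin
def hinge_loss(pos, neg, margin=0.5):
    return torch.clamp(margin - pos + neg, min=0).mean()
```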
The problem I find is that for the various hyperparameters I try (e.g. number of hidden units, LSTM vs. GRU), the training loss decreases, but the validation loss stays quite high. I use dropout with a rate of 0.5.
My dataset contains a little over 1,000 examples. Any advice on what to do, or on what might be wrong?
