
I've used Lasagne to build an LSTM model to classify words with IOB tags. About 25-40% of the training words belong to the O class, and thus receive the same int32 class number, 126.

The words go through a context-window step, in order to increase the number of features and let each word be influenced by its neighbors. A minimal sketch of this step is shown below.
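For illustration, a context-window step could look like the following sketch, assuming each sentence is already a list of integer word indices and that a (hypothetical) padding index of -1 marks positions beyond the sentence boundaries:

    def context_window(sentence, win=7):
        """Return, for each word, the list of `win` surrounding word indices.

        `sentence` is a list of integer word indices; positions that fall
        outside the sentence are filled with the padding index -1.
        """
        assert win % 2 == 1, "window size must be odd"
        pad = [-1] * (win // 2)
        padded = pad + list(sentence) + pad
        return [padded[i:i + win] for i in range(len(sentence))]

For a sentence of 4 words and win=3, this yields 4 windows, e.g. [-1, w0, w1] for the first word.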

After that, the words (with their context windows) go through a word-embedding step before being fed to the model.
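One simple form such an embedding step could take is looking up a dense vector per index and concatenating across the window. This NumPy sketch uses the same -1 padding assumption as above; the embedding matrix and all sizes are hypothetical:

    import numpy as np

    # Hypothetical sizes; the extra row at the end serves the -1 padding index.
    vocab_size, emb_dim, win = 573, 50, 7
    rng = np.random.RandomState(42)
    E = 0.01 * rng.randn(vocab_size + 1, emb_dim)

    def embed_window(window):
        """Concatenate the embedding vectors of one context window.

        `window` is a list of `win` word indices; index -1 (padding)
        maps to the last row of E.
        """
        return np.concatenate([E[i] for i in window])  # shape: (win * emb_dim,)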

Early in training, my model assigns a variety of classes to the words, but it soon starts classifying many words with the same class:

[ 54   9 119  41  77   1   1  96  96  84  84  96  96  96  96  45  74  34   34  34  34  34  34  34  34  34  34  34  34  34  34  34  34  34  34  34  34  34  34  34  34  34  34  34  34  34]

[ 54  85   7 119  22   7 115  84  62  62  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71  71]

[ 85   1  83 113  13  36  82  58 126   2   2  17  19 117  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25  25]

After some training, it starts classifying every word as 126, the index of the O class.

It looks like a hyper-parameter configuration problem, but I don't have a clue how to fix it. Can someone give me a hint? Thank you.

1 Answer


It doesn't look like a hyper-parameter problem to me. If the algorithm converges to the majority class in the data, that is usually an indication that it is unable to distinguish between the classes based on the provided features.

Did you set up this problem/task yourself, or are you using a set-up that has been tried out before? In the latter case it may still be that you have to tweak some things; I would personally recommend taking a look at the activation functions first.

As a general side note, it is not recommended to take the hidden-state activations as the predictions in a recurrent network (or LSTM); instead, use an output layer that translates the hidden activations into outputs (i.e. a softmax layer). If you are not already doing this, I would recommend you first add an output layer and try again.
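As a rough illustration of that layering in Lasagne (not the asker's actual code; all sizes here are hypothetical), the usual pattern is to flatten the LSTM's per-timestep hidden states with a reshape layer, apply a dense softmax layer, and train on the flattened targets:

    import theano.tensor as T
    import lasagne

    # Hypothetical sizes for illustration only.
    VOCAB_SIZE, EMB_DIM, N_HIDDEN, N_CLASSES = 573, 100, 128, 127

    x = T.imatrix('x')  # (batch_size, seq_len) integer word indices
    l_in = lasagne.layers.InputLayer(shape=(None, None), input_var=x)
    l_emb = lasagne.layers.EmbeddingLayer(l_in, input_size=VOCAB_SIZE,
                                          output_size=EMB_DIM)
    l_lstm = lasagne.layers.LSTMLayer(l_emb, num_units=N_HIDDEN)
    # Flatten (batch, seq_len, n_hidden) -> (batch * seq_len, n_hidden)
    # so the DenseLayer produces one softmax distribution per timestep.
    l_flat = lasagne.layers.ReshapeLayer(l_lstm, (-1, N_HIDDEN))
    l_out = lasagne.layers.DenseLayer(l_flat, num_units=N_CLASSES,
                                      nonlinearity=lasagne.nonlinearities.softmax)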

Sjoerd
  • I've built the model myself, but the problem is well known: I'm classifying the ATIS dataset into IOB tags.

    I'm already using a reshape layer between the hidden LSTM layer and the output layer with the softmax activation function.

    Here is the main code. You only need Lasagne installed and the ATIS data to run it.

    Thank you.

    – Lucas Azevedo Nov 19 '15 at 16:00
  • What makes it even stranger to me is that the model seems to be pushed to classify LOTS of words with the LEAST number of classes. It only learns which ONE class is best for ALL the words, and optimizes that heuristic. – Lucas Azevedo Nov 20 '15 at 11:58