I am trying to find a way to calculate the perplexity of a language model over multiple 3-word examples from my test set, or the perplexity of the test-set corpus as a whole. As the test set, I have a paragraph which I've split into 3-word examples like this: if the corpus is "Hello my name is Jack.", my 3-word examples would be "Hello my name", "my name is" and "name is Jack". The first two words are fed into the language model and the third is the correct label.
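To make the setup concrete, here is a minimal sketch of how I build these examples (plain whitespace tokenization, just for illustration; my real preprocessing may differ):

corpus = "Hello my name is Jack."
tokens = corpus.replace(".", "").split()   # ['Hello', 'my', 'name', 'is', 'Jack']

examples = []
for i in range(len(tokens) - 2):
    context = tokens[i:i + 2]   # first 2 words, fed into the model
    label = tokens[i + 2]       # third word, the correct label
    examples.append((context, label))

print(examples)
# [(['Hello', 'my'], 'name'), (['my', 'name'], 'is'), (['name', 'is'], 'Jack')]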
I've tried to put together a function that calculates perplexity, based on the following link: https://stackoverflow.com/questions/44697318/how-to-implement-perplexity-in-keras
- from there I've chosen this function to calculate the perplexity of individual examples:
from keras import backend as K

def perplexity(y_true, y_pred):
    """
    The perplexity metric. Why isn't this part of Keras yet?!
    https://stackoverflow.com/questions/41881308/how-to-calculate-perplexity-of-rnn-in-tensorflow
    https://github.com/keras-team/keras/issues/8267
    """
    # cross_entropy = K.sparse_categorical_crossentropy(y_true, y_pred)
    cross_entropy = K.categorical_crossentropy(y_true, y_pred)  # one cross-entropy value per example
    perplexity = K.exp(cross_entropy)                           # per-example perplexity
    return perplexity
- my labels are one-hot, and I've found that for one-hot labels categorical_crossentropy should be used instead of sparse_categorical_crossentropy. This function seems to work fine for individual examples (weird examples get higher perplexity, while normal examples get lower).
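For illustration, a toy check like this (made-up numbers, a vocabulary of 4 words) gives one perplexity per example, with a uniform guess coming out around the vocabulary size:

import numpy as np
from keras import backend as K

# Toy check: vocabulary of 4 words, 2 examples.
y_true = K.constant(np.array([[0., 1., 0., 0.],
                              [0., 0., 1., 0.]]))
y_pred = K.constant(np.array([[0.10, 0.70, 0.10, 0.10],    # confident and correct
                              [0.25, 0.25, 0.25, 0.25]]))  # uniform guess

print(K.eval(perplexity(y_true, y_pred)))
# -> roughly [1.43, 4.0], one value per example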
Following this question, How to find the perplexity of a corpus, I've tried to put together a function for the final perplexity, which would give me only one number instead of a number for each individual example.
This is the code I've come up with:
def total_perplexity(perplexities, N):
    # perplexities is a tf.Tensor of per-example perplexities
    # N is the vocab size
    log_perp = K.log(perplexities)
    sum_perp = K.sum(log_perp)
    divided_perp = sum_perp / N
    return np.exp(-1 * sum_perp)
- here perplexities is the outcome of the perplexity(y_true, y_pred) function. However, across different examples, some of which make sense and some of which are total gibberish, the final perplexity tends towards 1 for smaller texts and goes to 0 as the size of the corpus grows.
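For reference, this is roughly how I chain the two functions together (toy placeholder data and names; I'm on TF 2.x with eager execution, which is why mixing np.exp with Keras tensors runs at all):

import numpy as np
from keras import backend as K

# Toy placeholder data: vocabulary of 4 words, 3 test examples.
y_true_test = K.constant(np.array([[0., 1., 0., 0.],
                                   [0., 0., 1., 0.],
                                   [1., 0., 0., 0.]]))
y_pred_test = K.constant(np.array([[0.10, 0.70, 0.10, 0.10],
                                   [0.20, 0.20, 0.50, 0.10],
                                   [0.60, 0.20, 0.10, 0.10]]))

per_example_perp = perplexity(y_true_test, y_pred_test)   # one value per example
corpus_perp = total_perplexity(per_example_perp, N=4)     # N = vocab size
print(corpus_perp)   # the single number described above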
What am I doing wrong?
Alternatively, do you recommend any other metrics for evaluating my language model?
EDIT: I've also tried a different approach, taking the true label's predicted probability (from model.predict(...)) as the word's probability and plugging that into the corpus perplexity formula. Here is the new function:
def final_perplexity(y_true, y_pred, vocab_size):
    # y_true is the one-hot test labels, y_pred is the output of model.predict(...)
    # vocab_size is the vocabulary size
    one_hot_indices = [np.where(r == 1)[0][0] for r in y_true]
    one_hot_probabilities = y_pred[range(len(one_hot_indices)), one_hot_indices]
    log_perp = K.log(one_hot_probabilities)
    divided_perp = log_perp / vocab_size
    sum_perp = K.sum(divided_perp)
    return np.exp(-1 * sum_perp)
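This is roughly how I call it (toy placeholder arrays; in my real code y_pred comes from model.predict on the test inputs):

import numpy as np

# Toy placeholder data: 4-word vocabulary, 3 test examples.
y_true = np.array([[0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [1, 0, 0, 0]])
y_pred = np.array([[0.10, 0.70, 0.10, 0.10],   # in reality: model.predict(X_test)
                   [0.20, 0.20, 0.50, 0.10],
                   [0.60, 0.20, 0.10, 0.10]])

print(final_perplexity(y_true, y_pred, vocab_size=4))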
Now the perplexity goes to infinity for larger corpora. How do I solve this?