
The formula for the perplexity measure is:

$$\mathit{PP}(w_1^n) = \frac{1}{\sqrt[n]{p(w_1^n)}}$$

where $p(w_1^n) = \prod_{i=1}^n p(w_i)$.

If I understand it correctly, this means that I could calculate the perplexity of a single sentence. What does it mean if I'm asked to calculate the perplexity on a whole corpus?

Cheshie

1 Answer


First, just a small correction: if we have a sentence $s$ that contains $n$ words, its perplexity $\newcommand{\Perplexity}{\rm Perplexity} \Perplexity(s)$ is:

$$ \Perplexity(s) = \sqrt[n]{\frac{1}{p(w_1^n)}} $$
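For concreteness, here is a minimal Python sketch of that formula (the function name and the toy word probabilities are my own illustration, not part of the original answer):

```python
import math

def sentence_perplexity(word_probs):
    """Perplexity of one sentence: the n-th root of 1 / p(w_1^n)."""
    n = len(word_probs)
    p_sentence = math.prod(word_probs)       # p(w_1^n), product of the word probabilities
    return (1.0 / p_sentence) ** (1.0 / n)   # n-th root of the inverse probability

# Toy example: a 3-word sentence with made-up unigram probabilities
print(sentence_perplexity([0.1, 0.05, 0.2]))  # 1/0.001 = 1000, and 1000 ** (1/3) = 10.0
```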

If we want to know the perplexity of the whole corpus $C$, which contains $m$ sentences and $N$ words, we have to find out how well the model can predict all of those sentences together. So, let $(s_1, s_2, \ldots, s_m)$ be the sentences of $C$. The perplexity of the corpus, per word, is given by:

$$ \Perplexity(C) = \sqrt[N]{\frac{1}{P(s_1,s_2,...,s_m)}} $$

The probability of all of those sentences appearing together in the corpus $C$ (if we treat them as independent) is:

$$P(s_1,...,s_m) = \prod_{i=1}^{m} p(s_{i})$$

As you said in your question, the probability of a sentence appearing in the corpus, under a unigram model, is given by $p(s)=\prod_{i=1}^{n}p(w_i)$, where $p(w_i)$ is the probability that the word $w_i$ occurs.
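As a quick worked example (with made-up probabilities): for a two-word sentence $s = w_1 w_2$ with $p(w_1) = 0.1$ and $p(w_2) = 0.01$, we get

$$p(s) = 0.1 \times 0.01 = 0.001, \qquad \Perplexity(s) = \sqrt{\frac{1}{0.001}} \approx 31.6$$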

We are done.

But hold on. Since probabilities are real numbers between 0 and 1, the product $\prod_{i=1}^{m} p(s_{i})$ gets very small very quickly and can underflow on a computer. So, we can use the following transformations to replace the multiplications with additions:

\begin{align} \Perplexity(C) &= \sqrt[N]{\frac{1}{\prod_{i=1}^{m} p(s_{i})}} \\ &= 2^{\log_{2}\left[\prod_{i=1}^{m} p(s_{i})\right]^{-\frac{1}{N}}} \\ &= 2^{-\frac{1}{N}\log_{2}\left[\prod_{i=1}^{m} p(s_{i})\right]} \\ &= 2^{-\frac{1}{N}\sum_{i=1}^{m}\log_{2}{p(s_i)}} \end{align}

And this is the per-word perplexity of the corpus. If you feel uncomfortable with the log identities, Google for a list of logarithmic identities.
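In code, the log-space version might look like the following sketch (the function name and the toy log-probabilities are my own, for illustration only):

```python
def corpus_perplexity(sentence_log2_probs, total_words):
    """Per-word perplexity of a corpus from per-sentence log2-probabilities.

    Summing log-probabilities instead of multiplying raw probabilities
    avoids the underflow problem described above.
    """
    avg_neg_log2_prob = -sum(sentence_log2_probs) / total_words
    return 2.0 ** avg_neg_log2_prob

# Toy corpus: two sentences with log2 p(s_1) = -10, log2 p(s_2) = -14, N = 6 words
print(corpus_perplexity([-10.0, -14.0], total_words=6))  # 2 ** (24 / 6) = 16.0
```

In practice the per-sentence log-probabilities would themselves be computed as sums of per-word log-probabilities, so no raw probability product is ever formed.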

For further reading: NGrams (pdf).

silviojr