How to avoid underflow of the probability of sentence in calculating the perplexity of corpus

Question

I am looking at the post "How to find the perplexity of a corpus". I understand the whole post, but I have a problem with one step.

The probability of a sentence appearing in a corpus, under a unigram model, is given by $p(s) = \prod_{i=1}^n p(w_i)$, where $p(w_i)$ is the probability that the word $w_i$ occurs.

For a large corpus, even if I only calculate the probability of a single sentence by taking the product $\prod$ over the words in that sentence, I still get a probability of 0, which causes an error in the subsequent log2 calculation. Can someone help me?

score 1 · Accepted Answer · answered Sep 19 '22 at 02:50

We deal with underflow in probability calculations by working in log-space (i.e., dealing with log-probabilities instead of probabilities for all intermediate calculations). This often requires us to add or subtract probabilities in log-space or conduct other computations in log-space. In the present case, the equation of interest:

$$p(s) = \prod_{i=1}^n p(w_i),$$

can be written easily in log-space as:

$$\ell(s) = \sum_{i=1}^n \ell(w_i),$$

where $\ell(s) \equiv \log p(s)$ and $\ell(w_i) \equiv \log p(w_i)$ are the log-probabilities.
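As a rough illustration of the difference, here is a minimal Python sketch. The toy unigram table `unigram_prob`, the helper name `sentence_log2_prob`, and the choice of base-2 logarithms (to match the log2 in the question) are assumptions made for the example, not part of the original post.

```python
import math

def sentence_log2_prob(words, unigram_prob):
    """Return log2 p(s) = sum_i log2 p(w_i) under a unigram model.

    Assumes every word in `words` has a nonzero probability in `unigram_prob`.
    """
    return sum(math.log2(unigram_prob[w]) for w in words)

# Hypothetical toy model: a single word type with probability 1e-5.
unigram_prob = {"w": 1e-5}
sentence = ["w"] * 100

# Direct product: 1e-500 is far below the smallest positive double,
# so the result underflows to exactly 0.0.
naive_prob = math.prod(unigram_prob[w] for w in sentence)

# Sum of logs: a perfectly ordinary floating-point number.
log2_prob = sentence_log2_prob(sentence, unigram_prob)

print(naive_prob)  # 0.0
print(log2_prob)   # roughly -1661
```

The product itself cannot be represented as a double, but its logarithm is a modest negative number, which is why the sum form is safe.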

When undertaking probability calculations in log-space, you should do all intermediate calculations in log-space and then convert back to regular probability space only at the end of your calculations. This will avoid underflow by ensuring that small probabilities are represented in intermediate calculations only through their log-probabilities.
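For perplexity in particular, the only conversion out of log-space can be the final exponentiation. The sketch below assumes the common per-word definition, $2^{-\frac{1}{N}\sum_i \log_2 p(w_i)}$ with $N$ the total number of words in the corpus; the function name `corpus_perplexity` and the toy probabilities are hypothetical, and the definition should be checked against the one used in the linked post.

```python
import math

def corpus_perplexity(sentences, unigram_prob):
    """Per-word perplexity of a corpus under a unigram model.

    `sentences` is a list of word lists. Assumes every word has a
    nonzero probability in `unigram_prob`.
    """
    total_log2 = 0.0  # running sum of log2 p(w_i) over the whole corpus
    n_words = 0
    for words in sentences:
        total_log2 += sum(math.log2(unigram_prob[w]) for w in words)
        n_words += len(words)
    # Leave log-space only here: the average log-probability is a
    # moderate number, so the exponentiation does not underflow.
    return 2.0 ** (-total_log2 / n_words)

# Hypothetical toy model and corpus.
unigram_prob = {"a": 0.5, "b": 0.25}
corpus = [["a", "b"], ["a", "a"]]
print(corpus_perplexity(corpus, unigram_prob))  # 2 ** 1.25
```

Because the sum is divided by $N$ before exponentiating, the sentence probability $p(s)$ never has to be formed explicitly, no matter how long the corpus is.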
