
The following table is a set of ordinal data from a survey I have conducted (one of many).

$$\begin{array}{|c|c|c|} \hline \text{Grading} & \text{Count} & \text{Frequency} \\ \hline 1 & 5 & 0.075 \\ \hline 2 & 3 & 0.045 \\ \hline 3 & 12 & 0.179 \\ \hline 4 & 10 & 0.149 \\ \hline 5 & 19 & 0.284 \\ \hline 6 & 11 & 0.164 \\ \hline 7 & 7 & 0.104 \\ \hline \text{Sum} & 67 & 1 \\ \hline \end{array}$$

I can work out the entropy of the given distribution easily enough where $H=2.618$ and $H_{max}=2.807$. One test that I would like to perform, however, is the confidence interval (say 95%) of the entropy for such a discrete data set.
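For concreteness, these values come from the plug-in formula applied to the frequencies above:

$$H = -\sum_{i=1}^{7} p_i \log_2 p_i \approx 2.618, \qquad H_{max} = \log_2 7 \approx 2.807$$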

Despite my efforts, I have not found a test to calculate this and would be surprised if this had not been done before. Can someone point me in the direction of a suitable test?

Mari153
  • Does this page help? It explains why this is not as straightforward as it might seem, given the inherent downward bias of the usual estimator for Shannon entropy. The answer there notes the R simboot package as one implementation of more reliable approaches. – EdM Aug 01 '22 at 16:11
  • @EdM Thanks for this. I was hoping this was a 'simple' test, but it seems way beyond the needs of my analysis. – Mari153 Aug 01 '22 at 21:19

1 Answer


NOTE: This approach does not give the true confidence interval, as explained in the comments. I'll leave this up because it still provides some notion of how volatile the estimate is, and may be valuable to some.

I think you can use bootstrapping for this: simply sample with replacement, compute the entropy for each sample, and take the percentiles from there.

An example (in Python) is below:

from collections import Counter
import numpy as np

def entropy(arr):
    # Plug-in (empirical) Shannon entropy in bits
    counts = Counter(arr)
    frequencies = [n / len(arr) for n in counts.values()]
    H = -sum(p * np.log2(p) for p in frequencies)
    return H

counts = [5, 3, 12, 10, 19, 11, 7]
scale = list(range(1, 8))

# Expand the counts into the raw list of 67 individual gradings
grades = sum([count*[grade] for count, grade in zip(counts, scale)], [])

print(entropy(grades)) # prints 2.618...

entropies = []
N_bootstrap = int(10**4)
randomstate = np.random.RandomState(seed=42)
for _ in range(N_bootstrap):
    # Resample the data with replacement and compute the entropy of each resample
    sample = randomstate.choice(grades, size=len(grades), replace=True)
    H = entropy(sample)
    entropies.append(H)

interval_width = 95
cut = (100 - interval_width) / 2
lower, upper = np.percentile(entropies, [cut, 100 - cut])

print(f"{interval_width}% CI: ({lower:.3f}, {upper:.3f}).")
# prints: 95% CI: (2.344, 2.705).

  • That doesn't work for the Shannon entropy in the OP, as the plug-in estimator is inherently downward-biased. Bootstrapped confidence intervals can even omit the point estimate! See this page for an example, detailed discussion, and alternate approaches. – EdM Aug 01 '22 at 16:05
  • @bjarkemoensted thanks for this. May not be a perfect answer re EdM's comment but it's a pragmatic approach. Sometimes in stats, pragmatic approaches are more worthwhile than strict statistical purity... – Mari153 Aug 01 '22 at 21:24