False-error rate in a Pearson test, when approximation by a $\chi^2$ distribution is invalid?

Question

The question arises in a cryptographic context involving a regulatory test of a physical source of random bits, with the null hypothesis that the bits are independent and unbiased. $n$ samples of 4 bits are drawn ($n=128$ or $80$), the number of samples $O_i$ in each of the 16 bins is counted, and the source is assumed defective if $$\sum_{i=1}^{16}\frac{(O_i-n/16)^2}{n/16}>65.0$$

The regulation-endorsed [KS2011] A proposal for: Functionality classes for random number generators, version 2.0, item 408, gives a false-error rate of $3.8\cdot 10^{-7}$ for $n=128$. The secondarily-endorsed [AIS31V1] A proposal for: Functionality classes and evaluation methodology for true (physical) random number generators, version 3.1, example E.6, gives the same false-error rate for $n=80$. Both my attempted exact computation and Monte-Carlo simulation suggest that the stated false-error rate is correct in [AIS31V1] only, and that the justification given (approximation by the $\chi^2$ distribution, which would give a false-error rate of $3.4\cdot 10^{-8}$) cannot be used to derive the correct value.
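For reference, the $3.4\cdot 10^{-8}$ figure from the $\chi^2_{15}$ approximation can be reproduced with the standard library alone. This is only a sketch: the function name `chi2_sf_odd` is made up for illustration, and it relies on the classical finite-sum expression for the chi-square upper tail with an odd number of degrees of freedom (normal tail plus a short series).

```python
import math

def chi2_sf_odd(x, df):
    """Upper tail P(X > x) of a chi-square variable with ODD df, for x > 0."""
    if df % 2 != 1 or x <= 0:
        raise ValueError("this sketch only handles odd df and x > 0")
    chi = math.sqrt(x)
    normal_tail = 0.5 * math.erfc(chi / math.sqrt(2))        # P(Z > chi)
    normal_pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)   # density at chi
    term = chi            # chi^(2r-1) / (1*3*...*(2r-1)) for r = 1
    series = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        series += term
        term *= x / (2 * r + 1)   # step from r to r + 1
    return 2 * normal_tail + 2 * normal_pdf * series

print(chi2_sf_odd(65.0, 15))
```

With threshold 65.0 and 15 degrees of freedom this comes out at about $3.4\cdot 10^{-8}$, i.e. the value attributed above to the approximation, not the $3.8\cdot 10^{-7}$ quoted in the documents.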

I'm thus asking how to directly derive the false-error rate for this test, preferably with an authoritative reference; and then, in the hope of explaining a much higher error rate observed in practice, what effect a slight bias in the source bits is expected to have on the false-error rate (e.g. if the bits are assumed independent with mean $0.5+\epsilon$).

Update: I understand why the approximation by a $\chi^2$ distribution does not work, how I can run Monte-Carlo simulations, and how in principle I can calculate exactly the odds that the test fails (for $\epsilon=0$, my C code counting the exact odds of each possible value of the test result is usable for $n$ a multiple of $16$ up to $160$, and gives results not contradicted by simulations). The remaining problems are that I'd like references, and that this exact approach hits a computational wall for $\epsilon\ne0$.
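A Monte-Carlo estimate along these lines could look like the sketch below. It is not the code used for the results reported here; the function names, the default seed and the use of Python's `random` module are all assumptions made for illustration. Each bit is drawn independently as 1 with probability $0.5+\epsilon$.

```python
import random

def pearson_statistic(counts):
    """Pearson statistic against equiprobable bins."""
    n = sum(counts)
    expected = n / len(counts)
    return sum((o - expected) ** 2 for o in counts) / expected

def draw_counts(n, eps, rng):
    """Histogram of n samples of 4 independent bits, each 1 w.p. 0.5 + eps."""
    counts = [0] * 16
    for _ in range(n):
        sample = 0
        for _ in range(4):
            sample = (sample << 1) | (rng.random() < 0.5 + eps)
        counts[sample] += 1
    return counts

def estimate_failure_rate(n, eps, trials, threshold=65.0, seed=0):
    """Fraction of simulated runs whose statistic exceeds the threshold."""
    rng = random.Random(seed)
    failures = sum(
        pearson_statistic(draw_counts(n, eps, rng)) > threshold
        for _ in range(trials)
    )
    return failures / trials
```

Plain sampling like this needs on the order of $10^8$ or more runs to resolve rates near $10^{-7}$, so it is mainly practical for lower thresholds or for larger $\epsilon$.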

[Figure: False-error rate for a Pearson test. Tentative results for the false-error rate (for $\epsilon=0$) as a function of the threshold, for different $n$ and per the $\chi^2$ distribution approximation.]

StasK · Accepted Answer · 2013-01-04T16:48:52.533

I think the discrepancy between the quoted and the actual (simulated) rate arises mainly because the asymptotic $\chi^2_{15}$ distribution is a very poor approximation for the tails of the sampling distribution. It may work OK near its center (around 15, give or take 5 or so), but pushing it to tiny tail probabilities is plainly inappropriate. By Murphy's law, the error goes the bad way, i.e., the approximation gives you a probability that is too small. You have to take higher-order approximations, like saddlepoint approximations, to get these tail probabilities right; I am sure some exist for the Pearson test, but I cannot point to any right away. If you have a lot of computing power at your disposal (as you might), you could try brute-forcing the multinomial probability computation, which would give you the exact answer.
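One way the brute-force computation could be organized for $\epsilon=0$ is sketched below. It is an assumption about how to do it, not a reference method: the name `pearson_exact_tail` is hypothetical. It uses the fact that with equal expected counts the statistic equals $(16/n)\sum O_i^2 - n$, so only the sum of squared counts has to be tracked while the bins are filled one at a time, and all counting is done in exact integers.

```python
from fractions import Fraction
from math import comb

def pearson_exact_tail(n, bins, threshold):
    """Exact P(statistic > threshold) for n draws over equiprobable bins."""
    # statistic = (bins / n) * S - n, where S is the sum of squared counts,
    # so statistic > threshold  <=>  S > n * (threshold + n) / bins.
    s_max = (Fraction(threshold) + n) * n // bins
    # ways[(m, s)]: number of ways to place m of the n labelled samples in
    # the bins handled so far, with squared counts summing to s <= s_max.
    ways = {(0, 0): 1}
    for _ in range(bins):
        nxt = {}
        for (m, s), count in ways.items():
            for o in range(n - m + 1):        # o samples go to this bin
                s_new = s + o * o
                if s_new > s_max:             # would already exceed the bound
                    break
                key = (m + o, s_new)
                nxt[key] = nxt.get(key, 0) + count * comb(n - m, o)
        ways = nxt
    # Outcomes using all n samples with S <= s_max pass the test.
    passing = sum(c for (m, _), c in ways.items() if m == n)
    return 1 - Fraction(passing, bins ** n)
```

For the cases in the question one would call, e.g., `float(pearson_exact_tail(80, 16, 65.0))`; in pure Python this may take a while at these sizes, and a compiled language would be the natural choice for anything larger.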

If each bit equals 1 with probability $0.5+\epsilon$, then each 4-bit pattern with $k$ ones and $4-k$ zeroes has probability $(0.5+\epsilon)^k (0.5-\epsilon)^{4-k}$. With some effort, you can probably derive the non-centrality parameter for the corresponding non-central chi-square distribution. My guess, off the top of my head, is that it will be the quadratic form with the vector given by the differences of the above "true" bin probabilities vs. the null value $1/16$, and the inverse of the multinomial covariance matrix in the middle. This is tedious but relatively straightforward work, typical for power analysis. The non-central chi-square has more mass to the right, so the error rate will go up with these $\epsilon$ biases. Update: The non-central chi-square is applicable in the shifted situation whenever the central chi-square is applicable in the central situation, but there is also evidence that it works a little better in finite samples in the central situation too, when the test statistic has a bias of $O(1/n)$. A lot of likelihood ratio test statistics do have a bias like that, which is generally rectified by a Bartlett correction. The Pearson test may also have this sort of bias, and the non-central chi-square might help, although again it will mostly help in the center of the distribution, less so in the tails.
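To put numbers on this, here is a small sketch of the non-centrality parameter, assuming the usual Pearson form $\lambda = n\sum_i (p_i - 1/16)^2/(1/16)$ for the quadratic form guessed at above. The function names are made up for illustration, and whatever tail probability one derives from $\lambda$ inherits the caveat that the chi-square approximation is unreliable far out in the tails.

```python
def bin_probabilities(eps):
    """Probability of each 4-bit pattern when P(bit = 1) = 0.5 + eps."""
    p1 = 0.5 + eps
    probs = []
    for value in range(16):
        ones = bin(value).count("1")
        probs.append(p1 ** ones * (1 - p1) ** (4 - ones))
    return probs

def noncentrality(n, eps):
    """Pearson-form non-centrality against the uniform null (1/16 per bin)."""
    null = 1 / 16
    return n * sum((p - null) ** 2 / null for p in bin_probabilities(eps))

for eps in (0.0, 0.01, 0.05):
    print(eps, noncentrality(128, eps))
```

Since $\sum_i p_i^2 = (0.5+2\epsilon^2)^4$ for these patterns, the sum collapses to $\lambda = n\left((1+4\epsilon^2)^4-1\right)\approx 16\,n\,\epsilon^2$ for small $\epsilon$, so under this approximation the shift is second order in the bias.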

Sorry that I am only giving you pointers, not the definitive answers. The latter may exist out there, but if you, being the expert in your field, are not familiar with them, chances are that there aren't any.

Thanks. One thing I do not get in your answer is why a non-central chi-square distribution would be a valid approximation when $\epsilon\ne0$, if a chi-square distribution is not a valid approximation when $\epsilon=0$. Note: see the update of the question for my current status. — fgrieu, Jan 04 '13 at 07:14
