11

I want to decide the capacity $C$ of a table so that it has residual probability less than $2^{-p}$ of overflowing, for a given $p\in[40\dots 120]$, assuming the number of entries follows a Poisson law with a given expectation $E\in[10^3\dots 10^{12}]$.

Ideally, I want the lowest integer C such that 1-CDF[PoissonDistribution[E],C] < 2^-p for given p and E; but I'm content with some C slightly higher than that. Mathematica is fine for manual computation, but I would like to compute C from p and E at compile time, which limits me to 64-bit integer arithmetic.

Update: in Mathematica (version 7), e = 1000; p = 40; c = Quantile[PoissonDistribution[e], 1 - 2^-p] gives 1231, which seems about right (thanks @Procrastinator); however it gives 1250 for both p = 50 and p = 60, which is wrong on the unsafe side (and that matters: my experiment repeats some $2^{25}$ times or more, and I want demonstrably less than $2^{-30}$ overall odds of failure). I want some crude but safe approximation using 64-bit integer arithmetic only, as available in C(++) at compile time.
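For reference, a candidate C can at least be double-checked at runtime by summing the Poisson tail in log space. The C++ sketch below is only an illustration (floating point, so of no use for the compile-time computation itself); the function name log2_poisson_tail_ge is made up here, and it relies on std::lgamma from <cmath>.

```cpp
// log2 of P(X >= c) for X ~ Poisson(mu), summed in log space.
// Runtime floating point only -- a sanity check for candidate capacities,
// not the compile-time 64-bit integer computation asked for above.
// Assumes c > mu, so the terms of the tail are decreasing.
#include <cmath>
#include <cstdio>

long double log2_poisson_tail_ge(long double mu, unsigned long long c) {
    // log of the first (largest) term  e^{-mu} * mu^c / c!
    long double log_first = -mu + c * std::log(mu) - std::lgamma((long double)c + 1);
    // sum of the remaining terms relative to the first; the ratios mu/(k+1) are < 1
    long double sum = 1.0L, term = 1.0L;
    for (unsigned long long k = c; term > 1e-30L * sum; ++k) {
        term *= mu / (long double)(k + 1);
        sum  += term;
    }
    return (log_first + std::log(sum)) / std::log(2.0L);
}

int main() {
    // Overflow probability of a capacity-C table is P(X > C) = P(X >= C + 1).
    std::printf("%.2Lf\n", log2_poisson_tail_ge(1000.0L, 1251));  // about -46: 1250 is unsafe for p = 50 or 60
    std::printf("%.2Lf\n", log2_poisson_tail_ge(1000.0L, 1385));  // about -100.1 (cf. the first answer below)
}
```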

fgrieu
  • 213
  • 1
    How about C = Quantile[PoissonDistribution[E],1-2^p]? –  Sep 04 '12 at 09:50
  • 1
    The leading term of the probability mass function of the Poisson dominates in the tail. – cardinal Sep 04 '12 at 10:25
  • 1
@Procrastinator: yes, that works in Mathematica (except for the sign of p, precision issues, and the names E and C, which are reserved). BUT I need a simple approximation of that, possibly crude (but on the safe side), using 64-bit integer arithmetic only! – fgrieu Sep 04 '12 at 10:37
  • Given that the parameter $p$ is large, you may consider a Normal approximation to the Poisson distribution, see 1 and 2. –  Sep 04 '12 at 11:28
  • 3
    Re the update: Mathematica 8 returns 1262 for $p=50$ and 1290 for $p=60$. Re Normal approximation (@Proc): this cannot be expected to work well in the tails, which is crucial to the calculation. – whuber Sep 04 '12 at 11:41
  • @cardinal: It is true that the leading term dominates if you go far enough out, but not necessarily for the range of values fgrieu gave, particularly for large $\mu$. – Douglas Zare Sep 04 '12 at 11:49
  • @fgrieu: Are floating point numbers and the typical library functions like log not available? – Douglas Zare Sep 04 '12 at 11:54
@DouglasZare: I checked; under C (ANSI and ISO 9899 prior to 2011), floating point (and of course log, sqrt..) cannot be used in an expression defining the static size of an array (except for C99 VLAs, which are seldom supported and became optional in the new C11 standard); AND it is not possible to introduce intermediary variables for clarity, except as an enum with the range limitation of int, or a preprocessor token with a 4095-character limitation after expression expansion. C++ is a lot more flexible, but use of log and friends still seems to be a no-no for conformant code. – fgrieu Sep 05 '12 at 10:02
Well, a crude upper bound is $2\mu + p + 1$. That's very coarse, but you could allocate that much memory first, and then find out how much you really need and only use that much of the array. – Douglas Zare Sep 05 '12 at 13:39
@Douglas Zare: I'd like accuracy within a few percent of E. There are a variable number of tables of size fixed at compile time (so that T[n][j] is the jth element of the nth table without intermediary pointers), and memory allocated in excess is not recoverable. Entries are assigned to one of the tables by a cryptographic MAC (the Poisson hypothesis errs when there are very few tables, but does so on the safe side). An overflow could be detected, but is irrecoverable. – fgrieu Sep 06 '12 at 21:30
  • 1
    Perhaps you should ask on stackoverflow. I'm not familiar with the constraints you have. I don't know what stops you from using dynamic memory allocation, or whether you can use branching to decide the size of the array, or what the costs are of defining an array which is twice the size you need (and then not using all of it). If some function like $\mu + \sqrt{\log\log \mu} \log \mu \sqrt \mu + p \frac{\sqrt{\mu}} {\log \mu}$ (just as an example) gave you the exact answer, would you be able to implement an approximation under your constraints or not? It seems like a programming problem now. – Douglas Zare Sep 07 '12 at 00:26
@DouglasZare: In standard C, two-dimensional arrays with a variable number of rows and columns, using the T[n][j] syntax, can only be implemented with an auxiliary table of pointers, with a perceptible detrimental effect on performance compared to a constant number of columns known at compile time. Expressions determining that constant can only involve integers, with no functions except those simulated as macros; even using intermediary variables is hairy. See the thread of this message for how I painfully compute E+ceil(12*sqrt(E)) (a constexpr sketch of that integer square root follows this comment thread). – fgrieu Sep 07 '12 at 16:45
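For what it is worth, in C++11 (which fgrieu notes is more flexible than C) the ceil(sqrt(E)) part can be computed at compile time in pure 64-bit integer arithmetic with a recursive constexpr binary search. This is only a sketch; the names isqrt_ceil, E_MEAN and CAPACITY are illustrative, and it does not help in plain C.

```cpp
// Smallest x with x*x >= n, i.e. ceil(sqrt(n)), by binary search on 64-bit integers only.
// C++11 constexpr (single return, recursion), so it is usable in an array bound.
constexpr unsigned long long isqrt_ceil(unsigned long long n,
                                        unsigned long long lo = 0,
                                        unsigned long long hi = 1ULL << 32) {
    return lo >= hi ? lo
         : ((lo + hi) / 2) * ((lo + hi) / 2) >= n
               ? isqrt_ceil(n, lo, (lo + hi) / 2)        // answer is <= mid
               : isqrt_ceil(n, (lo + hi) / 2 + 1, hi);   // answer is  > mid
}

// E + ceil(12*sqrt(E)) = E + ceil(sqrt(144*E)); 144*E stays below 2^64 for E up to 10^17.
constexpr unsigned long long E_MEAN   = 1000000ULL;                        // hypothetical expectation
constexpr unsigned long long CAPACITY = E_MEAN + isqrt_ceil(144 * E_MEAN);
static_assert(CAPACITY == 1012000ULL, "12 * sqrt(10^6) = 12000");
```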

2 Answers

11

A Poisson distribution with large mean is approximately normal, but be careful: you want a tail bound, and the normal approximation is proportionally less accurate out in the tails.

One approach, used in this MO question and for binomial distributions, is to recognize that the tail terms decrease faster than a geometric series, so you can write an explicit upper bound as a geometric series.

$$\begin{eqnarray}\sum_{k=D}^\infty \exp(-\mu)\frac{\mu^k}{k!} & \lt & \sum_{k=D}^\infty \exp(-\mu) \frac{\mu^D}{D!}\bigg(\frac \mu{D+1}\bigg)^{k-D} \\ & = & \exp(-\mu)\frac{\mu^D}{D!}\frac{1}{1-\frac{\mu}{D+1}} \\ & \lt & \exp(-\mu) \frac{\mu^D}{\sqrt{2\pi D}(D/e)^D} \frac{1}{1-\frac{\mu}{D+1}} \\ & = & \exp(D-\mu) \bigg(\frac{\mu}{D}\bigg)^D \frac{D+1}{\sqrt{2\pi D} (D+1-\mu)}\end{eqnarray} $$

The step from line 2 to line 3 uses Stirling's formula, $D! > \sqrt{2\pi D}\,(D/e)^D$. In practice I think you then want to solve $-p \log 2 = \log(\text{bound})$ numerically by binary search; Newton's method starting from an initial guess of $D = \mu + c \sqrt \mu$ should also work.

For example, with $p=100$ and $\mu = 1000$, the numerical solution I get is 1384.89. A Poisson distribution with mean $1000$ takes the values from $0$ through $1384$ with probability $1-1/2^{100.06}.$ The values $0$ through $1383$ occur with probability $1-1/2^{99.59}.$
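A small floating-point sketch of that binary search might look as follows (runtime only, so not the compile-time integer version the question asks for; the names log2_tail_bound and smallest_safe_d are invented here, and the crude $2\mu + p + 1$ bound from the comments is assumed to be a safe upper limit for the search).

```cpp
// Binary search for the smallest integer D with
//   exp(D - mu) * (mu/D)^D * (D + 1) / (sqrt(2*pi*D) * (D + 1 - mu))  <  2^-p,
// i.e. the last expression of the bound above, evaluated in log base 2 to avoid overflow.
#include <cmath>
#include <cstdio>

double log2_tail_bound(double mu, double d) {
    const double two_pi = 6.283185307179586;
    return ((d - mu) + d * std::log(mu / d)
            + std::log(d + 1) - 0.5 * std::log(two_pi * d) - std::log(d + 1 - mu))
           / std::log(2.0);
}

unsigned long long smallest_safe_d(double mu, double p) {
    unsigned long long lo = (unsigned long long)mu + 1;                              // bound needs D + 1 > mu
    unsigned long long hi = 2 * (unsigned long long)mu + (unsigned long long)p + 1;  // crude bound from the comments
    while (lo < hi) {                              // first D at which the bound drops below 2^-p
        unsigned long long mid = lo + (hi - lo) / 2;
        if (log2_tail_bound(mu, (double)mid) < -p) hi = mid; else lo = mid + 1;
    }
    return lo;
}

int main() {
    std::printf("%llu\n", smallest_safe_d(1000, 100));  // prints 1385, consistent with 1384.89 above
}
```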

Douglas Zare
  • 10,658
  • 1
    +1. Another approach relates Poisson tail probabilities (on the right) to tail probabilities of Gamma distributions (on the left), which can be closely (over)estimated with a saddlepoint approximation. – whuber Sep 04 '12 at 12:20
  • There's a long way from that to something restricted to 64-bit integer arithmetic (without exp, log, sqrt..) but I will work on it; thanks all! – fgrieu Sep 04 '12 at 14:15
  • (+1) Up to the invocation of Stirling's approximation (which is irrelevant), this is exactly the bound I was (opaquely) referencing in my comment to the OP. (For example, see here.) – cardinal Sep 04 '12 at 20:16
2

See P. Harremoës, Sharp Bounds on Tail Probabilities for Poisson Random Variables, https://helda.helsinki.fi/bitstream/handle/10138/229679/witmse_proc_17.pdf. The main inequalities there are as follows.

Let $Y$ be a Poisson random variable with parameter $\lambda$, and put $$G(x)= \sqrt{2\left(x\ln \frac{x}{\lambda} +\lambda-x\right)} \ \ {\rm sign} \left(x-\lambda\right).$$ Let $\Phi$ denote the cumulative distribution function of the standard normal law. Then, for every integer $k\ge 0$, $${\bf P}(Y<k)\le \Phi(G(k)) \le {\bf P}(Y\le k),$$ which is equivalent to $$\Phi(G(k-1)) \le {\bf P}(Y<k)\le \Phi(G(k))$$ for every integer $k>0$. Moreover, $\Phi(G(k+1/2)) \le {\bf P}(Y\le k)$, which implies that $$\Phi(G(k-1/2)) \le {\bf P}(Y<k)\le \Phi(G(k))$$ for every integer $k>0$.
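As an illustration (not from the paper), these bounds give a quick safe check of a candidate capacity $C$: since ${\bf P}(Y>C) = 1-{\bf P}(Y\le C) \le 1-\Phi(G(C+1/2))$ and $1-\Phi(x)\le \varphi(x)/x$ for $x>0$ (the standard Mills-ratio bound, which is an extra ingredient here), the overflow probability can be overestimated as below. The name log2_overflow_bound is invented for this runtime floating-point sketch, which is valid only when $C$ exceeds $\lambda$.

```cpp
// Safe (over)estimate of the overflow probability P(Y > C), Y ~ Poisson(lambda),
// using  P(Y <= C) >= Phi(G(C + 1/2))  together with  1 - Phi(x) <= phi(x)/x  (x > 0).
#include <cmath>
#include <cstdio>

double log2_overflow_bound(double lambda, double capacity) {
    const double x  = capacity + 0.5;                                   // continuity correction from the paper
    const double g2 = 2.0 * (x * std::log(x / lambda) + lambda - x);    // G(x)^2, positive for x > lambda
    const double g  = std::sqrt(g2);
    const double log_two_pi = 1.8378770664093453;                       // ln(2*pi)
    // log2 of phi(g)/g = exp(-g^2/2) / (g * sqrt(2*pi))
    return (-0.5 * g2 - std::log(g) - 0.5 * log_two_pi) / std::log(2.0);
}

int main() {
    // lambda = 1000, capacity 1385: the bound comes out a little below 2^-100,
    // consistent with the exact tail quoted in the other answer.
    std::printf("%.2f\n", log2_overflow_bound(1000, 1385));
}
```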

  • If you could write out the key equation (assuming there are only one or two) that would help in case the link goes dead at some time. – jbowman Nov 11 '18 at 03:00