
I am solving a problem: The number of new customers in the mall each day follows a Poisson distribution with $\lambda = 50$. Find approximately the probability that after one year (200 working days) the number of customers that visited was between 950 and 1100.
My idea: On day $i$ we have $X_i \sim \mathrm{Po}(\lambda)$ new customers. Now we need $P(950 \leq \sum_{i=1}^{200} X_i \leq 1100)$.
The $X_i$'s are independent and identically distributed. We have $200 > 30$ samples, and therefore the Central Limit Theorem should hold: $$ Z=\frac{S_{200}-200\lambda}{\sqrt{200\lambda}} \sim N(0,1).$$ Then $P(950 \leq \sum_{i=1}^{200} X_i\leq 1100) = P\left(\frac{950-200\cdot 50}{\sqrt{200\cdot 50}} \leq Z \leq \frac{1100-200\cdot 50}{\sqrt{200\cdot 50}}\right)$, and I could actually apply a continuity correction and have instead: $$ P\left(\frac{949.5-200\cdot 50}{\sqrt{200\cdot 50}} \leq Z \leq \frac{1100.5-200\cdot 50}{\sqrt{200\cdot 50}}\right)=P(-90.5 \leq Z \leq -88.9)=\Phi(-88.9) - \Phi(-90.5) = 1-\Phi(88.9) - 1 + \Phi(90.5)= \Phi(90.5)-\Phi(88.9).$$
Why are they so huge? At least in theory, the CLT should hold; the conditions were satisfied.

Side note: We have on average 50 people per day and also a deviation of 50, I guess. So suppose we alternate between 50 and 0 people per day; after 200 days we would have $100 \cdot 50 = 5000$, so fewer than 1100 seems small (if this kind of thought makes sense).
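For reference, the continuity-corrected standardization above can be checked numerically (a quick sketch in Python, although the answers below use R):

```python
# Standardize the continuity-corrected endpoints of the yearly total,
# which has mean 200*lambda = 10000 and sd sqrt(200*lambda) = 100.
mu = 200 * 50
sigma = (200 * 50) ** 0.5

z_lo = (949.5 - mu) / sigma    # ≈ -90.505
z_hi = (1100.5 - mu) / sigma   # ≈ -88.995
print(z_lo, z_hi)
```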

  • If you expect 50 visitors per day, then you would expect 50 times 200 = 10000 per year. This is far away from the number in question. – Michael M Sep 11 '22 at 09:49
  • @MichaelM yes but we have to consider also the deviation which is equal to the average I guess – tonythestark Sep 11 '22 at 10:02
  • "We have $200 > 30$ samples and therefore the Central Limit Theorem should hold" --- what's the basis on which "n>30" is relevant? (The CLT itself says nothing of the kind; is there some other explicit basis on which we should agree that n>30 establishes anything in particular?) – Glen_b Sep 11 '22 at 10:50
  • Why, for example, would 10 not be sufficient, or 150 not be needed? – Glen_b Sep 11 '22 at 10:54
  • @Glen_b there is a common practical rule https://stats.stackexchange.com/questions/2541/what-references-should-be-cited-to-support-using-30-as-a-large-enough-sample-siz – tonythestark Sep 11 '22 at 10:55
  • @Glen_b obviously the larger the sample size the better the approximation , 150 would give better results I think – tonythestark Sep 11 '22 at 10:56
  • That a rule is commonly found does not of itself suggest there's any basis to think it establishes anything -- this is why I asked about justification rather than popularity. In particular, with the Poisson, note that a Poisson(10000) is exactly the same distribution whether we consider it to be a sum of n=200 Poisson(50)s or n=50 Poisson(200)s or n=10 Poisson(1000)s or n=10000 Poisson(1)s or n=1 Poisson(10000). In similar fashion, a Poisson(1) can be n=1 Poisson(1)s or n=100,000 Poisson(1/100000)s. Clearly, then, the specific n tells us nothing; only the value of $n\lambda$ is relevant here – Glen_b Sep 11 '22 at 11:02
  • Consequently, from the properties of the Poisson itself (and in similar fashion, any other such infinitely divisible distribution, like the gamma, say) we immediately have a strong reason to doubt any specific n is sufficient for anything in general. – Glen_b Sep 11 '22 at 11:04
  • So how would you answer the question, about the approximation of this probability? – tonythestark Sep 11 '22 at 11:06
  • I recommend reading the first sentence of the most upvoted answer at the link in your comment (which is not the accepted answer). – Glen_b Sep 11 '22 at 12:20

2 Answers


You're looking at a case where (assuming independence, though it would seem to be a somewhat questionable assumption), the distribution of the number of visitors in $200$ days has $\mu=10000$ and $\sigma=100$. That means that the values we're looking at are in the ballpark of 90 standard deviations below the mean.

The question asked about an event in that vicinity, so the fact that the z-values are around $-90$ is simply because that's the event the question chose to ask about.

This is just a straight calculation based on the values in the question; how many standard deviations from the mean we are looking at has, of itself, nothing to do with the CLT (whether it 'holds' or not).

However, one thing to keep clearly in mind is that the approach to normality in an average or sum may have very poor relative accuracy in the extreme tail of the cdf on the left or the survivor function on the right (i.e. tail areas may be very inaccurate in the far tail), even in cases where relative error in the central area of the cdf (within a few standard deviations from the mean) is quite good.

I would not expect probabilities calculated this way to be remotely close in relative terms in this situation (the absolute error will of course be very small, because both the Poisson and the normal tail areas will be extremely small, even though one may be many, many orders of magnitude larger than the other).


My suspicion is that the person who wrote the question made a calculation error and thought that they were asking about values close to $\mu$ rather than values roughly $90$ standard deviations below it.

"So how would you answer the question, about the approximation of this probability?"

If there were a specific need for anything better than "if the assumptions are reasonable, this probability will be vanishingly small", then I might consider trying to use the connection between the Poisson and the chi-squared, but even there, accurate calculations at 90-ish sds below the mean would be difficult. Other possibilities would be to use still other approximations (there are normal approximations for various transformations of the Poisson, or of the chi-squared, for example, from which we could attempt to get a rough tail area) that might do better than the usual normal approximation, but again, accurate answers will be quite difficult; and out that far, approximations that are often quite useful may actually be worse.

I expect we could make an argument that the true probability will be smaller than the value from the normal approximation (I expect it will be much, much smaller), and so the normal approximation would be a kind of upper bound on that probability. We could then approximate that normal cdf via $\Phi(z) \approx \phi(z)/|z|$, which works well in the far left tail.

This should at least suffice to make it clear that the value of the required probability must be extremely small.
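As a numerical illustration of that bound (a sketch in Python; the helper name is mine), the log of the far-left-tail approximation $\phi(z)/|z|$ is easy to evaluate directly:

```python
import math

def log_phi_approx(z):
    """Approximate log(Phi(z)) for large negative z using the first
    term of the Mills-ratio expansion: Phi(z) ~ phi(z)/|z|."""
    # log(phi(z)) = -(z^2 + log(2*pi))/2; then subtract log|z|
    return -0.5 * (z * z + math.log(2 * math.pi)) - math.log(abs(z))

# At the right end of the interval (z ≈ -88.995) the log-probability
# is about -3965.5, so the normal upper bound is around e^(-3965.5).
print(round(log_phi_approx(-88.995), 1))
```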


It is possible to get R to give answers (on the log scale) this far down in the Poisson and the normal. These values are unlikely to be very accurate but they may perhaps give some sense of just how vastly different in magnitude the Poisson and the normal may become in the extreme tail:

> ppois(1100,10000,log=TRUE)
[1] -6476.302
> ppois(950,10000,log=TRUE)
[1] -6818.063
> pnorm(1100,10000,100,log=TRUE)
[1] -3965.908
> pnorm(950,10000,100,log=TRUE)
[1] -4100.549

Note that the Poisson values (log-cdf) are in the "vicinity" of $-6500$ and the normal approximation values are in the "vicinity" of $-4000$. We're looking at the normal areas being on the order of $\exp(2500)$ times larger than the thing they're approximating, which is of "similar" size to $10^{1000}$. (Here anything like similar or ballpark or vicinity is entirely unsuited to conveying how inaccurate these approximations are, my apologies; hence the use of scare quotes. Coming up with good expressions to actually motivate this without going back to mathematical statements is difficult.)
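The exact arithmetic on those quoted log-cdf values can be restated in base-10 terms (Python used purely as a calculator here):

```python
import math

log_pois = -6476.302   # ppois(1100, 10000, log=TRUE) from above
log_norm = -3965.908   # pnorm(1100, 10000, 100, log=TRUE) from above

gap = log_norm - log_pois         # natural-log gap, ≈ 2510.4
print(round(gap / math.log(10)))  # decimal orders of magnitude
```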

Glen_b
  • So you mean that we get such big values because the event the question asked about is highly unlikely to happen? Is the probability 0? – tonythestark Sep 11 '22 at 11:25
  • Sorry, I don't follow what you're asking there. The question (for whatever reason) asks about an interval between two values both somewhere near 1000, but 1000 happens to be 90 standard deviations below $\mu$. Asking why you're looking at $\Phi(-90.5)$ and $\Phi(-88.9)$ is simply no different from asking why the question asked for "the number of customers that visited was between 950 and 1100". It asks that because that's what it asked. Anything beyond that is speculation about what was in the mind of the asker. I have engaged in that speculation in my answer. – Glen_b Sep 11 '22 at 11:37
  • None of the probabilities are actually 0. They're just exceedingly small, even if we can't evaluate them accurately. Even the Poisson model itself will not be accurate this far down in the tail; the assumptions will simply not be exact enough in practice for the model to apply (the inexactness of the independence assumption alone would be enough to stymie us). – Glen_b Sep 11 '22 at 12:36
  • I am not sure how to compute the values of $\Phi$ for such large numbers; I know a method using a table, but $x$ never exceeds 3.99 there – tonythestark Sep 11 '22 at 17:47
  • The normal density function can be done on a calculator, but with such extreme tails you use the log of the density which is $-\frac12[x^2 +\log(2\pi)]$ and (since you divide by $|x|$ in the formula to approximate the extreme left tail of $\Phi$) you would then subtract $\log|x|$. But such figures as you'll get here are pretty meaningless. – Glen_b Sep 11 '22 at 21:55
  • For what it's worth, here's a comparison using R's implementation of $\Phi$ and this formula: pnorm(-90,log=TRUE) which gives [1] -4055.419 ... vs -(90^2+log(2*pi))/2-log(90) which gives [1] -4055.419. (there's a similarly simple formula for the upper tail). Beware using such a simple formula for small single-digit z values (like say much closer than z=-5), it's much less accurate then. – Glen_b Sep 11 '22 at 22:02
  • NB these negative numbers are log-probabilities, not probabilities. In this case you'd just work out the area to the left of the right end, and ignore the relatively minuscule (!) subtraction that the left end would give you; i.e. your extremely rough upper bound would just be to approximate the log of $\Phi(-88.9)$. To see where the suggested approximation comes from, you can get to it via the first term in an asymptotic expansion for the Mills ratio (due to Laplace), but you can easily derive this first term from l'Hôpital's rule; see https://stats.stackexchange.com/a/511847/805 – Glen_b Sep 11 '22 at 22:36
  • You would not be expected to do any of this for an elementary class exercise; so you could quite reasonably just report that the probability is exceedingly small. – Glen_b Sep 11 '22 at 22:37

The setup of the question is absurd, as originally pointed out by Michael M: the expected number over $200$ days is $10000$ with a standard deviation of $100$, so you are extremely unlikely to see from $950$ to $1100$; it is extremely deep in the tail, being about $90$ standard deviations below the expectation.

The actual probability is between $10^{-2813}$ and $10^{-2812}$, while a normal approximation gives a vastly different (though still vanishingly small) figure: a good absolute approximation (as one might hope from the Central Limit Theorem), since both are near zero, but a poor relative approximation, as often happens in the tails.
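That bracketing follows from converting the log-scale Poisson cdf (R's ppois(1100, 10000, log=TRUE), about -6476.302, quoted in the other answer) from natural logs to powers of ten; a quick check in Python:

```python
import math

log_p = -6476.302          # ppois(1100, 10000, log=TRUE), natural log
log10_p = log_p / math.log(10)
print(round(log10_p, 1))   # between -2813 and -2812
```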

Let's instead deal with more realistic numbers away from the extreme tails and look at the probability of seeing from $9950$ to $10100$. It should not be difficult to find the exact probability, for example with R

sum(dpois(9950:10100, 10000))
# 0.5353328

The normal approximation is a continuous distribution while the Poisson is discrete on the integers, so $P(9950 \leq \sum\limits_{i=1}^{200} X_i\leq 10100) = P(9949 \lt \sum\limits_{i=1}^{200} X_i \lt 10101)$ in reality but not in your approximation. You could use a continuity correction and find the probability of seeing values between $9949.5$ and $10100.5$. This leads to $\Phi\left(\frac{10100.5 - 10000}{\sqrt{10000}}\right) -\Phi\left(\frac{9949.5 - 10000}{\sqrt{10000}}\right) = \Phi\left(1.005\right) -\Phi\left(-0.505\right)$, which R tells us is

pnorm(1.005) - pnorm(-0.505)
# 0.5357722

and is a reasonably good approximation both in absolute and relative terms.
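If R is not to hand, the same continuity-corrected figure can be reproduced with the error function (a short sketch in Python; the helper name is mine):

```python
import math

def Phi(z):
    """Standard normal cdf written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 10000.0, 100.0
lo = (9949.5 - mu) / sigma      # -0.505
hi = (10100.5 - mu) / sigma     #  1.005
print(round(Phi(hi) - Phi(lo), 7))  # ≈ 0.5357722, vs the exact 0.5353328
```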

Henry