Say I have a list of 10000 sorted counts $\ge 0$ that sum to 20000, and I want to test whether they came from placing 20000 balls randomly, independently and uniformly into 10000 cells. A straight chi-squared test would be reasonable if I assumed independence, but a list of 10000 counts all equal to two would be blessed by chi-squared, and I'd be somewhat skeptical of its independence (and of its randomness as well). I could compare the counts with a $\mathrm{Binomial}(20000,1/10000) \approx \mathrm{Poisson}(2)$ distribution, but the counts are no longer independent, so a chi-squared test is probably no longer reasonable even if I compute the covariance matrix for the counts.
Surely this has been dealt with before. (I'm interested in the case where the probabilities are non-uniform, but I want to understand the simple case first.)
Added: Calculation of variances.
(All the cleverness in what follows is due to Robert Israel and André Nicolas.)
If we draw $n$ samples from a $\mathrm{Poisson}(\lambda)$ distribution, then the expected number of times we draw a count of $r$ is $$E(X_r) = n \frac{\lambda^r}{r!}e^{-\lambda}.$$ As the process of drawing a count of $r$ can be considered as $n$ independent Bernoulli trials, the variance would be $$\mathrm{Var}(X_r) = n \frac{\lambda^r}{r!}e^{-\lambda}\left(1-\frac{\lambda^r}{r!}e^{-\lambda}\right).$$
Now, if we place $\lambda n$ balls in $n$ cells we will (as $n\to\infty$) get the same expected numbers of counts. The variance is somewhat more problematic. Let $Y_{r,i}$ be an indicator random variable for the event that cell $i$ receives exactly $r$ balls, so that $X_r = \sum_i Y_{r,i}$ and $\mathrm{Var}(X_r) = E(X_r^2)-E(X_r)^2$. Split $X_r^2$ into a sum of the terms $Y_{r,i}^2=Y_{r,i}$ (as $Y_{r,i}$ is either $0$ or $1$) and a sum of the cross terms $Y_{r,i}Y_{r,j}$ over $i\ne j$. For $i\ne j$ we have $$E(Y_{r,i}Y_{r,j}) = \binom{\lambda n}{2r}\binom{2r}{r}\left(\frac{1}{2}\right)^{2r}\left(\frac{2}{n}\right)^{2r}\left(1-\frac{2}{n}\right)^{\lambda n - 2r}$$ (choose the $2r$ balls that land in cells $i$ and $j$, then the $r$ of them that land in cell $i$). Writing $\pi_r = E(Y_{r,i})$, so that $E(X_r) = n\pi_r$, this gives $$ \mathrm{Var}(X_r) = n \pi_r + n(n-1)E(Y_{r,i}Y_{r,j}) - (n\pi_r)^2.$$ Taking the limit as $n\to\infty$ we get $$\mathrm{Var}(X_r)=n\left(\frac{\lambda^r}{r!}e^{-\lambda}-\frac{r^2\lambda^{2r-1}-(2r-1)\lambda^{2r}+\lambda^{2r+1}}{(r!)^2}e^{-2\lambda}\right)$$ (taking the limit is fiddly; see Robert Israel's post for the simplest case).
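As a quick numerical sanity check (a sketch, not part of the derivation; the choices $r = 1$, $\lambda = 2$, $n = 10000$ are arbitrary), we can simulate the balls-in-cells process and compare the empirical variance of $X_r$ with both the limiting formula and the naive binomial variance:
r = 1; lambda = 2; n = 10000
pr = lambda^r * exp(-lambda) / factorial(r)
# number of cells receiving exactly r of the lambda*n balls, replicated
X = replicate(2000, sum(tabulate(sample(1:n, lambda*n, replace=TRUE),
                                 nbins=n) == r))
var(X)  # simulated variance
n * (pr - (r^2*lambda^(2*r-1) - (2*r-1)*lambda^(2*r) + lambda^(2*r+1)) *
          exp(-2*lambda) / factorial(r)^2)  # limiting formula
n * pr * (1 - pr)  # naive binomial variance, for contrast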
This can be extended to compute the covariances (I'm working on it in my copious free time) and from there we can compute a better chi-squared statistic. I am still not convinced that we cannot do better. I am also surprised that this is not a well-known problem.
Update: This is the final update. (I have now gotten the algebra to cover the handwaving in the initial formulation.)
The covariance of the counts is $$\Sigma = n(D - p'p - q'q)$$
where $p = (p_0,p_1,\dots)$ and $q = (q_0,q_1,\dots)$ are row vectors (so $p'p$ and $q'q$ are outer products), with
$$p_i = \frac{\lambda^i}{i!}e^{-\lambda},$$ and
$$q_i = \frac{i-\lambda}{\sqrt{\lambda}}\frac{\lambda^i}{i!}e^{-\lambda},$$
and $D$ is the diagonal matrix made of the elements of $p$.
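As a sanity check (a sketch, with $\lambda = 2$, $n = 10000$, and the counts truncated to $0,\dots,7$), we can compare this $\Sigma$ with the empirical covariance of simulated count vectors:
lambda = 2; n = 10000
p = sapply(0:7, function(i){lambda^i*exp(-lambda)/factorial(i)})
q = p * (0:7 - lambda)/sqrt(lambda)
Sigma = n * (diag(p) - p %o% p - q %o% q)
# each replication: drop lambda*n balls into n cells, tally counts 0..7
sim = t(replicate(5000, tabulate(tabulate(sample(1:n, lambda*n,
    replace=TRUE), nbins=n) + 1, nbins=8)))
round(cov(sim) - Sigma)  # entries should be small relative to diag(Sigma)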
Note that the covariances are far from 0, making the assumption of
independence underlying the use of a straight $\chi^2$ test somewhat suspect.
This result is kind of cute, as we can use the Sherman-Morrison-Woodbury formula to find the inverse (of a truncation of the infinite matrix: the full matrix is singular, since the constraints on the total number of cells and the total number of balls each contribute a null direction).
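Concretely (a sketch, reusing p, q, and Sigma from the check above): with $U = (p'\ q')$ we have $\Sigma = n(D - UU')$, and the Woodbury identity $(D-UU')^{-1} = D^{-1} + D^{-1}U(I - U'D^{-1}U)^{-1}U'D^{-1}$ reduces the inversion to a $2\times2$ solve:
U = cbind(p, q)  # the rank-2 update
Dinv = diag(1/p)  # D is diagonal, so its inverse is immediate
SigmaInv = (Dinv + Dinv %*% U %*% solve(diag(2) - t(U) %*% Dinv %*% U) %*%
    t(U) %*% Dinv) / n
max(abs(SigmaInv %*% Sigma - diag(8)))  # should be near machine precision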
We can use the fact that if an $n$-dimensional random variable $X$ is distributed as $N(\mu,\Sigma)$ then $(X-\mu)'\Sigma^{-1}(X-\mu)$ is a $\chi^2_n$ random variable. As the number of cells with each count is essentially a binomial random variable, we can pretend that, at least for the small counts, these numbers are normally distributed. This suggests using $(x-\mu)'\Sigma^{-1}(x-\mu)$ as a test statistic, where $x$ is the vector of numbers of small counts. Ignoring the larger counts has the nice side effect of making the problem full-dimensional (the truncated covariance matrix is invertible). If we do this in practice, we see that if the number of bins we use is small enough, the result is nearly $\chi^2_n$ distributed.
# Poisson(2) probabilities p_i and correction vector q_i for counts 0..7
p = sapply(0:7,function(i){2^i*exp(-2)/factorial(i)})
q = sapply(0:7,function(i){2^i*exp(-2)*(i-2)/(factorial(i)*sqrt(2))})
# Sigma/n = D - p'p - q'q; the statistic needs the inverse of Sigma
poiscov = diag(p) - (p %o% p + q %o% q)
poiscovinv = solve(10000 * poiscov)
# drop 20000 balls into 10000 cells, then tally how many cells got 0..7 balls
gendat = function(){tabulate(tabulate(sample(1:10000,20000,replace=TRUE),
nbins=10000) + 1,nbins=8)}
dat = t(replicate(10000, gendat()))
mn = 10000 * p  # expected number of cells with each count
ch = apply(dat,1,function(v){(v-mn) %*% poiscovinv %*% (v-mn)})
The mean and variance look about right (a $\chi^2_8$ distribution has mean $8$ and variance $16$):
> mean(ch)
[1] 8.092341
> var(ch)
[1] 17.10295
It passes the eyeball goodness-of-fit test:
> library(MASS)
> truehist(ch)
> x = seq(0,40,length=100)
> lines(x,dchisq(x,8))
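For a more quantitative check than the eyeball test, one could compare the simulated statistics against the $\chi^2_8$ CDF directly (a sketch; with 10000 replicates, even mild error in the normal approximation may show up as a rejection):
> ks.test(ch, "pchisq", df = 8)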

It turns out, unsurprisingly, that the distribution looks increasingly unlike a $\chi^2_n$ distribution as $n$ increases: the tail becomes much heavier, and I am not certain what to do about that. I have tried adding all of the counts in the tail to the final bin, but that does not seem to help.
I cannot say that this test is definitely more powerful than the straight chi-squared test, as I can find points that are accepted by one and not the other, but it does seem to take more of the structure of the problem into account. Intuitively, this method takes the shape of the contours of the probability distribution into account. Unsurprisingly, it places a fair amount of weight on the constraint on the sum of the bins, and it actually weights the higher-numbered bins less than the smaller ones (the opposite of the chi-squared test).
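(A rough way to see the weighting, reusing poiscovinv and p from the code above: compare the diagonal of the inverse covariance with the $1/E_i$ weights of the plain chi-squared statistic.)
> rbind(quadratic = diag(poiscovinv), chisq = 1/(10000*p))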

