
Background

Suppose we observe $n$ IID Bernoulli variables and our null hypothesis is that their common probability of success is $p$. Denote by $\mathbb{1}_{\{i\}}$ the outcome of observation $i$.

Then by the central limit theorem

$\frac{\frac{1}{\sqrt{n}}\sum_{i=1}^n (\mathbb{1}_{\{i\}} - p)}{\sqrt{p \cdot (1 - p)}} \rightarrow N(0, 1),$ which can be used for hypothesis testing.
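For concreteness, here is a minimal sketch of the resulting z-test in Python; the sample size, null probability and simulated data below are illustrative, not part of the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 1000, 0.5                     # hypothetical sample size and null probability
x = rng.binomial(1, p, size=n)       # simulated Bernoulli observations

# Standardized sum from the classical CLT
z = (x - p).sum() / np.sqrt(n * p * (1 - p))
p_value = 2 * stats.norm.sf(abs(z))  # two-sided p-value
print(z, p_value)
```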

Suppose now that the null hypothesis is instead that each variable has an individual probability of success, $p_i$ (they are still independent). Then a simple argument allows us to apply Lyapunov's version of the CLT and conclude that

$\frac{\sum_{i=1}^n (\mathbb{1}_{\{i\}} - p_i)}{\sqrt{\sum_i p_i \cdot (1 - p_i)}} \rightarrow N(0, 1),$ which can then be used to test this composite hypothesis.
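A corresponding sketch for the heterogeneous case; the vector of individual success probabilities is made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1000
p_i = rng.uniform(0.2, 0.8, size=n)  # hypothetical individual success probabilities
x = rng.binomial(1, p_i)             # one Bernoulli draw per probability

# Standardized sum from the Lyapunov CLT
z = (x - p_i).sum() / np.sqrt((p_i * (1 - p_i)).sum())
p_value = 2 * stats.norm.sf(abs(z))
print(z, p_value)
```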

Question

If instead we have $k$ categories, our null hypothesis is that the probability of category $i$ is $p_i$, and we have observed $n_i$ occurrences of each outcome $i$, then we can use the chi-square goodness-of-fit test, which states that if the $n_i$ sum to $n$ then

$\sum_{i=1}^k \frac{(n_i - n \cdot p_i)^2}{n \cdot p_i} \rightarrow \chi^2(k - 1)$.
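A minimal sketch of the standard test using scipy.stats.chisquare; the category probabilities and simulated counts are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1000
p = np.array([0.2, 0.3, 0.5])   # hypothetical null probabilities for k = 3 categories
counts = rng.multinomial(n, p)  # observed counts n_i

# chisquare computes sum_i (n_i - n p_i)^2 / (n p_i) with k - 1 degrees of freedom
stat, p_value = stats.chisquare(counts, f_exp=n * p)
print(stat, p_value)
```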

Analogously to the above, I instead want to form a null hypothesis where I conduct $n$ experiments, but for each of them the $k$ categories have separate probabilities $\{(p_1^1, p_2^1, \ldots p_k^1), (p_1^2, p_2^2, \ldots p_k^2), \ldots, (p_1^n, p_2^n, \ldots p_k^n) \}.$

Is there a generalization of the chi-square goodness-of-fit test applicable to this kind of hypothesis? Looking briefly at the proof for the standard case gives me the feeling that it should be possible, but surely I can't be the first one asking this?

  • This is the standard test. You merely have arranged $nk$ bins into a rectangular array and there are $n$ independent linear constraints on the rows of the array. – whuber Apr 22 '22 at 16:33
  • @whuber Hm, ok. Checking that I understood you correctly: You mean I view it as I have $nk$ categories and compute the $\chi^2$-statistic where I have a single observation per category (equal to either zero or one) and the null hypothesis is that this is distributed as $\chi^2 (nk - n)$? – Christian Apr 22 '22 at 16:46
  • That's correct. – whuber Apr 22 '22 at 16:50
  • Thanks. Clearly a sign of my not understanding the test in detail, since I just swallowed the standard assumption that the probabilities for the respective bins must sum to one. – Christian Apr 22 '22 at 16:53
  • I go into this in more detail at https://stats.stackexchange.com/a/17148/919, laying out the necessary conditions and providing a cautionary example. – whuber Apr 22 '22 at 17:32
  • @whuber Are you sure? I did the following experiment to get a feeling for it:

    3 different outcomes, hypothesized probabilities 1/3, 1/3, 1/3; actual probabilities 1/3 + epsilon, 1/3, 1/3 - epsilon.

    Draw such a variable n times (from the actual distribution) and compute the p-value corresponding to the chi-square statistic in two different ways.

    1: Counts and expected counts in 3 bins, distribution with 3 - 1 = 2 degrees of freedom. 2: Form a grid of n × 3 bins, compute counts and expected counts in them, distribution with 3n - n = 2n degrees of freedom.

    Method 1 yields an extreme p-value for n = 1000; method 2 yields a p-value close to 0.5 regardless of n (see the simulation sketch after these comments).

    – Christian Apr 25 '22 at 13:14
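For reference, here is a rough sketch of the simulation described in the last comment; the value of epsilon and the NumPy/SciPy implementation details are my own choices, not Christian's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, eps = 1000, 0.05
p_null = np.array([1/3, 1/3, 1/3])              # hypothesized probabilities
p_true = np.array([1/3 + eps, 1/3, 1/3 - eps])  # actual probabilities

draws = rng.choice(3, size=n, p=p_true)

# Method 1: aggregate into 3 bins, 3 - 1 = 2 degrees of freedom
counts = np.bincount(draws, minlength=3)
stat1 = ((counts - n * p_null) ** 2 / (n * p_null)).sum()
p1 = stats.chi2.sf(stat1, df=2)

# Method 2: n x 3 grid of 0/1 counts, expected count p_j per cell, 3n - n = 2n degrees of freedom
obs = np.zeros((n, 3))
obs[np.arange(n), draws] = 1
stat2 = ((obs - p_null) ** 2 / p_null).sum()
p2 = stats.chi2.sf(stat2, df=2 * n)

print(p1, p2)  # p1 becomes extreme as n grows, while p2 stays near 0.5
```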

1 Answer


I could not find an answer to this stated as a theorem anywhere, so I had to formulate and prove one myself.

I have $n$ independent variables $\{X_{i}\}_{i=1}^n$, each attaining a value in $\{1, \ldots, k\}$, and my null hypothesis is that $P(X_i = j) = p_i^j$ for all $i$ and $j$.

Then for each $j \in \{ 1, \ldots, k \}$, Lyapunov's version of the CLT yields

$Z_j := \frac{\sum_{i=1}^n (\mathbb{1}_{\{X_i = j\}} - p_i^j)}{\sqrt{\sum_{i=1}^n p_i^j \cdot (1 - p_i^j)}} \rightarrow N(0, 1).$
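A small sketch of this step; the probability matrix P (one row of null probabilities per trial, each row summing to one) and the simulated outcomes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 4
P = rng.dirichlet(np.ones(k), size=n)              # hypothetical null probabilities p_i^j
x = np.array([rng.choice(k, p=row) for row in P])  # simulated outcomes X_i

ind = np.zeros((n, k))
ind[np.arange(n), x] = 1                           # indicators 1_{X_i = j}

# Z_j = sum_i (1_{X_i = j} - p_i^j) / sqrt(sum_i p_i^j (1 - p_i^j))
Z = (ind - P).sum(axis=0) / np.sqrt((P * (1 - P)).sum(axis=0))
print(Z)
```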

So we have $k$ asymptotically standard normal variables, and to proceed we need to determine their covariance matrix, $\text{cov}(\overline{Z})$.

Straightforward computations yield

$\text{cov}(Z_r, Z_s) = \left\{ \begin{array}{ll} 1 & \text{if } r = s\\ \frac{-\sum_{i = 1}^{n} p_i^r \cdot p_i^s}{\sqrt{\left(\sum_{i = 1}^{n} p_i^r \cdot (1 - p_i^r)\right) \cdot \left(\sum_{i = 1}^{n} p_i^s \cdot (1 - p_i^s)\right)}} & \text{otherwise} \end{array} \right.$
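The same covariance matrix can be sketched in code, again for a hypothetical n x k probability matrix P.

```python
import numpy as np

def z_covariance(P: np.ndarray) -> np.ndarray:
    """Covariance matrix of (Z_1, ..., Z_k) under the null, for an n x k probability matrix P."""
    var = (P * (1 - P)).sum(axis=0)                      # sum_i p_i^j (1 - p_i^j), per category
    cov = -(P[:, :, None] * P[:, None, :]).sum(axis=0)   # off-diagonal numerators: -sum_i p_i^r p_i^s
    cov /= np.sqrt(np.outer(var, var))
    np.fill_diagonal(cov, 1.0)                           # cov(Z_r, Z_r) = 1
    return cov

# Example with a hypothetical 500 x 4 probability matrix
rng = np.random.default_rng(0)
print(z_covariance(rng.dirichlet(np.ones(4), size=500)))
```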

By construction $\text{cov}(\overline{Z})$ is real and symmetric, hence by the spectral theorem it has orthonormal eigenvectors forming a basis of $\mathbb{R}^k$. Letting $V$ be the matrix whose columns are these eigenvectors and $D$ the diagonal matrix of corresponding eigenvalues, we have

$\text{cov}(\overline{Z}) = V D V^{\top}$.

Furthermore it's easily seen that

$\text{cov}(V^{\top} \overline{Z}) = V^{\top}\text{cov}(\overline{Z}) V = D$.

Denote the elements of $V^{\top} \overline{Z}$ by $(\xi_i)_i$, the elements of $D$ by ${(d_{i,j})}_{i, j}$ (of course $d_{i,j} = 0$ for $i \neq j$) and define the vector

$\overline{W} = (\frac{\xi_i}{\sqrt{d_{i,i}}} )_{\{i: d_{i,i} \neq 0\}}$.

It is now clear that the elements of $\overline{W}$ are independent standard normals, and that $\overline{W}$ has length $L = \text{rank}(\text{cov}(\overline{Z}))$.

Hence under the null hypothesis the distribution of $\| \overline{W} \|^2$ is $\chi^2(L)$, and this squared norm is the statistic we compute when conducting the test.
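Putting the pieces together, one possible implementation of the whole procedure could look as follows; the function name, the tolerance used to detect zero eigenvalues and the simulated data are my own choices.

```python
import numpy as np
from scipy import stats

def heterogeneous_gof_test(x: np.ndarray, P: np.ndarray, tol: float = 1e-10):
    """Goodness-of-fit test when trial i has its own category probabilities P[i, :].

    x : length-n array of observed categories in {0, ..., k-1}
    P : n x k matrix of null probabilities, each row summing to one
    Returns (statistic, degrees of freedom, p-value).
    """
    n, k = P.shape
    ind = np.zeros((n, k))
    ind[np.arange(n), x] = 1                  # indicators 1_{X_i = j}

    scale = np.sqrt((P * (1 - P)).sum(axis=0))
    Z = (ind - P).sum(axis=0) / scale         # the k standardized sums Z_j

    # Covariance of Z under the null and its eigendecomposition
    cov = -(P[:, :, None] * P[:, None, :]).sum(axis=0) / np.outer(scale, scale)
    np.fill_diagonal(cov, 1.0)
    eigvals, V = np.linalg.eigh(cov)

    # Keep components with non-zero eigenvalues and standardize them
    keep = eigvals > tol
    W = (V.T @ Z)[keep] / np.sqrt(eigvals[keep])

    stat = (W ** 2).sum()
    df = int(keep.sum())                      # L = rank(cov(Z))
    return stat, df, stats.chi2.sf(stat, df)

# Example under a hypothetical null with n = 500 trials and k = 4 categories
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=500)
x = np.array([rng.choice(4, p=row) for row in P])
print(heterogeneous_gof_test(x, P))
```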