Let's handle the simplest case to try to provide the most intuition. Let $X_1, X_2, \ldots, X_n$ be an iid sample from a discrete distribution with $k$ outcomes. Let $\pi_1,\ldots,\pi_k$ be the probabilities of each particular outcome, and let $S_i$ denote the number of sample points equal to the $i$th outcome. We are interested in the (asymptotic) distribution of the chi-squared statistic
(asymptotic) distribution of the chi-squared statistic
$$
X^2 = \sum_{i=1}^k \frac{(S_i - n \pi_i)^2}{n\pi_i} \> .
$$
Here $n \pi_i$ is the expected number of counts of the $i$th outcome.
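As a concrete illustration, here is a minimal sketch in Python of computing $X^2$ from observed counts; the fair-die probabilities and counts below are made up for the example.

```python
import numpy as np

# Hypothetical example: a fair six-sided die rolled n = 600 times.
pi = np.full(6, 1 / 6)                           # null probabilities pi_1, ..., pi_k
counts = np.array([90, 110, 95, 105, 100, 100])  # observed counts S_1, ..., S_k
n = counts.sum()

expected = n * pi                                # expected counts n * pi_i
X2 = np.sum((counts - expected) ** 2 / expected)
print(X2)                                        # the chi-squared statistic (2.5 here)
```

(For reference, `scipy.stats.chisquare(counts, expected)` computes the same statistic together with a p-value from the $\chi^2_{k-1}$ reference distribution.)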
A suggestive heuristic
Define $U_i = (S_i - n\pi_i) / \sqrt{n \pi_i}$, so that $X^2 = \sum_i
U_i^2 = \newcommand{\U}{\mathbf{U}}\|\U\|^2_2$ where $\U =
(U_1,\ldots,U_k)$.
Since $S_i$ is $\mathrm{Bin}(n,\pi_i)$, by the Central Limit Theorem,
$$
\newcommand{\convd}{\xrightarrow{d}}\newcommand{\N}{\mathcal{N}}
T_i = \frac{U_i}{\sqrt{1-\pi_i}} = \frac{S_i - n \pi_i}{\sqrt{ n\pi_i(1-\pi_i)}} \convd \N(0, 1) \>,
$$
hence we also have $U_i \convd \N(0, 1-\pi_i)$.
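To spell out the variance calculation behind the last display,
$$
\mathrm{Var}(U_i) = \frac{\mathrm{Var}(S_i)}{n\pi_i} = \frac{n\pi_i(1-\pi_i)}{n\pi_i} = 1 - \pi_i \>,
$$
which is exactly why dividing $U_i$ by $\sqrt{1-\pi_i}$ standardizes it.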
Now, if the $T_i$ were (asymptotically) independent (which they aren't), then we could argue that
$\sum_i T_i^2$ is asymptotically $\chi_k^2$ distributed. But note that, since the counts satisfy $\sum_i S_i = n$, the quantity $T_k$ is a deterministic function of $(T_1,\ldots,T_{k-1})$, and so the $T_i$ variables can't possibly be independent.
Hence, we must take the covariance between them into account somehow. It turns out that the "correct" way to do this is to use the $U_i$ instead, and the covariance between the components of $\U$ also changes the asymptotic distribution from what we might have thought was $\chi_{k}^2$ to what is, in fact, a $\chi_{k-1}^2$.
Some details on this follow.
A more rigorous treatment
It is not hard to check that, in fact,
$\newcommand{\Cov}{\mathrm{Cov}}\Cov(U_i, U_j) = - \sqrt{\pi_i
\pi_j}$ for $i \neq j$.
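To see where this comes from, note that the counts $(S_1, \ldots, S_k)$ are jointly multinomial, so $\Cov(S_i, S_j) = -n \pi_i \pi_j$ for $i \neq j$, and hence
$$
\Cov(U_i, U_j) = \frac{\Cov(S_i, S_j)}{\sqrt{n\pi_i}\sqrt{n\pi_j}} = \frac{-n\pi_i\pi_j}{n\sqrt{\pi_i\pi_j}} = -\sqrt{\pi_i \pi_j} \>.
$$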
So, the covariance matrix of $\U$ is
$$
\newcommand{\sqpi}{\sqrt{\boldsymbol{\pi}}}
\newcommand{\A}{\mathbf{A}}
\A = \mathbf{I} - \sqpi \sqpi^T \>,
$$
where $\sqpi = (\sqrt{\pi_1}, \ldots, \sqrt{\pi_k})$. Note that
$\A$ is symmetric and idempotent, i.e., $\A = \A^2 =
\A^T$. So, in particular, if $\newcommand{\Z}{\mathbf{Z}}\Z =
(Z_1, \ldots, Z_k)$ has iid standard normal components, then $\A
\Z \sim \N(0, \A)$. (NB The multivariate normal distribution in this case is degenerate.)
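The idempotence is a one-line computation using $\sqpi^T \sqpi = \sum_i \pi_i = 1$:
$$
\A^2 = (\mathbf{I} - \sqpi\sqpi^T)(\mathbf{I} - \sqpi\sqpi^T) = \mathbf{I} - 2\sqpi\sqpi^T + \sqpi(\sqpi^T\sqpi)\sqpi^T = \mathbf{I} - \sqpi\sqpi^T = \A \>,
$$
and the distributional claim follows since $\Cov(\A\Z) = \A \,\Cov(\Z)\, \A^T = \A \A^T = \A$.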
Now, writing the count vector $(S_1,\ldots,S_k)$ as a sum of $n$ iid indicator vectors and applying the Multivariate Central Limit Theorem, the vector $\U$ has an asymptotic multivariate normal distribution with mean $0$ and covariance $\A$.
So, $\U$ has the same asymptotic distribution as $\A \Z$; hence, by the continuous mapping theorem, the asymptotic distribution of $X^2 = \U^T \U$ is the same as the distribution of $\Z^T \A^T \A \Z = \Z^T \A \Z$.
But, $\A$ is symmetric and idempotent, so (a) it has orthogonal
eigenvectors, (b) all of its eigenvalues are 0 or 1, and (c)
the multiplicity of the eigenvalue 1 is $\mathrm{rank}(\A)$. This means that $\A$ can be decomposed as $\A = \mathbf{Q D Q}^T$ where $\mathbf{Q}$ is orthogonal and $\mathbf{D}$ is a diagonal matrix with $\mathrm{rank}(\A)$ ones on the diagonal and the remaining diagonal entries being zero.
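Writing $\mathbf{W} = \mathbf{Q}^T \Z$, which is again $\N(0, \mathbf{I})$ because $\mathbf{Q}$ is orthogonal, the quadratic form becomes
$$
\Z^T \A \Z = \Z^T \mathbf{Q D Q}^T \Z = \mathbf{W}^T \mathbf{D} \mathbf{W} = \sum_{i \,:\, D_{ii} = 1} W_i^2 \>,
$$
a sum of $\mathrm{rank}(\A)$ independent squared standard normals.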
Thus, $\Z^T \A \Z$ must be $\chi^2_{k-1}$ distributed, since in our case $\A$ has rank $k-1$: because $\sqpi^T \sqpi = \sum_i \pi_i = 1$, the matrix $\A$ is the orthogonal projection onto the $(k-1)$-dimensional subspace orthogonal to $\sqpi$.
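A quick simulation makes the $\chi^2_{k-1}$ limit visible; this is a minimal sketch, with arbitrary choices of $\boldsymbol{\pi}$, $n$, and the number of replications.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

pi = np.array([0.1, 0.2, 0.3, 0.4])   # arbitrary null probabilities, k = 4
n, reps = 500, 20000                  # sample size and number of replications

# Draw the count vectors (S_1, ..., S_k) as multinomials and compute
# X^2 = sum_i (S_i - n pi_i)^2 / (n pi_i) for each replication.
counts = rng.multinomial(n, pi, size=reps)
expected = n * pi
X2 = ((counts - expected) ** 2 / expected).sum(axis=1)

# Compare empirical quantiles of X^2 with chi^2 quantiles on k - 1 = 3 df.
qs = [0.5, 0.9, 0.95, 0.99]
print(np.quantile(X2, qs))                  # empirical quantiles
print(stats.chi2.ppf(qs, df=len(pi) - 1))   # chi^2_{k-1} quantiles (should be close)
```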
Other connections
The chi-squared statistic is also closely related to likelihood ratio
statistics. Indeed, it is a Rao score statistic and can be viewed as a
Taylor-series approximation of the likelihood ratio statistic.
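To make the connection explicit: the likelihood ratio statistic for the same null hypothesis is $G^2 = 2 \sum_{i=1}^k S_i \log\big(S_i / (n\pi_i)\big)$, and a second-order Taylor expansion of the logarithm together with $\sum_i (S_i - n\pi_i) = 0$ gives
$$
G^2 = \sum_{i=1}^k \frac{(S_i - n\pi_i)^2}{n\pi_i} + o_P(1) = X^2 + o_P(1)
$$
under the null, so the two statistics share the same $\chi^2_{k-1}$ limit.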
References
This is my own development based on experience, but obviously influenced by classical texts. Good places to look to learn more are
- G. A. F. Seber and A. J. Lee (2003), Linear Regression Analysis, 2nd ed., Wiley.
- E. Lehmann and J. Romano (2005), Testing Statistical Hypotheses, 3rd ed., Springer. Section 14.3 in particular.
- D. R. Cox and D. V. Hinkley (1979), Theoretical Statistics, Chapman and Hall.