
When testing deviations from an expected categorical distribution, the statistic $\Sigma\frac{(Obs-Exp)^2}{Exp}$ is commonly used, and it is said that this statistic is distributed as $\chi^2$.

From the identity $\chi^2 = \Sigma Z_i^2$, where the $Z_i$ are independent standard normal random variables, it appears (if I'm not mistaken) that $\frac{(Obs-Exp)}{\sqrt{Exp}}$ is assumed to follow a standard normal distribution.
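(For what it's worth, the identity itself is easy to verify by simulation; a sketch in Python with NumPy, where the choice of $k$ and the replication count are arbitrary:)

```python
import numpy as np

rng = np.random.default_rng(3)
k, reps = 3, 100_000
Z = rng.standard_normal((reps, k))     # k independent standard normals
S = (Z ** 2).sum(axis=1)               # their sum of squares

# ChiSq(k) has mean k and variance 2k
print(S.mean(), S.var())               # near 3 and 6
```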

To restate the question in the title: how is the statistic $\frac{(Obs-Exp)}{\sqrt{Exp}}$ distributed as standard normal?

Macond

3 Answers


Let's consider a simple (typical) case -- the chi-squared goodness of fit test with $k$ categories.

You're correct that the sum of squares of $k$ independent standard normals will have a $\chi^2_k$ distribution.

It's also the case that if $D_i=\frac{O_i-E_i}{\sqrt{E_i}}$ then $\sum_{i=1}^k D_i^2$ will have (approximately) a chi-squared distribution.

However, the $D_i$ are neither independent nor identically distributed (nor are they actually normal; the approximation is asymptotic, but in finite samples the $D_i$ are discrete -- as is the chi-squared statistic based on them).

Note, for example, that $\sum_{i=1}^k D_i^2$ only has $k-1$ degrees of freedom, even though it is the sum of $k$ terms of the form $D_i^2$. This dependence is a consequence of fixing the margin (and has the consequence that $\sum_i E_i=\sum_i O_i$). If one of the $O_i-E_i$ values is positive, the sum of the others must be negative. The negative dependence corresponds to a loss of one degree of freedom from the asymptotic chi-square approximation (for a fully specified distribution).
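Both the $k-1$ degrees of freedom and the negative dependence among the $D_i$ are easy to see in a simulation. Here is a sketch in Python with NumPy (the number of categories, sample size, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, reps = 5, 1000, 20_000           # categories, sample size, simulations
p = np.full(k, 1 / k)                  # fully specified null distribution
E = n * p                              # expected counts

O = rng.multinomial(n, p, size=reps)   # observed counts, shape (reps, k)
D = (O - E) / np.sqrt(E)
Q = (D ** 2).sum(axis=1)               # the chi-squared statistic

# Q behaves like ChiSq(k-1), not ChiSq(k): mean k-1 = 4, variance 2(k-1) = 8
print(Q.mean(), Q.var())

# The D_i are negatively correlated (fixing the margin forces the
# deviations to offset one another); theoretically -p/(1-p) = -0.25 here
print(np.corrcoef(D[:, 0], D[:, 1])[0, 1])
```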

Glen_b

The rationale for using a test statistic of the form $Q = \sum_i\frac{(X_i-E_i)^2}{E_i}$ is that the counts $X_i$ in level $i$ of a categorical variable (univariate or multivariate) are roughly Poisson-distributed with rate $\lambda_i$. Then $E(X_i) = Var(X_i) \approx \lambda_i,$ so that $$Z_i = (X_i - E_i)/\sqrt{E_i} \stackrel{aprx}{\sim} \mathsf{Norm}(0,1)$$ for sufficiently large $\lambda_i$ (estimated by $E_i$), and $Z_i^2$ is approximately $\mathsf{Chisq}(\nu = 1).$
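Under this Poisson reasoning, the standardized counts should look roughly standard normal once the rate is large. A quick check with NumPy (the rate and replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, reps = 50, 100_000                # a "sufficiently large" rate
X = rng.poisson(lam, size=reps)
Z = (X - lam) / np.sqrt(lam)           # standardized Poisson counts

# Approximately standard normal: mean near 0, variance near 1,
# and roughly 95% of draws inside +/- 1.96 (slightly less, since X is discrete)
print(Z.mean(), Z.var(), np.mean(np.abs(Z) < 1.96))
```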

If the contributions $Z_i^2$ to the "chi-squared" statistic $Q$ were independent, then $Q$ would be distributed approximately $\mathsf{Chisq}(\nu = K),$ where $K$ is the number of levels of the categorical variable. However, the $E_i$ are typically estimated from one or more totals, imposing linear constraints and reducing the number of degrees of freedom.

There are several approximations involved. Perhaps the most important ones are that $X_i$ are Poisson, that $(X_i - \lambda_i)/\sqrt{\lambda_i}$ are nearly normal, that $E_i$ are good estimates of $\lambda_i,$ and that the linear restrictions are correctly accounted for in the degrees of freedom ascribed to $Q.$

Note: (1) The likelihood ratio test (involving products and logs) is more accurate and the theory leading to an approximate chi-squared distribution rests on firmer ground. But the test statistic is messier to compute, the procedure is more difficult to visualize, and there are no 'Pearson Residuals' to ponder for post hoc analysis if $Q$ is large enough to warrant rejection of the null hypothesis on which the $\lambda_i$ are based. (2) See also.

BruceET

It converges to the normal, but not the standard normal: asymptotically, that statistic converges in distribution to the chi-squared distribution as $n \rightarrow \infty$. Now, the chi-squared distribution is a type of gamma distribution, and it converges to the normal distribution as $DF \rightarrow \infty$, which happens as $n \rightarrow \infty$. So the statistic you are looking at converges to the chi-squared distribution, which in turn converges to the normal distribution. For large $n$ there is very little difference between these two.

However, the statistic does not converge in distribution to the standard normal distribution. Its mean and variance can be determined via the chi-squared approximation. The mean is strictly positive, since the statistic is a sum of (stochastic) squares.
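A simulation illustrates both points: with many degrees of freedom the chi-squared statistic looks normal, but with mean $DF$ and variance $2\,DF$ rather than $0$ and $1$ (the degrees of freedom and replication count below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
df, reps = 100, 50_000
Q = rng.chisquare(df, size=reps)       # chi-squared, many degrees of freedom

# Approximately Normal(df, 2*df) -- normal, but not standard normal
print(Q.mean(), Q.var())               # near 100 and 200

# Only after standardizing do we get something close to standard normal
Z = (Q - df) / np.sqrt(2 * df)
print(Z.mean(), Z.var())               # near 0 and 1
```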

Ben