
I’m trying to informally derive the chi-squared test statistic using a combination of basic geometry and algebra. I can obtain a system of equations that contains Karl Pearson’s chi-squared test statistic, but I need help showing, from my equations, that the test statistic equals $\chi^2$.

My approach:

I have a 3-sided die.

I roll the die a number of times and record the frequency of each face.

This system has 2 degrees of freedom (we only need to know the frequencies of any 2 faces to infer the frequency of the remaining one).

Therefore, we can describe the distance, chi, between the observed and expected values as a formula with 2 dimensions via the Pythagorean theorem: $$ \chi^2 \quad=\quad Z^2 \quad+\quad Z_{prime}^2 $$ ...where Z is the (standardized) difference between the observed and expected values for any one face, and Z_prime is the remaining side of our triangle in 2D space (Z_prime also implies a transformation of the distribution of the 2nd face from a joint distribution into an independent distribution, making the combined distribution circular).

Note that: $$ Z^2 \quad=\quad p.Z^2\quad+\quad(1-p).Z^2 $$

...and similarly: $$ Z_{prime}^2 \quad=\quad p.Z_{prime}^2\quad+\quad(1-p).Z_{prime}^2 $$

...therefore: $$ \chi^2 \quad= p.Z^2\quad+\quad(1-p).Z^2 + \quad p.Z_{prime}^2\quad+\quad(1-p).Z_{prime}^2 $$ So for all 3 faces (A, B, C) we have the following system of equations: $$ \chi^2 \quad= p_{A}.Z_{A}^2\quad+\quad(1-p_{A}).Z_{A}^2 + \quad p_{A}.Z_{A.prime}^2\quad+\quad(1-p_{A}).Z_{A.prime}^2 \\ \chi^2 \quad= p_{B}.Z_{B}^2\quad+\quad(1-p_{B}).Z_{B}^2 + \quad p_{B}.Z_{B.prime}^2\quad+\quad(1-p_{B}).Z_{B.prime}^2 \\ \chi^2 \quad= p_{C}.Z_{C}^2\quad+\quad(1-p_{C}).Z_{C}^2 + \quad p_{C}.Z_{C.prime}^2\quad+\quad(1-p_{C}).Z_{C.prime}^2 \\ $$

[Equations 1-3]

Now, since: $$ Z = \frac{(O-E)}{\sigma} $$

…and: $$ Z^2 = \frac{(O-E)^2}{np(1-p)} $$ …then, if we multiply Z^2 by (1-p) we get: $$ (1-p).\frac{(O-E)^2}{np(1-p)} = \frac{(O-E)^2}{np} = \frac{(O-E)^2}{E} $$ ...since $E = np$. Therefore, the sum of the 2nd column from Equations 1-3 is identical to Pearson's chi-squared test statistic for 2 degrees of freedom, i.e.: $$ (1-p_{A}).Z_{A}^2\quad +\quad (1-p_{B}).Z_{B}^2 \quad+ \quad(1-p_{C}).Z_{C}^2 \\= \frac{(O_{A}-E_{A})^2}{E_{A}}\quad+\quad\frac{(O_{B}-E_{B})^2}{E_{B}}\quad+\quad\frac{(O_{C}-E_{C})^2}{E_{C}} $$
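This identity is easy to verify numerically. Below is a small sketch of my own (the counts and probabilities are made up) checking that $\sum_i (1-p_i)Z_i^2$ reproduces Pearson's statistic $\sum_i (O_i-E_i)^2/E_i$ for a 3-sided die:

```python
# Sketch with made-up numbers: verify that sum_i (1 - p_i) * Z_i^2 equals
# Pearson's statistic sum_i (O_i - E_i)^2 / E_i for a 3-sided die.
n = 60
p = [0.5, 0.3, 0.2]               # hypothesized face probabilities
O = [25, 21, 14]                  # observed counts (sum to n)
E = [n * pi for pi in p]          # expected counts, E_i = n * p_i

# Z_i^2 = (O_i - E_i)^2 / (n * p_i * (1 - p_i))
Z2 = [(o - e) ** 2 / (n * pi * (1 - pi)) for o, e, pi in zip(O, E, p)]

lhs = sum((1 - pi) * z2 for pi, z2 in zip(p, Z2))
pearson = sum((o - e) ** 2 / e for o, e in zip(O, E))
assert abs(lhs - pearson) < 1e-12  # the two expressions agree exactly
```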

My question is: how can I demonstrate from Equations 1-3 that: $$ \chi^2\quad=\quad(1-p_{A}).Z_{A}^2\quad +\quad (1-p_{B}).Z_{B}^2 \quad+ \quad(1-p_{C}).Z_{C}^2 $$ [Equation 4]

By the way, it's already straightforward to demonstrate that: $$ \chi^2 = \\ p_{A}.Z_{A}^2\quad+\quad p_{A}.Z_{A.prime}^2 + \\ p_{B}.Z_{B}^2\quad+\quad p_{B}.Z_{B.prime}^2 + \\ p_{C}.Z_{C}^2\quad+\quad p_{C}.Z_{C.prime}^2 \\ $$ ...since the sum of probabilities = 1.

Similarly, we can show that: $$ 2\chi^2 = \\ (1-p_{A}).Z_{A}^2\quad+\quad (1-p_{A}).Z_{A.prime}^2 + \\ (1-p_{B}).Z_{B}^2\quad+\quad (1-p_{B}).Z_{B.prime}^2 + \\ (1-p_{C}).Z_{C}^2\quad+\quad (1-p_{C}).Z_{C.prime}^2 \\ $$ ...since the sum of (1-probabilities) = 2.

But I'm not sure that these identities help me arrive at Equation 4.

Rez99

1 Answer


The $\chi^2$-statistic can be derived in (at least) three ways. The approach explained below, which considers $n-1$ dependent binomially distributed variables, is the one Pearson took in 1900; here it is worked out for the case of the 3-sided die.


Pearson considered multivariate normal distributions in terms of the Mahalanobis distance $\chi$:

$$f(x) \propto \exp \left(- \frac{\chi^2}{2} \right) \qquad \text{where $\chi^2 = (x-\mu)^t \Sigma^{-1} (x-\mu)$}$$

and when applying this to a multivariate normal approximation of the multinomial distribution, only $n-1$ variables are considered instead of $n$, because the constraint $\sum_{i=1}^3 O_i = N$ confines the variables to a plane. We can use the joint distribution of the two variables $O_1$ and $O_2$, which are a sufficient statistic, with $O_3 = N - O_1 - O_2$.

The covariance matrix of the multinomial distribution, with 3 levels and probabilities $p_1,p_2,p_3$ is

$$\Sigma_{O_1,O_2,O_3} = N \begin{bmatrix} p_1(1- p_1) & -p_1p_2 & -p_1p_3 \\ - p_2p_1 & p_2(1-p_2) & -p_2p_3 \\ - p_3p_1 & -p_3p_2 & p_3(1-p_3) \end{bmatrix}$$
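The structure of this matrix, $\Sigma = N(\operatorname{diag}(p) - pp^T)$ with $\operatorname{Cov}(O_i,O_j) = N(p_i\delta_{ij} - p_ip_j)$, can be checked in a few lines (my own sketch, arbitrary numbers): each row sums to zero, reflecting the constraint $O_1+O_2+O_3=N$, which is why the full $3\times 3$ matrix is singular.

```python
# Sketch with arbitrary numbers: the multinomial covariance matrix
# N * (diag(p) - p p^T) has rows summing to zero, since O_1 + O_2 + O_3 = N.
import numpy as np

N = 100
p = np.array([0.5, 0.3, 0.2])
Sigma = N * (np.diag(p) - np.outer(p, p))   # Cov(O_i, O_j) = N(p_i d_ij - p_i p_j)

assert np.allclose(Sigma, Sigma.T)          # symmetric
assert np.allclose(Sigma.sum(axis=1), 0.0)  # singular: rows sum to zero
```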

The covariance matrix of the standardized variables $Z_i = \frac{O_i-E_i}{\sqrt{Np_i(1-p_i)}}$ (where $E_i = Np_i$) is

$$\Sigma_{Z_1,Z_2,Z_3} = \begin{bmatrix} 1 & -\sqrt{r_1r_2} & -\sqrt{r_1r_3} \\ -\sqrt{r_2r_1} & 1 & -\sqrt{r_2r_3} \\ -\sqrt{r_3r_1} & -\sqrt{r_3r_2} & 1 \end{bmatrix}$$

where we define $q_i = 1- p_i$ and $r_i = p_i/q_i$.

Now, to describe the density in terms of $\chi^2$ we use only two of the variables, say $Z_1$ and $Z_2$; their covariance matrix is:

$$\Sigma_{Z_1,Z_2} = \begin{bmatrix} 1 & -\sqrt{r_1r_2} \\ -\sqrt{r_2r_1} & 1 \\ \end{bmatrix}$$

whose inverse is

$$\Sigma_{Z_1,Z_2}^{-1} = \frac{1}{1-r_1r_2} \begin{bmatrix} 1 & \sqrt{r_1r_2} \\ \sqrt{r_2r_1} & 1 \\ \end{bmatrix}$$
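As a quick numerical check (my own sketch, with arbitrary probabilities), the claimed inverse can be multiplied against $\Sigma_{Z_1,Z_2}$:

```python
# Sketch: verify the 2x2 inverse formula above for arbitrary probabilities.
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = 1 - p
r = p / q                          # r_i = p_i / q_i

a = np.sqrt(r[0] * r[1])
Sigma12 = np.array([[1.0, -a], [-a, 1.0]])
Sigma12_inv = np.array([[1.0, a], [a, 1.0]]) / (1 - r[0] * r[1])

assert np.allclose(Sigma12 @ Sigma12_inv, np.eye(2))
```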

Note that for the variables $Z_i$ the constraint is

$$\sqrt{p_1q_1}Z_1 + \sqrt{p_2q_2}Z_2 + \sqrt{p_3q_3}Z_3 = 0$$

and

$$\frac{\sqrt{p_1q_1}}{ \sqrt{p_3q_3}}Z_1 + \frac{\sqrt{p_2q_2}}{ \sqrt{p_3q_3}}Z_2 = -Z_3$$

And, using $1-r_1r_2 = \frac{q_1q_2-p_1p_2}{q_1q_2} = \frac{p_3}{q_1q_2}$ in the second step and the constraint above in the fourth, $\chi^2$ is equal to

$$\begin{aligned} \begin{bmatrix} Z_1 & Z_2 \end{bmatrix} \Sigma_{Z_1,Z_2}^{-1} \begin{bmatrix} Z_1 \\ Z_2 \end{bmatrix} &= \frac{1}{1-r_1r_2} Z_1^2 + \frac{1}{1-r_1r_2} Z_2^2 + 2 \frac{\sqrt{r_1r_2}}{1-r_1r_2} Z_1Z_2 \\ &= \left(q_1 + \frac{p_1q_1}{p_3}\right) Z_1^2 + \left(q_2+\frac{p_2q_2}{p_3}\right) Z_2^2 + 2 \frac{\sqrt{p_1p_2q_1q_2}}{p_3} Z_1Z_2 \\ &= q_1 Z_1^2 + q_2 Z_2^2 + \left(\sqrt{\frac{p_1q_1}{p_3}} Z_1 + \sqrt{ \frac{p_2q_2}{p_3}}Z_2\right)^2 \\ &= q_1 Z_1^2 + q_2 Z_2^2 + \left(-\sqrt{q_3}Z_3\right)^2 \\ &= q_1 Z_1^2 + q_2 Z_2^2 + q_3 Z_3^2 \end{aligned}$$
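The whole chain can also be verified numerically (my own sketch, with made-up counts): the quadratic form in $(Z_1, Z_2)$ equals $q_1Z_1^2+q_2Z_2^2+q_3Z_3^2$, which in turn equals Pearson's statistic $\sum_i (O_i-E_i)^2/E_i$.

```python
# Sketch with made-up counts: the quadratic form in (Z_1, Z_2) equals
# q_1 Z_1^2 + q_2 Z_2^2 + q_3 Z_3^2, which equals Pearson's statistic.
import numpy as np

n = 90
p = np.array([0.5, 0.3, 0.2])
q = 1 - p
r = p / q

O = np.array([40, 31, 19])            # observed counts (sum to n)
E = n * p                             # expected counts
Z = (O - E) / np.sqrt(n * p * q)      # standardized deviations

a = np.sqrt(r[0] * r[1])
Sinv = np.array([[1.0, a], [a, 1.0]]) / (1 - r[0] * r[1])
quad = Z[:2] @ Sinv @ Z[:2]           # [Z1, Z2] Sigma^{-1} [Z1, Z2]^T

sym = np.sum(q * Z ** 2)              # q_1 Z_1^2 + q_2 Z_2^2 + q_3 Z_3^2
pearson = np.sum((O - E) ** 2 / E)

assert np.isclose(quad, sym)
assert np.isclose(sym, pearson)
```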

  • Thank you for this very intuitive explanation. I appreciate the effort to keep the language as accessible as possible - this answer has really helped my understanding. – Rez99 Jun 17 '23 at 18:23
  • By way of clarification you mention “Pearson considered multivariate normal distributions in terms of the Mahalanobis distance χ” and that helped with my understanding of the derivation. But according to Wikipedia Pearson introduced the Chi square statistic in 1900, some 36 years before the Mahalanobis distance was formally introduced. So how did Pearson consider the Mahalanobis distance? – Rez99 Jun 17 '23 at 18:28
  • @Rez99 He used the same thing as what we now call the Mahalanobis distance. He considered the multivariate normal distribution as an elongated spherical shape and considered shells of constant probability density (i.e. where $\chi^2$ is constant). "Mahalanobis distance" is actually a bit of an abuse of terminology on my part; it refers more to a distance measure between a point and the center of a population (but you can view it in the context of a multivariate normal distribution as well). – Sextus Empiricus Jun 17 '23 at 20:57
  • That makes sense, thank you for the clarification! – Rez99 Jun 23 '23 at 21:24
  • I published a post here on the chi-square test and pointed folks to this answer as a very clear derivation of the formula. Thanks again. – Rez99 Jun 28 '23 at 20:04