
Background

I have the following contingency table

+-------+-------+-------+-------+
|       |   X   | Not X | Total |
+-------+-------+-------+-------+
|   Y   |  60   |  20   |  80   |
+-------+-------+-------+-------+
| notY  |  15   |  40   |  55   |
+-------+-------+-------+-------+
| Total |  75   |  60   |  135  |
+-------+-------+-------+-------+

where I've included the row and column marginals as well as the table total.

I would like to know whether the rows and columns are independent. I could just use Pearson's Chi-Square Test for Independence to get the p-value, which in R is

our_table <- matrix(c(60,20,15,40),nrow=2,byrow=T)
pvalue <- chisq.test(our_table, correct=F)$p.value

This gives a p-value of $4.17\times10^{-8}$, which seems reasonable since this table does not look independent.

My Goal

As an exercise, I attempted to calculate the exact p-value for my contingency table by looking at every possible contingency table constrained to have 135 observations and seeing how many are as extreme as or more extreme than mine. Note: I don't fix the row or column marginals, just as in the Chi-Squared Test for Independence.

Pearson's Chi-Square statistic is just an approximation derived from a (generalized) likelihood ratio test[1], so I will be working with the likelihood ratio test directly.

Likelihood Ratio

The marginal probabilities for the rows and columns are $$\pi_{i\cdot}=\sum_{j=1}^2\pi_{ij}$$ and $$\pi_{\cdot j}=\sum_{i=1}^2\pi_{ij}$$ respectively, where $\pi_{ij}$ is the probability of an observation being in the i-th row and j-th column.

The null hypothesis is that the rows and columns are independent, i.e. $$\text{H}_0:\;\; \pi_{ij}=\pi_{i\cdot}\pi_{\cdot j},\;\; \forall i,j\;\;,\tag{1}$$ and under the null hypothesis, the MLEs for the $\pi_{ij}$ are given by[2] $$\hat{\pi}_{ij}=\hat{\pi}_{i\cdot}\hat{\pi}_{\cdot j} =\frac{n_{i\cdot}}{n}\frac{n_{\cdot j}}{n} \tag{2}$$ where $n=135$ is the total number of observations, $n_{i\cdot}=(80,55)$ are the row marginal counts and $n_{\cdot j}=(75,60)$ are the column marginal counts.
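
As a quick sanity check of Eq (2), here is a minimal Python sketch (not part of my script below; the variable names are just illustrative):

import numpy as np

table = np.array([[60, 20], [15, 40]])
n = table.sum()                      # 135
row_marg = table.sum(axis=1)         # n_{i.} = [80, 55]
col_marg = table.sum(axis=0)         # n_{.j} = [75, 60]

# MLEs of the cell probabilities under H0 (Eq 2): outer product of the
# estimated row and column marginal probabilities
pi_hat = np.outer(row_marg / n, col_marg / n)
print(pi_hat)        # e.g. pi_hat[0,0] = (80/135)*(75/135), about 0.329
print(pi_hat * n)    # the familiar expected counts, e.g. 80*75/135, about 44.4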

Under the alternative hypothesis, the cell probabilities are free, but must all sum to 1. Under this hypothesis, the MLEs for the cell probabilities are: $$\tilde{\pi}_{ij}=\frac{n_{ij}}{n}$$

The likelihood ratio is then given by $$\Lambda = \frac{\frac{n!}{n_{11}!n_{12}!n_{21}!n_{22}!} \hat{\pi}_{11}^{n_{11}} \hat{\pi}_{12}^{n_{12}} \hat{\pi}_{21}^{n_{21}} \hat{\pi}_{22}^{n_{22}}} {\frac{n!}{n_{11}!n_{12}!n_{21}!n_{22}!} \tilde{\pi}_{11}^{n_{11}} \tilde{\pi}_{12}^{n_{12}} \tilde{\pi}_{21}^{n_{21}} \tilde{\pi}_{22}^{n_{22}}} = \prod_{i=1}^2\prod_{j=1}^2 \left(\frac{\hat{\pi}_{ij}}{\tilde{\pi}_{ij}}\right)^{n_{ij}} \;\;.$$ Simplifying and taking the logarithm, we have $$\ln \Lambda = \sum_{i=1}^2\sum_{j=1}^2 n_{ij} \ln\left[\frac{n_{i\cdot}n_{\cdot j}}{n_{ij}n}\right]\;\;,\tag{3}$$ and for our table, $\ln\Lambda = -15.53$.
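
As a quick numerical check of Eq (3), the short snippet below (again not part of my script; it uses scipy.stats.chi2_contingency purely as a cross-check, since the G-test statistic it reports equals $-2\ln\Lambda$) reproduces this value:

import numpy as np
from scipy.stats import chi2_contingency  # used only as a cross-check

table = np.array([[60, 20], [15, 40]])
n = table.sum()
row_marg, col_marg = table.sum(axis=1), table.sum(axis=0)

# Eq (3): ln Lambda = sum_ij n_ij * ln( n_i. * n_.j / (n_ij * n) )
ln_lambda = sum(table[i, j] * np.log(row_marg[i] * col_marg[j] / (n * table[i, j]))
                for i in range(2) for j in range(2))
print(ln_lambda)     # about -15.53

# The G-test statistic equals -2 ln Lambda, so this should match
g_stat = chi2_contingency(table, correction=False, lambda_="log-likelihood")[0]
print(-g_stat / 2)   # about -15.53 again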

p-value

(edited)

I have two nuisance parameters, $\pi_{1 \cdot}$ and $\pi_{\cdot 1}$. According to @Scortchi, "the p-value is defined as the maximum of the sum of the probabilities of the more extreme tables over all values of both these nuisance parameters". In other words, $$\text{p-value}= \;\max_{\pi_{1 \cdot}, \pi_{\cdot 1}} \;P\left(\ln \Lambda \le -15.53 \mid \pi_{1 \cdot}, \pi_{\cdot 1} \right)$$ where the number $-15.53$ was calculated above and $\ln \Lambda$ is given by Eq (3).

This can be rewritten as $$\text{p-value}=\,\max_{\pi_{1 \cdot}, \pi_{\cdot 1}}\;\left\{ \sum_{n_{11},n_{12},n_{21},n_{22}} \;\; \frac{n!}{n_{11}!n_{12}!n_{21}!n_{22}!} \pi_{11}^{n_{11}} \pi_{12}^{n_{12}} \pi_{21}^{n_{21}} \pi_{22}^{n_{22}} \right\}$$ where the summation runs over all possible contingency tables with $\ln \Lambda \le -15.53$, and where the $\pi_{ij}$ are given by Eq (1).

The Python script below will calculate the above p-value, but only if you tell it what the nuisance parameters are.

Correct me if I'm wrong, but I don't think there's a way to analytically get the nuisance parameters. We could use a computer to numerically maximize the p-value. I wonder if that would be a convex optimization...

I didn't implement the optimization, but I tested a couple of different choices of nuisance parameters; the highest p-value I got was at $\pi_{1 \cdot}=0.8, \pi_{\cdot 1}=0.5$, with p-value $= 4.72\times10^{-8}$, which is in good agreement with our original p-value of $4.17\times10^{-8}$. (A rough grid-search sketch is included after the script below.)


Script

(edited)

import numpy as np
from math import log, factorial, exp

our_table = np.array([[60,20],[15,40]])

# I found these particular choices of the 2 nuisance parameters to be good at
# maximizing the p-value (implementing a proper numerical optimization would
# be better, of course)
row_probs = np.array([0.8,0.2])
col_probs = np.array([0.5,0.5])
cell_probs = np.outer(row_probs, col_probs)

num_rows = our_table.shape[0]
num_cols = our_table.shape[1]
n = our_table.sum(axis=(0,1))


def binomial(N,k):
    '''N choose k'''
    f = factorial
    return f(N) / (f(k) * f(N-k))


def log_likelihood_ratio(table):
    '''Find log likelihood ratio under the setup of a chi-squared test for
    independence

    table:  array representing a contingency table
    '''
    row_marginals = table.sum(axis=1)
    col_marginals = table.sum(axis=0)
    n = table.sum(axis=(0,1)) # (same as n in the enclosing scope)

    # Now we take the double-summation in the formula for log likelihood ratio
    log_lik_ratio=0
    for row in range(num_rows):
        for col in range(num_cols):
            # if table[row,col] is zero, this term is understood to be zero
            if table[row,col] != 0:
                log_lik_ratio += table[row,col] * log((row_marginals[row] * col_marginals[col]) \
                                / (n * table[row,col]))

    return log_lik_ratio


def prob_of_table(table, cell_probs):
    '''Given the particular settings of our nuisance parameters, calculate
    the probability of getting a given table, which is just given by a
    multinomial distribution'''
    n = table.sum(axis=(0,1)) # (same as n in the enclosing scope)
    num_rows, num_cols = table.shape

    # Now build up the log of the multinomial probability one cell at a time
    # I use logarithms to make this computationally feasible
    log_prob = log(factorial(n))
    for row in range(num_rows):
        for col in range(num_cols):
            log_prob += table[row,col] * log(cell_probs[row,col]) \
                    - log(factorial(table[row,col]))

    return exp(log_prob)


our_log_lik_ratio = log_likelihood_ratio(our_table)

# Sum the probabilities of all contingency tables whose log likelihood ratio is
# equal to or smaller than ours; by "probability" we mean the probability of a
# table under the given nuisance parameters
cumul_prob = 0.0

# 3 nested for loops for all possible ways to split points into cells (3 degrees of freedom here)
for count11 in range(0, n+1):
    for count12 in range(0, n-count11+1):
        for count21 in range(0, n-count11-count12+1):
            count22 = n - count11 - count12 - count21
            cont_table = np.array([[count11, count12], [count21, count22]])
            # current log likelihood ratio for this candidate table
            cur_log_lik_ratio = log_likelihood_ratio(cont_table)
            if cur_log_lik_ratio <= our_log_lik_ratio:
                cumul_prob += prob_of_table(cont_table, cell_probs)


p_value = cumul_prob
print('For the table, ', our_table, ', the p-value is: ', p_value)
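
And here is the rough grid-search sketch mentioned earlier for maximizing the p-value over the two nuisance parameters. It is only a sketch (the grid resolution is an arbitrary choice, and a proper optimizer would do better); it is self-contained, re-enumerating the tables rather than reusing the loop above, and it relies on the fact that under $\text{H}_0$ a table's probability depends on the nuisance parameters only through its marginal totals:

import numpy as np
from math import lgamma, log

our_table = np.array([[60, 20], [15, 40]])
n = int(our_table.sum())

def log_lik_ratio(t):
    # ln Lambda from Eq (3); zero cells contribute zero
    rm, cm = t.sum(axis=1), t.sum(axis=0)
    return sum(t[i, j] * log(rm[i] * cm[j] / (n * t[i, j]))
               for i in range(2) for j in range(2) if t[i, j] != 0)

cutoff = log_lik_ratio(our_table)   # about -15.53

def log_multinom_coef(counts):
    # log of n! / (n11! n12! n21! n22!), via lgamma to avoid huge integers
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)

# Enumerate every table with n observations that is at least as extreme as ours,
# keeping only what is needed to evaluate its probability for any nuisance
# parameters: the log multinomial coefficient and the four marginal totals
extreme = []
for n11 in range(n + 1):
    for n12 in range(n - n11 + 1):
        for n21 in range(n - n11 - n12 + 1):
            n22 = n - n11 - n12 - n21
            t = np.array([[n11, n12], [n21, n22]])
            if log_lik_ratio(t) <= cutoff:
                extreme.append((log_multinom_coef([n11, n12, n21, n22]),
                                n11 + n12, n21 + n22, n11 + n21, n12 + n22))
extreme = np.array(extreme)

def p_value(pi_row, pi_col):
    # Sum of P(table | pi_row, pi_col) over the extreme tables; under H0 the
    # cell probabilities factor, so only the margins of each table matter
    log_p = (extreme[:, 0]
             + extreme[:, 1] * np.log(pi_row) + extreme[:, 2] * np.log(1 - pi_row)
             + extreme[:, 3] * np.log(pi_col) + extreme[:, 4] * np.log(1 - pi_col))
    return np.exp(log_p).sum()

# Coarse grid search over the two nuisance parameters
grid = np.linspace(0.05, 0.95, 19)
best = max((p_value(r, c), r, c) for r in grid for c in grid)
print('max p-value %.3e at pi_row = %.2f, pi_col = %.2f' % best)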

[1]: See Mathematical Statistics and Data Analysis, 3rd ed. by John Rice, Section 13.4.

[2]: Rice, p. 522.

Garrett
  • You're counting the more extreme tables rather than summing their probabilities under the null – Scortchi - Reinstate Monica Sep 25 '14 at 09:42
  • Note also the null is composite. – Scortchi - Reinstate Monica Sep 25 '14 at 13:43
  • The issue of calculating the "exact p" value is not so much an issue of how the test statistic is derived, but finding the appropriate small-sample distribution of the test-statistic under the null hypothesis. This is usually surprisingly negligible. – AdamO Sep 25 '14 at 20:13
  • Secondly, the likelihood ratio test is certainly the "uniformly most powerful" test, but the Pearson Chi-Square test is a score test which is asymptotically equivalent to the LRT. That means that they usually give very nearly identical inference. – AdamO Sep 25 '14 at 20:14
  • Lastly, the Fisher's Exact test is a nice test for $2 \times 2$ contingency tables that has an exact $p$-value (although the test is not of correct size). I wonder if you have encountered that in the course of your deliberation over this topic and what you thought about the major differences between it and other (model based) tests. – AdamO Sep 25 '14 at 20:16
  • Note that R will get this probability for you via simulation -- chisq.test(our_table,simulate.p.value=TRUE,B=100000000) run 3 times gave 6, 6 and 8 $\times 10^{-8}$ as p-values, suggesting a p-value something on the order of $7\times 10^{-8}$ for a Pearson chi-square. – Glen_b Sep 26 '14 at 03:31
  • @Scortchi, right, I see I was mistaken in assuming all tables have equal probability. So under $\text{H}_0$, how would I calculate the probability of getting a given table? Having a composite null hypothesis is throwing me off. – Garrett Sep 26 '14 at 04:25
  • @AdamO, thanks. My motivation for working through this exercise was to prove to myself that the Chi-Squared Tests for Independence and the Chi-Squared Test for Homogeneity gave different p-values (in the non-asymptotic case where the sample size is small). See here. That's why I'm not using Fisher's Exact Test or Chi-Squared tests. – Garrett Sep 26 '14 at 05:43
  • @Garret: In the case of a fixed total but no fixed margins you have two nuisance parameters to consider. So the p-value is defined as the maximum of the sum of the probabilities of the more extreme tables over all values of both these nuisance parameters. – Scortchi - Reinstate Monica Sep 26 '14 at 07:18
  • @Glen_b, good idea! I thought I might gain some insight into the probabilities for contingency tables by looking at the source for chisq.test(our_table,simulate.p.value=T). But it turns out that to run the MC simulation, it generates tables with fixed row and column marginals (given by your particular marginals), generating exactly the same tables as fisher.test(our_table, simulate.p.value=T). (the function generating the tables for both is rcont2 which I found in R-3.0.1/src/library/stats/src/rcont.c) I don't know what to make of that... – Garrett Sep 26 '14 at 07:20
  • @Garret: Perhaps it'd be helpful to look at code from the Exact package, which is for unconditional tests in 2x2 tables. – Scortchi - Reinstate Monica Sep 26 '14 at 07:27
  • Garrett - now I've seen what you're trying to achieve - note that both the usual chi squared tests for an $m\times n$ table (homogeneity and independence) condition on both margins. – Glen_b Sep 26 '14 at 08:22
  • @Glen_b: Not sure what you mean: if "the usual chi-squared tests" involve the asymptotic approximation of Pearson's statistic to the chi-squared distribution, they work whether you condition on one or both margins, or only the total. Yule's correction is supposed to make the result closer to that of Fisher's exact test for smaller samples. – Scortchi - Reinstate Monica Sep 26 '14 at 08:27
  • @Scortchi If you only condition on 1 margin you'll have 2df, and if you only condition on the total, you'll have 3df, unless I've gotten muddled on this (which is definitely possible) – Glen_b Sep 26 '14 at 08:32
  • @Glen_b, are you referring to the fact that the Pearson's Chi-Squared statistic approximately follows a $\chi ^2$ distribution with df=1? If so, my understanding was that the test for independence only fixes the total. The df=1 comes from $\text{H}_0$ having 2 df, $\text{H}_A$ having 3 df, and so the $\chi ^2$ distribution has 3-2=1 df. – Garrett Sep 26 '14 at 08:52
  • @Glen_b: If the expected frequencies are pre-specified (the G.O.F. test) there are indeed 3 d.f.; if they're estimated from the marginal totals (the test of association) there's only one. Pearson originally used 3 df, but Fisher corrected him (& not by saying you should condition on marginal totals as with his exact test): Stigler (2008), "Karl Pearson’s theoretical errors & the advances they inspired", Statistical Science, 23, 2. – Scortchi - Reinstate Monica Sep 26 '14 at 14:00
  • @Scortchi, thanks! I've edited my question to incorporate your definition of the p-value. The problem seems to be solved now. Do you mind just taking a quick glance at the (edited) section on the p-value to make sure I interpreted what you said correctly? Also, I was wondering if there's an analytic way of determining the best choice of nuisance parameters. If this question is solved, could you copy and paste your last comment into an answer so I can mark the question as solved? – Garrett Sep 26 '14 at 22:25
  • @Scortchi -- you now seem to be making he point I was trying to make. – Glen_b Sep 26 '14 at 22:46
  • @Garrett: That's right. There's no analytic way to find where the candidate p-value attains its maximum over the two nuisance parameters. You can get rid of one by conditioning on either margin, or both by conditioning on both margins; or take a Bayesian approach & integrate them out of the posterior density for the odds ratio. – Scortchi - Reinstate Monica Sep 29 '14 at 10:17
  • @Glen_b: Don't think so: I'm saying that for tests of independence or homogeneity the exact distribution of Pearson's test statistic depends on the conditioning scheme (on the overall total, on one or other marginal totals, or on both marginal totals), but that asymptotically it tends to the chi-squared distribution with one degree of freedom under any of those conditioning schemes. For a goodness-of-fit test to a pre-specified multinomial distribution the test statistic is different (expectations are not derived from marginal totals) & asymptotically it tends to the chi-squared ... – Scortchi - Reinstate Monica Sep 29 '14 at 11:31
  • ... distribution with three degrees of freedom. You seem to be saying that, for tests of independence with multinomial sampling, if you only condition on the overall total, the distribution of Pearson's test statistic (with expected values derived from marginal totals rather than pre-specified) will asymptotically tend to a chi-squared distribution with three degrees of freedom. That was an error first made by Pearson himself. – Scortchi - Reinstate Monica Sep 29 '14 at 11:37
  • @Scortchi, thanks! It seems solved then, can you please turn your comment into an answer? – Garrett Sep 29 '14 at 19:27

1 Answer


First, rather than counting the more (or as) extreme tables, it's their probabilities under the null hypothesis that need to be summed. Second, the null hypothesis is composite, with $\pi_{1\cdot}$ and $\pi_{\cdot 1}$ as nuisance parameters: the p-value is defined conservatively as the maximum, over all values of both these nuisance parameters, of the sum of the probabilities of the more extreme tables. There's no analytic solution, but the approximation of both the likelihood ratio statistic and Pearson's score statistic to a chi-squared distribution with one degree of freedom is often close enough.

If you want an exact test then, even for a multinomial sampling scheme, conditioning on either the row or the column totals (and thus removing one nuisance parameter) can be justified/urged on the grounds that either is ancillary; conditioning on both (and thus removing both nuisance parameters) can be justified/urged on the grounds that together they're approximately ancillary. In a Bayesian analysis you have priors for the nuisance parameters and simply integrate them out to get the posterior density of the odds ratio. Berger & Boos devised some tests that carry out the maximization over a confidence set for the nuisance parameters, and still maintain the correct size.
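
(As an aside, not part of the original answer: the fully conditional route, conditioning on both margins, is Fisher's exact test, which in Python is available as scipy.stats.fisher_exact.)

from scipy.stats import fisher_exact

# Conditioning on both margins removes both nuisance parameters: the (1,1)
# count then follows a hypergeometric distribution, which is exactly what
# Fisher's exact test sums over.
odds_ratio, p = fisher_exact([[60, 20], [15, 40]], alternative='two-sided')
print(p)   # fully conditional exact p-value for the table in the question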

The R code below simulates the distribution of Pearson's test statistic under the different conditioning schemes:

# specify marginal totals
n1. = 80
n2. = 55
n.1 = 75
n.2 = n1. + n2. - n.1
n.. = n1. + n2.

# specify nuisance parameters
pi1. = 0.6
pi.1 = 0.5

# specify no. simulations to perform throughout
no.sims <- 100000

# define function to calculate Pearson's test statistic
pearson.test.stat <- function(x){chisq.test(matrix(x, nrow=2), correct=F)$statistic}

# simulate conditioning on overall total
multinom.obs <- rmultinom(no.sims,n..,c(pi.1*pi1., (1-pi.1)*pi1., pi.1*(1-pi1.), (1-pi.1)*(1-pi1.)))
apply(multinom.obs , pearson.test.stat, MAR=2) -> multinom.sim.stats

# simulate conditioning on row totals
matrix(NA, 4, no.sims) -> binom.row.obs
binom.row.obs[1, ] <- rbinom(no.sims,n1., pi.1)
binom.row.obs[2, ] <- n1. - binom.row.obs[1, ]
binom.row.obs[3, ] <- rbinom(no.sims,n2., pi.1)
binom.row.obs[4, ] <- n2. - binom.row.obs[3, ]
apply(binom.row.obs, pearson.test.stat, MAR=2) -> binom.row.sim.stats

# simulate conditioning on column totals
matrix(NA, 4, no.sims) -> binom.col.obs
binom.col.obs[1, ] <- rbinom(no.sims,n.1, pi1.)
binom.col.obs[2, ] <- n.1 - binom.col.obs[1, ]
binom.col.obs[3, ] <- rbinom(no.sims,n.2, pi1.)
binom.col.obs[4, ] <- n.2 - binom.col.obs[3, ]
apply(binom.col.obs, pearson.test.stat, MAR=2) -> binom.col.sim.stats

# simulate conditioning on both row & column totals
matrix(NA, 4, no.sims) -> hypergeom.obs
rhyper(no.sims, n.1, n.2, n1.) -> hypergeom.obs[1, ]
n1. - hypergeom.obs[1, ] -> hypergeom.obs[2, ]
n.1 - hypergeom.obs[1, ] -> hypergeom.obs[3, ]
n.. - hypergeom.obs[1, ] - hypergeom.obs[2, ] - hypergeom.obs[3, ] -> hypergeom.obs[4, ]
apply(hypergeom.obs, pearson.test.stat, MAR=2) -> hypergeom.sim.stats

Berger & Boos (1994), "P-values maximized over a confidence set for the nuisance parameter", JASA, 89, 427