Background
I have the following contingency table
+-------+-----------+-----------+-------+
|       |     X     |   Not X   | Total |
+-------+-----------+-----------+-------+
| Y     |    60     |    20     |   80  |
+-------+-----------+-----------+-------+
| notY  |    15     |    40     |   55  |
+-------+-----------+-----------+-------+
| Total |    75     |    60     |  135  |
+-------+-----------+-----------+-------+
where I've included the row and column marginals as well as the table total.
I would like to know whether the rows and columns are independent or not. I could just use Pearson's Chi-Squared Test for Independence to get the p-value, which in R is
our_table <- matrix(c(60, 20, 15, 40), nrow=2, byrow=TRUE)
pvalue <- chisq.test(our_table, correct=FALSE)$p.value
This gives a p-value of $4.17\times10^{-8}$, which seems reasonable since this table does not look independent.
My Goal
As an exercise, I attempted to calculate the exact p-value for my contingency table by looking at every possible contingency table constrained to have 135 observations and summing the probabilities of those at least as extreme as mine. Note: I don't fix the row or column marginals, just as in the Chi-Squared Test for Independence.
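For scale (my own quick check, not part of the original argument): the number of 2x2 tables whose four nonnegative cells sum to $n=135$ is the number of compositions of 135 into 4 parts, which is small enough to enumerate by brute force.

from math import comb  # Python 3.8+

# Number of 2x2 tables whose 4 nonnegative cells sum to n = 135:
# compositions of 135 into 4 parts = C(135 + 3, 3)
print(comb(135 + 3, 3))  # 428536 -- easily enumerable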
Pearson's Chi-Squared statistic is just an approximation derived from a (generalized) likelihood ratio test[1], so I will be working with the likelihood ratio test directly.
Likelihood Ratio
The marginal probabilities for the rows and columns are $$\pi_{i\cdot}=\sum_{j=1}^2\pi_{ij}$$ and $$\pi_{\cdot j}=\sum_{i=1}^2\pi_{ij}$$ respectively, where $\pi_{ij}$ is the probability of an observation being in the i-th row and j-th column.
The null hypothesis is that the rows and columns are independent, i.e. $$\text{H}_0:\;\; \pi_{ij}=\pi_{i\cdot}\pi_{\cdot j},\;\; \forall i,j\;\;,\tag{1}$$ and under the null hypothesis, the MLEs for the $\pi_{ij}$ are given by[2] $$\hat{\pi}_{ij}=\hat{\pi}_{i\cdot}\hat{\pi}_{\cdot j} =\frac{n_{i\cdot}}{n}\frac{n_{\cdot j}}{n} \tag{2}$$ where $n=135$ is the total number of observations, $n_{i\cdot}=(80,55)$ are the row marginal counts and $n_{\cdot j}=(75,60)$ are the column marginal counts.
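Plugging in the observed marginals, for example, $$\hat{\pi}_{11}=\frac{80}{135}\cdot\frac{75}{135}\approx 0.329, \qquad \hat{\pi}_{22}=\frac{55}{135}\cdot\frac{60}{135}\approx 0.181\;.$$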
Under the alternative hypothesis, the cell probabilities are free, but must all sum to 1. Under this hypothesis, the MLEs for the cell probabilities are: $$\tilde{\pi}_{ij}=\frac{n_{ij}}{n}$$
The likelihood ratio is then given by $$\Lambda = \frac{\frac{n!}{n_{11}!n_{12}!n_{21}!n_{22}!} \hat{\pi}_{11}^{n_{11}} \hat{\pi}_{12}^{n_{12}} \hat{\pi}_{21}^{n_{21}} \hat{\pi}_{22}^{n_{22}}} {\frac{n!}{n_{11}!n_{12}!n_{21}!n_{22}!} \tilde{\pi}_{11}^{n_{11}} \tilde{\pi}_{12}^{n_{12}} \tilde{\pi}_{21}^{n_{21}} \tilde{\pi}_{22}^{n_{22}}} = \prod_{i=1}^2\prod_{j=1}^2 \left(\frac{\hat{\pi}_{ij}}{\tilde{\pi}_{ij}}\right)^{n_{ij}} \;\;.$$ Simplifying and taking the logarithm, we have $$\ln \Lambda = \sum_{i=1}^2\sum_{j=1}^2 n_{ij} \ln\left[\frac{n_{i\cdot}n_{\cdot j}}{n_{ij}n}\right]\;\;,\tag{3}$$ and for our table, $\ln\Lambda = -15.53$.
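As a quick numerical check on Eq (3) (a minimal standalone sketch of my own, using scipy only for the tail probability): $-2\ln\Lambda$ is asymptotically $\chi^2_1$, so its tail area should land in the same ballpark as the Pearson p-value.

import numpy as np
from scipy.stats import chi2

table = np.array([[60, 20], [15, 40]])
n = table.sum()
# expected counts under independence: n_{i.} * n_{.j} / n
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
# Eq (3): ln Lambda = sum_ij n_ij * ln(expected_ij / n_ij)
log_lambda = (table * np.log(expected / table)).sum()
print(log_lambda)                      # about -15.53
print(chi2.sf(-2 * log_lambda, df=1))  # about 2.5e-8, same order as 4.17e-8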
p-value
I have two nuisance parameters, $\pi_{1 \cdot}$ and $\pi_{\cdot 1}$. According to @Scortchi, "the p-value is defined as the maximum of the sum of the probabilities of the more extreme tables over all values of both these nuisance parameters". In other words, $$\text{p-value}= \;\max_{\pi_{1 \cdot}, \pi_{\cdot 1}} \;P\left(\ln \Lambda \le -15.53 \,\middle|\, \pi_{1 \cdot}, \pi_{\cdot 1} \right)$$ where the number $-15.53$ was calculated above and $\ln \Lambda$ is given by Eq (3).
This can be rewritten as $$\text{p-value}=\,\max_{\pi_{1 \cdot}, \pi_{\cdot 1}}\;\left\{ \sum_{n_{11},n_{12},n_{21},n_{22}} \;\; \frac{n!}{n_{11}!n_{12}!n_{21}!n_{22}!} \pi_{11}^{n_{11}} \pi_{12}^{n_{12}} \pi_{21}^{n_{21}} \pi_{22}^{n_{22}} \right\}$$ where the summation is over all possible contingency tables with $\ln \Lambda \le -15.53$, and where, per Eq (1), $\pi_{ij}=\pi_{i\cdot}\pi_{\cdot j}$.
The Python script below calculates this sum for a fixed choice of the nuisance parameters; it does not perform the maximization itself.
Correct me if I'm wrong, but I don't think there's a way to analytically get the nuisance parameters. We could use a computer to numerically maximize the p-value. I wonder if that would be a convex optimization...
I didn't implement the optimization, but I tested a couple of different choices of nuisance parameters, and the highest p-value I got was from $\pi_{1 \cdot}=0.8, \pi_{\cdot 1}=0.5$, with p-value = $4.72\times10^{-8}$, which is in good agreement with our original p-value of $4.17\times10^{-8}$. (A grid-search sketch along these lines is appended after the script below.)
Script
import numpy as np
from math import log, factorial, exp

our_table = np.array([[60, 20], [15, 40]])

# I found these particular choices for the 2 nuisance parameters to be good
# choices for maximizing the p-value (implementing a proper numerical
# optimization would be better, of course)
row_probs = np.array([0.8, 0.2])
col_probs = np.array([0.5, 0.5])
cell_probs = np.outer(row_probs, col_probs)

num_rows = our_table.shape[0]
num_cols = our_table.shape[1]
n = int(our_table.sum())

def binomial(N, k):
    '''N choose k (not actually used below)'''
    f = factorial
    return f(N) / (f(k) * f(N - k))

def log_likelihood_ratio(table):
    '''Find the log likelihood ratio (Eq (3)) under the setup of a chi-squared
    test for independence.

    table: array representing a contingency table
    '''
    row_marginals = table.sum(axis=1)
    col_marginals = table.sum(axis=0)
    n = table.sum()  # (same as n in the enclosing scope)
    # Now take the double summation in the formula for the log likelihood ratio
    log_lik_ratio = 0.0
    for row in range(num_rows):
        for col in range(num_cols):
            # if table[row, col] is zero, it is understood that this term is zero
            if table[row, col] != 0:
                log_lik_ratio += table[row, col] * log((row_marginals[row] * col_marginals[col])
                                                       / (n * table[row, col]))
    return log_lik_ratio

def prob_of_table(table, cell_probs):
    '''Given the particular settings of our nuisance parameters, calculate the
    probability of getting a given table, which is just given by a multinomial
    distribution.'''
    n = table.sum()  # (same as n in the enclosing scope)
    num_rows = table.shape[0]
    num_cols = table.shape[1]
    # Build up the log of the multinomial probability one cell at a time;
    # working with logarithms keeps the huge factorials computationally feasible
    log_prob = log(factorial(n))
    for row in range(num_rows):
        for col in range(num_cols):
            log_prob += table[row, col] * log(cell_probs[row, col]) \
                        - log(factorial(table[row, col]))
    return exp(log_prob)

our_log_lik_ratio = log_likelihood_ratio(our_table)

# Accumulate the probability of every contingency table whose log likelihood
# ratio is at most ours; "probability" means the multinomial probability of a
# table under the given nuisance parameters
cumul_prob = 0.0
# 3 nested for loops cover all possible ways to split the n observations into
# the 4 cells (3 degrees of freedom here)
for count11 in range(0, n + 1):
    for count12 in range(0, n - count11 + 1):
        for count21 in range(0, n - count11 - count12 + 1):
            count22 = n - count11 - count12 - count21
            cont_table = np.array([[count11, count12], [count21, count22]])
            if log_likelihood_ratio(cont_table) <= our_log_lik_ratio:
                cumul_prob += prob_of_table(cont_table, cell_probs)

p_value = cumul_prob
print('For the table', our_table, 'the p-value is:', p_value)
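Following up on the optimization idea above, here is a sketch (my own addition, not something I've run to completion; it is slow, since each grid point re-enumerates all ~4.3×10^5 tables) of a coarse grid search over the two nuisance parameters, reusing log_likelihood_ratio and prob_of_table from the script:

def p_value_at(pi_row, pi_col):
    '''p-value for the given nuisance parameters: re-run the enumeration
    above with the cell probabilities implied by (pi_row, pi_col).'''
    probs = np.outer([pi_row, 1.0 - pi_row], [pi_col, 1.0 - pi_col])
    total = 0.0
    for c11 in range(n + 1):
        for c12 in range(n - c11 + 1):
            for c21 in range(n - c11 - c12 + 1):
                t = np.array([[c11, c12], [c21, n - c11 - c12 - c21]])
                if log_likelihood_ratio(t) <= our_log_lik_ratio:
                    total += prob_of_table(t, probs)
    return total

# coarse 9x9 grid; refine by hand around the best point
grid = [0.1 * k for k in range(1, 10)]
best = max(((p_value_at(r, c), r, c) for r in grid for c in grid),
           key=lambda t: t[0])
print('max p-value %.3g at pi_row = %.1f, pi_col = %.1f' % best)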
[1]: See Mathematical Statistics and Data Analysis, 3rd ed. by John Rice, Section 13.4.
[2]: Rice, p. 522.

Comments
chisq.test(our_table, simulate.p.value=TRUE, B=100000000) run 3 times gave 6, 6 and 8 $\times 10^{-8}$ as p-values, suggesting a p-value on the order of $7\times 10^{-8}$ for a Pearson chi-square. – Glen_b Sep 26 '14 at 03:31
I tried chisq.test(our_table, simulate.p.value=T), but it turns out that to run the MC simulation, it generates tables with fixed row and column marginals (given by your particular marginals), generating exactly the same tables as fisher.test(our_table, simulate.p.value=T). (The function generating the tables for both is rcont2, which I found in R-3.0.1/src/library/stats/src/rcont.c.) I don't know what to make of that... – Garrett Sep 26 '14 at 07:20
See the Exact package, which is for unconditional tests in 2x2 tables. – Scortchi - Reinstate Monica Sep 26 '14 at 07:27