
Suppose I have random variables $X_1,\ldots,X_N$ with $X_j\sim \text{Bernoulli}(p_j)$, and I have a random sample of size $n$ on them: $(X_{1i},\ldots,X_{Ni})_{i=1}^n$. How can I test $H_0: X_1,\ldots,X_N$ are mutually independent (vs. $H_1$: they are not mutually independent)? A candidate test statistic I can think of is constructed as follows.

Because, by the definition of mutual independence, $H_0$ holds if and only if $\Pr((X_1,\ldots,X_N)=(x_1,\ldots,x_N))=\Pr(X_1=x_1)\times\cdots\times \Pr(X_N=x_N)$ for every $(x_1,\ldots,x_N)$, the candidate test statistic just uses the sample analog of this condition:

I estimate $\Pr((X_1,\ldots,X_N)=(x_1,\ldots,x_N))$ using $\hat{p}_{x_1\ldots x_N}=\frac{1}{n}\sum_{i=1}^n\mathbf{1}\big((X_{1i},\ldots,X_{Ni})=(x_1,\ldots,x_N)\big)$, and estimate $\Pr(X_j=x_j)$ using $\hat{p}_{j,x_j}=\frac{1}{n}\sum_{i=1}^n\mathbf{1}(X_{ji}=x_j)$. Then I construct the following test statistic:

$$T=n\left\|\begin{bmatrix}\hat{p}_{11\ldots 1}-\hat{p}_{1,1}\,\hat{p}_{2,1}\cdots\hat{p}_{N,1}\\ \hat{p}_{01\ldots 1}-\hat{p}_{1,0}\,\hat{p}_{2,1}\cdots\hat{p}_{N,1}\\ \vdots\\ \hat{p}_{0\ldots 01}-\hat{p}_{1,0}\,\hat{p}_{2,0}\cdots\hat{p}_{N,1}\end{bmatrix}\right\|_{\hat{V}^{-1}}^2,$$

where $\|v\|_{\hat{V}^{-1}}^2 = v^\top\hat{V}^{-1}v$ and the vector inside the norm has $2^N-1$ rows as I run over all possible $(x_1,\ldots,x_N)$ except $(x_1,\ldots,x_N)=(0,\ldots,0)$. I need to exclude $(0,\ldots,0)$ because its entry is implied by the other $2^N-1$ entries: both the joint probabilities and the products of marginals sum to one, so the differences sum to zero. $\hat{V}$ is the estimated asymptotic variance-covariance matrix of the vector inside the norm. I reject the null for large values of $T$.
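For concreteness, the pieces above could be computed along these lines (a minimal R sketch; `dat` and all other names are purely illustrative, not from any package):

# minimal sketch of the estimators; `dat` is an illustrative n x N 0/1 matrix
dat <- matrix(rbinom(500 * 4, 1, 0.5), nrow = 500)   # toy sample, n = 500, N = 4
n <- nrow(dat); N <- ncol(dat)

cells <- expand.grid(rep(list(c(0, 1)), N))          # all 2^N value combinations
keys  <- apply(cells, 1, paste, collapse = "")

# hat p_{x_1...x_N}: joint cell frequencies, in the fixed order of `keys`
p_joint <- as.numeric(table(factor(apply(dat, 1, paste, collapse = ""),
                                   levels = keys))) / n

# hat p_{j,1} = estimated P(X_j = 1); hat p_{j,0} = 1 - hat p_{j,1}
p_marg1 <- colMeans(dat)

# product of marginals, cell by cell
p_prod <- apply(cells, 1, function(x) prod(ifelse(x == 1, p_marg1, 1 - p_marg1)))

# discrepancy vector without the redundant (0,...,0) cell;
# T = n * t(d) %*% solve(V_hat) %*% d once V_hat is estimated
d <- (p_joint - p_prod)[keys != strrep("0", N)]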

I have three questions:

  1. Could this test do the job? Intuitively I think it can.

  2. Intuitively I think this test statistic should have an asymptotic $\chi^2_{2^N-1}$ distribution, but because the form of the estimated asymptotic variance $\hat{V}$ is very complex, I would prefer to implement the test using the bootstrap. My question is how to use the bootstrap to carry out the test and circumvent the need to compute $\hat{V}$ analytically.

  3. Are there any other, better tests that could also do the job?

Thanks!

ExcitedSnail
  • 2,768

1 Answer


Nothing new under the sun: you are reinventing higher-dimensional contingency tables. See here: https://stats.stackexchange.com/a/148174/341520

Also, you have overlooked subtracting the degrees of freedom for estimating the $P(X_i = 1)$ with the $\hat{p}_{i,1}$; the $\hat{p}_{i,0}$ are redundant, but a nice notational touch.
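For what it's worth, the standard count (assuming the usual multinomial goodness-of-fit theory with estimated parameters carries over to your setup): the saturated model has $2^N - 1$ free cell probabilities and the independence model estimates $N$ marginal probabilities, so

$$\text{df} = (2^N - 1) - N,$$

which is $2^{10} - 1 - 10 = 1013$ for the simulation below.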

Here's some R code that demonstrates how to use log-linear models in your case:

library(tidyverse)
library(MASS)

n <- 5000
N <- 10
p <- rbeta(N, 41, 41) # nice ps around 0.5
hist(p)
res <- t(replicate(n, {rbinom(N, size = 1, prob = p)}))
df <- as.data.frame(res)

# I don't want to use tab, because it doesn't work with formula = ~ . ,
# so I have to make sure all combinations show up, even those that
# aren't in the data
tab <- table(df)
which(tab == 0)
eval_str <- paste("expand.grid(", paste(rep("c(0,1)", N), collapse = ","), ")", sep = "")
combs <- eval(parse(text = eval_str))

# first add on the extra combinations ...
df_counts <- rbind(df, setNames(combs, names(df))) %>%
  group_by_all() %>%
  summarise(COUNT = n() - 1) # ... now subtract the extras again

# all is well
loglm(COUNT ~ ., data = df_counts)

# increase p_{1,1} by 25% depending on V2-V5
# (rbinom(n(), ...) draws per row; the original rbinom(1, ...) drew a
# single value shared by all affected rows)
df2 <- df %>%
  mutate(V1 = if_else(V2 == V3 & V4 == 1 & V5 == 0,
                      rbinom(n(), 1, p[1] * 1.25), V1))
df2_count <- rbind(df2, setNames(combs, names(df))) %>%
  group_by_all() %>%
  summarise(COUNT = n() - 1)

# problem detected
loglm(COUNT ~ ., data = df2_count)

# it's a 5-level interaction; with hundreds of coefficients the model struggles
loglm(COUNT ~ (.)^5, data = df2_count)
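If you want the p-value as a number rather than reading it off the printout, the fitted loglm object stores the likelihood-ratio statistic and its degrees of freedom (component names lrt and df, as I know them from MASS; verify against your version):

# LR (deviance) test of the independence model against the saturated model
fit <- loglm(COUNT ~ ., data = df_counts)
pchisq(fit$lrt, fit$df, lower.tail = FALSE)   # p-value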

What did I do in the code:

I set $n = 5000$, $N = 10$, and I draw the true $(p_j)_{j=1,\ldots,N}$ from a Beta distribution with $\alpha = \beta = 41$. I then put my simulation results into a data frame which I (badly, given that 'df' also means degrees of freedom in this context) call df for short. df is basically a matrix with each realization $(X_{1i}, \ldots, X_{Ni})$ as its $i$th row.

Then I create df_counts, which contains all $2^N$ possible combinations of $(x_1, \ldots, x_N)$, each $x_j$ in its own column V1, ..., VN, and a final column called COUNT showing $\sum_{i=1}^n \mathbf{1}((X_{1i}, \ldots, X_{Ni}) = (x_1, \ldots, x_N))$, including the combinations which didn't show up in the simulation; that is the difficult part in the middle of the code. I then fit a log-linear model with ~ ., which means "use all remaining columns". How log-linear models work you will have to research yourself (https://en.wikipedia.org/wiki/Log-linear_analysis), but the basic idea under independence is $\log(p_{x_1,\ldots,x_N}) = \log(P(X_1 = x_1)) + \ldots + \log(P(X_N = x_N))$.
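Spelled out (standard log-linear theory, nothing specific to my code): with $\mu_{x_1 \ldots x_N}$ the expected count in cell $(x_1, \ldots, x_N)$, the mutual-independence model is

$$\log \mu_{x_1 \ldots x_N} = \lambda + \sum_{j=1}^{N} \lambda_j(x_j),$$

and the deviance against the saturated model,

$$G^2 = 2 \sum_{\text{cells}} O \log\frac{O}{E},$$

with $O$ the observed and $E$ the fitted counts, is asymptotically $\chi^2$ with $2^N - 1 - N$ degrees of freedom under $H_0$.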

The result is just that: the deviance from this model to the saturated one (one parameter per cell count) is chi-squared distributed and sensibly sized. I then create a dependence via $P(X_1 = 1 \mid X_2 = X_3 \wedge X_4 = 1 \wedge X_5 = 0) = 1.25 \cdot P(X_1 = 1 \mid \text{otherwise})$, and the model picks up on the disturbance.

I can then add interactions, which are basically parameters for intersections of events; here I have to model 5 levels deep.
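As for your question 2: one way to sidestep $\hat{V}$ entirely is a permutation scheme rather than a literal bootstrap. Independently permuting each column of the data imposes $H_0$ while keeping every marginal distribution fixed, so the unweighted distance $n\sum_x(\hat{p}_x - \prod_j \hat{p}_{j,x_j})^2$ can be compared against its permutation distribution without ever inverting $\hat{V}$. A sketch, assuming the raw data sit in an n x N 0/1 matrix (function names are my own, purely illustrative):

# sketch: permutation test for mutual independence of binary variables
indep_stat <- function(m) {
  n <- nrow(m); N <- ncol(m)
  cells <- expand.grid(rep(list(c(0, 1)), N))
  keys <- apply(cells, 1, paste, collapse = "")
  # joint cell frequencies, in the fixed order of `keys`
  p_joint <- as.numeric(table(factor(apply(m, 1, paste, collapse = ""),
                                     levels = keys))) / n
  p_marg <- colMeans(m)
  p_prod <- apply(cells, 1, function(x) prod(ifelse(x == 1, p_marg, 1 - p_marg)))
  n * sum((p_joint - p_prod)^2)   # unweighted distance, no V-hat needed
}

perm_test <- function(m, B = 999) {
  t_obs <- indep_stat(m)
  # permuting each column independently imposes H0, marginals unchanged
  t_null <- replicate(B, indep_stat(apply(m, 2, sample)))
  mean(c(t_null, t_obs) >= t_obs)   # permutation p-value
}

perm_test(res[, 1:5])   # on the simulated data from above (first 5 variables)

The unweighted statistic is no longer asymptotically $\chi^2$, but the permutation distribution supplies valid critical values under $H_0$ regardless; a studentized ($\hat{V}$-weighted) version would have to recompute $\hat{V}$ inside every resample, which is exactly what you wanted to avoid.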

Lukas Lohse
  • Thanks, Lukas! This is very helpful! You are right, I need to subtract the dfs for estimating these probabilities; I will try to figure out the exact number. Also, since I'm not an R user, could you also illustrate what you are doing in the R code using math formulas? Thanks again! – ExcitedSnail Nov 17 '22 at 15:30
  • Ok, but only because I'm training my LaTeX. Also I highly recommend you pick a programming language to support your investigations with some simulations. – Lukas Lohse Nov 17 '22 at 20:18
  • Thank you very much! – ExcitedSnail Nov 20 '22 at 03:28