
I am trying to determine whether there is a dependence between two categorical columns, or a significant co-occurrence of two categorical variables. Say I have columns A and B, and out of 200K rows A is '1' 13683 times and B is '1' 35171 times. Of these, 2871 rows have both A and B equal to '1' at the same time.

How do I calculate whether there is a dependence between the two columns? I tried calculating the expected ratios for A and B, 0.0684 and 0.1759, and, assuming both are binomial, the probability of A and B both being '1' at random is 0.0684 * 0.1759 ≈ 0.01203. So any observed ratio significantly higher than this could signify the co-occurrence is not due to random chance? For that, I had the idea of using 0.01203 as the "population parameter" and checking with a binomial test whether the observed ratio is significantly higher. Is a chi-square test better for this? How would it be used here?
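
Here is a minimal sketch in R of the binomial-test idea (treating the product of the marginal rates as the hypothesised probability of the joint event; I am not sure this is the right approach, hence the question):

# joint occurrences of A=1 and B=1 out of 200K rows, tested against the
# rate expected if A and B were independent (0.0684 * 0.1759 ≈ 0.01203)
binom.test(x = 2871, n = 200000, p = 0.0684 * 0.1759, alternative = "greater")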

Thanks,

BBSysDyn
  • Do you want to check if two categorical variables $A$ and $B$, each with two levels ($A = 1$ and $A \ne 1$ plus $B = 1$ and $B \ne 1$) are independent? – Alexey Grigorev Aug 05 '14 at 15:20
  • Just A=1 and B=1 is enough. I want to check if the occurrence of such an event, based on counts, is significant, and by how much. – BBSysDyn Aug 05 '14 at 16:04

1 Answer


$\chi^2$ Test

For a $\chi^2$ test of independence we need a contingency table, so we assume two levels for each variable: $\text{dom}(A) = \text{dom}(B) = \{1, \text{not } 1\}$.

With a little bit of calculation we get the following table of observed counts:

\begin{array}{|c|r|r|r|} \hline B \backslash A & 1 & \text{not } 1 & \text{total} \\ \hline 1 & 2871 & 32300 & 35171 \\ \text{not } 1 & 10812 & 154017 & 164829 \\ \hline \text{total} & 13683 & 186317 & 200000 \\ \hline \end{array}

Using the (rounded) marginal proportions $P(A=1) \approx 0.068$ and $P(B=1) \approx 0.175$ from the question, the expected counts under independence are:

\begin{array}{|c|r|r|r|} \hline B \backslash A & 1 & \text{not } 1 & \text{total} \\ \hline 1 & 2380 & 32620 & 35000 \\ \text{not } 1 & 11220 & 153780 & 165000 \\ \hline \text{total} & 13600 & 186400 & 200000 \\ \hline \end{array}
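
For example, the top-left cell is $E_{1,1} = P(B=1)\,P(A=1)\,n = 0.175 \times 0.068 \times 200000 = 2380$.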

Now we calculate the normalized squared difference $X^2 = \sum_i \frac{(O_i-E_i)^2}{E_i}$, which under independence follows the $\chi^2$ distribution with $\text{df} = (2-1)(2-1) = 1$ for a $2 \times 2$ table.
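
For instance, the top-left cell contributes $\frac{(2871 - 2380)^2}{2380} \approx 101.3$ to the sum; the other three cells contribute the rest.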

In this case $X^2 = 119.6353$, which would be very unusual to observe ($p \approx 10^{-28}$) under the assumption that $A$ and $B$ are independent. So we conclude that the variables are not independent.

We can also calculate Cramér's V, which for a $2 \times 2$ table is $v = \sqrt{X^2 / n} = \sqrt{119.6353 / 200000} \approx 0.02$, meaning that although the dependence is statistically significant, the association itself is very weak.

R code

# observed counts; rows correspond to B and columns to A, as in the table above
obs = matrix(data=c(2871, 32300, 10812, 154017), nrow=2, ncol=2, byrow=TRUE,
             dimnames=list(B=c('1', 'not 1'), A=c('1', 'not 1')))

# expected counts under independence (from the rounded proportions 0.068 and 0.175)
expected = matrix(data=c(2380, 32620, 11220, 153780), nrow=2, ncol=2, byrow=TRUE,
                  dimnames=list(B=c('1', 'not 1'), A=c('1', 'not 1')))

# chi-squared statistic and its p-value with df = 1
chi2 = sum((obs - expected)^2 / expected)
pchisq(q=chi2, df=1, lower.tail=FALSE)

# Cramér's V
sqrt(chi2 / sum(obs))
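
For comparison, base R's built-in test derives the expected counts from the observed margins rather than from rounded proportions, so its statistic differs slightly from the hand calculation above (correct=FALSE disables Yates' continuity correction, which is applied by default to 2x2 tables):

# Pearson's chi-squared test of independence on the observed table
chisq.test(obs, correct=FALSE)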
