
I am trying to determine whether there is a dependence between two categorical columns, or a significant co-occurrence of two categorical variables. Say I have columns A and B, and out of 200K rows A is '1' 13683 times and B is '1' 35171 times. Of these, 2871 rows have both A and B equal to '1' at the same time.

How do I calculate whether there is a dependence between the two columns? I tried calculating the expected ratios for A and B, 0.0684 and 0.1759, and, assuming both are binomial, the probability of A and B both being '1' at random is 0.0684 * 0.1759 ≈ 0.01203. So any observed ratio significantly higher than this could signify the co-occurrence is not due to random chance? For that, I had the idea of using 0.01203 as the "population parameter" and checking with a binomial test whether the observed ratio is significantly higher. Is a chi-square test better for this? How would it be used here?
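
Here is a minimal sketch in R of the binomial-test idea (treating the product of the marginal rates as the hypothesised probability of the joint event; I am not sure this is the right approach, hence the question):

# joint occurrences of A=1 and B=1 out of 200K rows, tested against the
# rate expected if A and B were independent (0.0684 * 0.1759 ≈ 0.01203)
binom.test(x = 2871, n = 200000, p = 0.0684 * 0.1759, alternative = "greater")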

Thanks,

BBSysDyn
  • Do you want to check if two categorical variables $A$ and $B$, each with two levels ($A = 1$ and $A \ne 1$ plus $B = 1$ and $B \ne 1$) are independent? – Alexey Grigorev Aug 05 '14 at 15:20
  • Just A=1 and B=1 is enough. I want to check if the occurrence of such an event, based on counts, is significant, and by how much. – BBSysDyn Aug 05 '14 at 16:04

1 Answer


$\chi^2$ Test

For a $\chi^2$ test of independence we need a contingency table, so we assume two levels for each variable: $\text{dom}(A) = \text{dom}(B) = \{1, \text{not } 1\}$.

With a little bit of calculation we get the following table of observed counts:

\begin{array}{|c|r|r|r|} \hline B \backslash A & 1 & \text{not } 1 & \text{total} \\ \hline 1 & 2871 & 32300 & 35171 \\ \text{not } 1 & 10812 & 154017 & 164829 \\ \hline \text{total} & 13683 & 186317 & 200000 \\ \hline \end{array}

Using the (rounded) marginal proportions $P(A=1) \approx 0.068$ and $P(B=1) \approx 0.175$ from the question, the expected counts under independence are:

\begin{array}{|c|r|r|r|} \hline B \backslash A & 1 & \text{not } 1 & \text{total} \\ \hline 1 & 2380 & 32620 & 35000 \\ \text{not } 1 & 11220 & 153780 & 165000 \\ \hline \text{total} & 13600 & 186400 & 200000 \\ \hline \end{array}
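
For example, the top-left cell is $E_{1,1} = P(B=1)\,P(A=1)\,n = 0.175 \times 0.068 \times 200000 = 2380$.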

Now we calculate the normalized squared difference $X^2 = \sum_i \frac{(O_i-E_i)^2}{E_i}$, which under independence follows the $\chi^2$ distribution with $\text{df} = (2-1)(2-1) = 1$ for a $2 \times 2$ table.
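
For instance, the top-left cell contributes $\frac{(2871 - 2380)^2}{2380} \approx 101.3$ to the sum; the other three cells contribute the rest.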

In this case $X^2 = 119.6353$, which would be very unusual to observe ($p \approx 10^{-28}$) under the assumption that $A$ and $B$ are independent. So we conclude that the variables are not independent.

We can also calculate Cramér's V, which for a $2 \times 2$ table is $v = \sqrt{X^2 / n} = \sqrt{119.6353 / 200000} \approx 0.02$, meaning that although the dependence is statistically significant, the association itself is very weak.

R code

# observed counts; rows correspond to B and columns to A, as in the table above
obs = matrix(data=c(2871, 32300, 10812, 154017), nrow=2, ncol=2, byrow=TRUE,
             dimnames=list(B=c('1', 'not 1'), A=c('1', 'not 1')))

# expected counts under independence (from the rounded proportions 0.068 and 0.175)
expected = matrix(data=c(2380, 32620, 11220, 153780), nrow=2, ncol=2, byrow=TRUE,
                  dimnames=list(B=c('1', 'not 1'), A=c('1', 'not 1')))

# chi-squared statistic and its p-value with df = 1
chi2 = sum((obs - expected)^2 / expected)
pchisq(q=chi2, df=1, lower.tail=FALSE)

# Cramér's V
sqrt(chi2 / sum(obs))
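
For comparison, base R's built-in test derives the expected counts from the observed margins rather than from rounded proportions, so its statistic differs slightly from the hand calculation above (correct=FALSE disables Yates' continuity correction, which is applied by default to 2x2 tables):

# Pearson's chi-squared test of independence on the observed table
chisq.test(obs, correct=FALSE)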
