
Given two categorical variables $A$ and $B$ with the same number of categories, and observed frequencies $$\begin{array}{c c} A & B\\ a_1 & b_1\\ a_2 & b_2\\ a_3 & b_3\\ a_4 & b_4\\ a_5 & b_5\\ \end{array}$$ we can compute the $\chi^2$ statistic to test for independence: $$n_i=a_i+b_i\quad \quad m_1=\sum_{i=1}^5a_i\quad\quad m_2=\sum_{i=1}^5b_i\quad\quad N=m_1+m_2$$ $$A_i=\frac{m_1n_i}{N}\quad\quad B_i=\frac{m_2n_i}{N}$$ $$\chi^2=\sum_{i=1}^5\frac{(a_i-A_i)^2}{A_i}+\sum_{i=1}^5\frac{(b_i-B_i)^2}{B_i}$$

and Cramér's V is calculated as $$V=\sqrt{\frac{\chi^2}{N\min(r-1,\,c-1)}}$$ where $r$ and $c$ are the numbers of rows and columns, so here $\min(r-1,\,c-1)=1$.
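For concreteness, here is a small Python sketch of how I am computing these quantities (the counts are made up):

import numpy as np

# Made-up frequencies for the five categories of A and B
a = np.array([10, 20, 30, 40, 50])
b = np.array([12, 18, 33, 38, 55])

n = a + b                     # n_i = a_i + b_i
m1, m2 = a.sum(), b.sum()     # m_1 and m_2
N = m1 + m2                   # grand total

A_exp = m1 * n / N            # expected counts A_i
B_exp = m2 * n / N            # expected counts B_i

chi2 = ((a - A_exp) ** 2 / A_exp).sum() + ((b - B_exp) ** 2 / B_exp).sum()
V = np.sqrt(chi2 / (N * min(5 - 1, 2 - 1)))   # min(r-1, c-1) = 1 for a 5x2 table
print(chi2, V)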

Is this correct? Maybe all my questions come from a misunderstanding of these formulas.

Here is my understanding: $\chi^2$ can take values from $0$ to $\infty$, and $V$ lies between $0$ and $1$. The closer $V$ is to $1$, the stronger the association between $A$ and $B$.

But let's say $A$ and $B$ are identical ($a_i=b_i$). Then $\chi^2=0$ and we sustain the null hypothesis, i.e. $A$ and $B$ are dependent. But then $V=0$ and we conclude that there is no association between $A$ and $B$. I don't understand.

The greater the value of $\chi^2$, the greater the chance that we reject the null hypothesis, i.e. conclude that $A$ and $B$ are independent; but then $V$ is also very large, which says that $A$ and $B$ are strongly associated. The converse also seems counterintuitive.

Many thanks!

  • Welcome to CV. Since you’re new here, you may want to take our [tour], which has information for new users. I think there is a misunderstanding here. If A and B are identical, $\chi^2$ would be very high and Cramér's V would be 1. It also looks like you are not working with a contingency table (cross-tabulation) but univariate distributions separately. And I am not sure that is the right way to calculate $\chi^2$ – T.E.G. Dec 10 '21 at 10:55
  • @T.E.G. Hi, thanks for the link. Do you know how I can resolve these misconceptions? – augustoperez Dec 10 '21 at 11:20

1 Answer


I think your confusion comes from several things.

First, your calculations are theoretically correct, but they apply to a two-way contingency table. The rub is that you seem to have an incorrect definition of what a contingency table is, so you're likely applying your calculations to the wrong object.

If your table were a two-way contingency table, $A$ and $B$ would not be variables but levels of a single variable, and the rows would be the levels of another single variable.

To give a more concrete example, if you wanted to test the association between the variables "favorite programming language" and "favorite color", here is what a contingency table would look like:

         Python     R   Total
Blue        450   520     970
Red        1800  2080    3880
Green       450   520     970
Yellow      120   140     260
Total      2820  3260    6080

Here, the variable "programming language" is allocated to 2 columns representing 2 levels (Python and R), while the variable "favorite color" is allocated to 4 rows representing 4 levels (blue, red, green, and yellow). The table reads as follows: 450 Python users prefer the color blue, 1800 Python users prefer the color red, 520 R users prefer the color blue, 140 R users prefer the color yellow, etc.

You would have to apply your calculations to this $4\times2$ table.
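For example, here is a rough sketch of that hand calculation in Python, using NumPy and the counts above:

import numpy as np

# Observed counts: rows = colors (blue, red, green, yellow), columns = languages (Python, R)
observed = np.array([[450, 520],
                     [1800, 2080],
                     [450, 520],
                     [120, 140]])

row_totals = observed.sum(axis=1, keepdims=True)   # 970, 3880, 970, 260
col_totals = observed.sum(axis=0, keepdims=True)   # 2820, 3260
N = observed.sum()                                 # 6080

# Expected counts under independence: (row total * column total) / N
expected = row_totals * col_totals / N
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)   # about 0.0057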

All of this assumes that each individual represented in this table can pick only one language and only one color. Otherwise, the table would violate the assumption of mutually exclusive levels within a variable, which is necessary for a valid chi-squared test of independence.


Second, in a chi-squared test of independence the null hypothesis is independence, not dependence. This is contrary to what you assume when you write "we sustain the null hypothesis, i.e. $A$ and $B$ are dependent", and it seems to be another major source of your confusion.


Finally, you seem to be mixing up the chi-squared statistic and the chi-squared test p-value when you say "Then $\chi^2=0$ and we sustain the null hypothesis". You decide whether to reject the null hypothesis based on the p-value calculated from the chi-squared statistic, not on the statistic alone.

Indeed, once you have calculated the chi-squared statistic, you still have to compare it against the chi-squared distribution, which gives you the probability (p-value) of observing a statistic at least that large with these degrees of freedom, if the null hypothesis is true.

If you want to calculate the p-value by hand, you will find an explanation here: How is $\chi^2$ value converted to p-value? But it's really a bore to do by hand, and most data analysis tools offer functions that run the p-value calculation for you. Before computers were as widespread as they are today, people usually referred to tables of p-values vs. $\chi^2$ values to get a p-value from a given chi-squared statistic and degrees of freedom. There is an example of such a table in the Wikipedia article on the subject.
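For example, in Python the p-value is just the survival function of the chi-squared distribution evaluated at the statistic (here, the statistic and degrees of freedom from the hand calculation above):

from scipy.stats import chi2

# P(X >= statistic) under a chi-squared distribution with 3 degrees of freedom
p_value = chi2.sf(0.0056642907768638725, df=3)
print(p_value)   # about 0.99989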


In conclusion: in the "programming language vs. favorite color" contingency table above, a chi-squared test of independence yields a chi-squared statistic very close to zero ($\approx 0.006$). Testing this value against the chi-squared distribution with 3 degrees of freedom gives a p-value very close to one ($\approx 0.99989$ according to SciPy's chi2_contingency function). This means you would certainly not reject the null hypothesis, i.e. you would not reject the hypothesis that "programming language" and "favorite color" are independent.

This is in line with the extremely small Cramér's V value resulting from this table ($\approx 0.001$).
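If you want to verify that value yourself, Cramér's V follows directly from the chi-squared statistic; here $\min(r-1,\,c-1)=1$ for the $4\times2$ table:

import numpy as np

chi2_stat = 0.0056642907768638725   # chi-squared statistic from above
N = 6080                            # grand total of the table
V = np.sqrt(chi2_stat / (N * min(4 - 1, 2 - 1)))
print(V)   # about 0.001

(If I'm not mistaken, recent SciPy versions also offer scipy.stats.contingency.association(observed, method="cramer") if you prefer a built-in.)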


While on the subject, unless you want to do it for self-learning purposes, it's generally simpler and better to use the automatic chi-squared test functions that statistical software offers, rather than calculating everything by hand as you did in your original question. It takes literally two or three lines of code. Here is an example in Python, with the counts from the "programming language vs. favorite color" contingency table:

from scipy import stats

# Observed counts: one row per language (Python, R), one column per color (blue, red, green, yellow)
my_table = [[450, 1800, 450, 120],
            [520, 2080, 520, 140]]
stats.chi2_contingency(my_table)

returns:

(0.0056642907768638725, #chi-squared statistic
 0.9998868122974477, #p-value; very close to 1, so it's probably unwise to reject the null hypothesis of independence
 3, #degrees of freedom
 array([[ 449.90131579, 1799.60526316,  449.90131579,  120.59210526],
        [ 520.09868421, 2080.39473684,  520.09868421,  139.40789474]])) #expected values
– J-J-J