margins=True option in pd.crosstab influences the outcome of chi2_contingency test. Why?

Question

I understand what the margins=True option in pd.crosstab does, but I don't understand why it would influence the outcome of chi2_contingency. Here is an example:

#data_crosstab:
srm       no  yes  All
version

<V4      132  105  237
V4        29   24   53
All      161  129  290
chi2_contingency(data_crosstab, correction=False)
#yields
(0.016817770389843306,
 0.9999648428969145,
 4,
 array([[131.57586207, 105.42413793, 237.        ],
        [ 29.42413793,  23.57586207,  53.        ],
        [161.        , 129.        , 290.        ]]))
#while
#data_crosstab:
srm       no  yes
version

<V4      132  105
V4        29   24
chi2_contingency(data_crosstab, correction=False)
#yields
(0.016817770389843306,
 0.896816958766594,
 1,
 array([[131.57586207, 105.42413793],
        [ 29.42413793,  23.57586207]]))

I see that the degrees of freedom (DOF) are different, but I really don't understand what role the margins option plays. Thanks!

J-J-J · Accepted Answer · 2023-02-22T21:26:17.933

Your question seems to be a mix of programming and statistical questions.

As for the statistical aspect, the difference in output comes from the fact that you are analyzing two different contingency tables with different degrees of freedom.

The degrees of freedom for the chi-square on a two-dimensional contingency table are calculated as:

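$k = (nrows - 1) \times (ncols - 1)$

where $k$ is the degrees of freedom, $nrows$ is the number of rows, and $ncols$ is the number of columns. The p-value is the upper-tail probability of a chi-square distribution with $k$ degrees of freedom, evaluated at the test statistic, so the same statistic gives a different p-value when $k$ changes.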
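As a minimal sketch of that dependence, the statistic reported in the question can be evaluated against chi-square distributions with 1 and 4 degrees of freedom. The helper name `degrees_of_freedom` is only illustrative and is not part of SciPy.

```python
from scipy.stats import chi2

# Chi-square statistic reported in the question (identical for both tables).
statistic = 0.016817770389843306


def degrees_of_freedom(nrows, ncols):
    """Illustrative helper: k = (nrows - 1) * (ncols - 1)."""
    return (nrows - 1) * (ncols - 1)


k_2x2 = degrees_of_freedom(2, 2)  # table without margins
k_3x3 = degrees_of_freedom(3, 3)  # table with the "All" row and column

# The p-value is the upper-tail (survival function) probability.
p_2x2 = chi2.sf(statistic, k_2x2)
p_3x3 = chi2.sf(statistic, k_3x3)

print(k_2x2, p_2x2)  # 1 degree of freedom, p close to 0.8968
print(k_3x3, p_3x3)  # 4 degrees of freedom, p close to 0.99996
```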

As for the programming aspect of your question, scipy.stats.chi2_contingency() does not "know" that the margins (called "All") are margins, and treats them as if they were just additional categories. So:

when you use the margins=True parameter, scipy.stats.chi2_contingency() sees a 3x3 table (i.e. 4 degrees of freedom), and treats the margins as if they were categories:

        category_A  category_B  category_C
cat_D          132         105         237
cat_E           29          24          53
cat_F          161         129         290

when you don't use margins=True, scipy.stats.chi2_contingency() simply sees a 2x2 table (i.e. 1 degree of freedom), as it should:

        category_A  category_B
cat_C          132         105
cat_D           29          24

As the degrees of freedom are different for these two tables (4 vs. 1), you end up with two different p-values, even though the two tables have the same chi-square statistic (0.016817770389843306).
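The comparison can be reproduced directly from the counts in the question. This is a sketch that uses plain NumPy arrays instead of the original DataFrame, with correction=False as in the question.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts only (rows: <V4, V4; columns: no, yes).
observed = np.array([[132, 105],
                     [29, 24]])

# Same counts with the "All" row and column appended, as margins=True produces.
with_margins = np.array([[132, 105, 237],
                         [29, 24, 53],
                         [161, 129, 290]])

stat_a, p_a, dof_a, _ = chi2_contingency(observed, correction=False)
stat_b, p_b, dof_b, _ = chi2_contingency(with_margins, correction=False)

print(stat_a, p_a, dof_a)  # 1 degree of freedom
print(stat_b, p_b, dof_b)  # same statistic, but 4 degrees of freedom
```

The statistic is unchanged because the expected counts in the "All" cells equal the observed ones, so those cells add nothing to the sum. Only the degrees of freedom, and therefore the p-value, differ.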

So in short, what you should do to get a correct result is:

chi2_contingency(pandas.crosstab(df["version"], df["srm"], margins=False))
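A fuller sketch of that call is shown here, assuming a DataFrame with "version" and "srm" columns. The original data are not available, so the rows below are hypothetical and rebuilt from the counts in the question. It also shows one way to strip the "All" row and column from a table that was already built with margins=True.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical raw data, reconstructed from the counts in the question.
counts = {("<V4", "no"): 132, ("<V4", "yes"): 105,
          ("V4", "no"): 29, ("V4", "yes"): 24}
rows = [(version, srm) for (version, srm), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["version", "srm"])

# margins=False is the default, so it can be omitted.
table = pd.crosstab(df["version"], df["srm"])

# If a table with margins already exists, drop the "All" row and column first.
with_margins = pd.crosstab(df["version"], df["srm"], margins=True)
stripped = with_margins.drop(index="All", columns="All")

stat, p, dof, expected = chi2_contingency(stripped, correction=False)
print(stat, p, dof)
```

correction=False matches the call in the question. With the default correction=True, SciPy applies Yates' continuity correction to tables with 1 degree of freedom, which changes the statistic and the p-value.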

Thank you very much for the detailed answer. I understand now!
In another answer (https://stats.stackexchange.com/questions/103876/what-does-conditioning-on-the-margins-of-mean), I found out that margins should be used when the margins are fixed. I assumed that is my case, so I am not sure I understand why margins=False would give me the correct result. — Chiara, Feb 23 '23 at 14:05
@Chiara chi2_contingency calculates the margins for you, which is why you have to pass the table without them. Setting margins=False in the crosstab function ensures that the table you pass does not contain the margins; chi2_contingency() then computes them under the hood. — J-J-J, Feb 23 '23 at 14:24
@Chiara You can also have a look at the chi2_contingency documentation. The tables they use in the "Examples" section do not contain margins, as margins are internally calculated by the function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html — J-J-J, Feb 23 '23 at 14:34
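To illustrate the point made in these comments, the expected frequencies can be derived by hand from the row and column totals of the margin-free table and compared with what chi2_contingency returns. This is a sketch of the standard expected-count formula (row total times column total, divided by the grand total), not a copy of SciPy's internal code.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Table without margins, as it should be passed to chi2_contingency.
observed = np.array([[132, 105],
                     [29, 24]])

# Margins computed from the observed counts.
row_totals = observed.sum(axis=1)   # [237, 53]
col_totals = observed.sum(axis=0)   # [161, 129]
grand_total = observed.sum()        # 290

# Expected count under independence for each cell.
expected_manual = np.outer(row_totals, col_totals) / grand_total

_, _, _, expected_scipy = chi2_contingency(observed, correction=False)

print(np.allclose(expected_manual, expected_scipy))
```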
