I understand what the margins=True option in pd.crosstab does, but I don't understand why it would influence the outcome of chi2_contingency. Here is an example:

#data_crosstab:

srm      no  yes  All
version
<V4     132  105  237
V4       29   24   53
All     161  129  290

chi2_contingency(data_crosstab, correction=False)
#yields
(0.016817770389843306, 0.9999648428969145, 4,
 array([[131.57586207, 105.42413793, 237.        ],
        [ 29.42413793,  23.57586207,  53.        ],
        [161.        , 129.        , 290.        ]]))

#while
#data_crosstab:

srm      no  yes
version
<V4     132  105
V4       29   24

chi2_contingency(data_crosstab, correction=False)
#yields
(0.016817770389843306, 0.896816958766594, 1,
 array([[131.57586207, 105.42413793],
        [ 29.42413793,  23.57586207]]))

I see that the degrees of freedom are different, but I really don't understand the role of the margins option. Thanks!

Chiara

1 Answer

Your question seems to be a mix of programming and statistical questions.

As for the statistical aspect, the difference in output comes from the fact that you are analyzing two different contingency tables, which have different degrees of freedom.

The degrees of freedom for the chi-square on a two-dimensional contingency table are calculated as:

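$k = (n_{\text{rows}} - 1)(n_{\text{cols}} - 1)$

where $k$ is the degrees of freedom, $n_{\text{rows}}$ is the number of rows, and $n_{\text{cols}}$ is the number of columns. The degrees of freedom directly affect the p-value: the p-value is the upper-tail probability of the statistic under a chi-square distribution with $k$ degrees of freedom, so the same statistic gives a different p-value when $k$ changes.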
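To make the effect of the degrees of freedom concrete, here is a minimal sketch that plugs the statistic reported in the question into the chi-square survival function for a 2×2 and a 3×3 table. The helper name `dof` is only illustrative and is not part of scipy.

```python
from scipy.stats import chi2


def dof(n_rows, n_cols):
    """Degrees of freedom of a chi-square test on an n_rows x n_cols table."""
    return (n_rows - 1) * (n_cols - 1)


# Chi-square statistic reported in the question (identical for both tables).
stat = 0.016817770389843306

# The p-value is the upper-tail probability of the chi-square distribution.
p_2x2 = chi2.sf(stat, dof(2, 2))  # 1 degree of freedom
p_3x3 = chi2.sf(stat, dof(3, 3))  # 4 degrees of freedom

print(dof(2, 2), p_2x2)
print(dof(3, 3), p_3x3)
```

The two printed p-values should match the ones in the question (about 0.8968 and 0.99996), even though the statistic is the same.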

As for the programming aspect of your question, scipy.stats.chi2_contingency() does not "know" that the margins (labelled "All") are margins, and treats them as if they were just additional categories. So:

  • when you use margins=True, scipy.stats.chi2_contingency() sees a 3×3 table (i.e. 4 degrees of freedom), and treats the margins as if they were categories:

        category_A  category_B  category_C
cat_D          132         105         237
cat_E           29          24          53
cat_F          161         129         290

  • when you don't use margins=True, scipy.stats.chi2_contingency() simply sees a 2×2 table (i.e. 1 degree of freedom), as it should:

        category_A  category_B
cat_C          132         105
cat_D           29          24

As the degrees of freedom are different for these two tables (4 vs. 1), you end up with two different p-values, even though the two tables have the same chi-square statistic (0.016817770389843306).
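The comparison can be reproduced directly from the counts in the question. This is a small sketch; the array names `observed` and `with_margins` are illustrative, and `correction=False` is used only to match the calls in the question.

```python
import numpy as np
from scipy.stats import chi2_contingency

# 2x2 table of observed counts (no margins).
observed = np.array([[132, 105],
                     [29, 24]])

# Same counts with the "All" row and column appended, as margins=True produces.
with_margins = np.array([[132, 105, 237],
                         [29, 24, 53],
                         [161, 129, 290]])

stat_a, p_a, dof_a, _ = chi2_contingency(observed, correction=False)
stat_b, p_b, dof_b, _ = chi2_contingency(with_margins, correction=False)

print(stat_a, p_a, dof_a)
print(stat_b, p_b, dof_b)
```

The statistic is unchanged because the expected counts in the margin cells equal their observed counts, so those cells contribute nothing to the sum; only the degrees of freedom, and therefore the p-value, differ.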

So in short, what you should do to get a correct result is:

chi2_contingency(pandas.crosstab(df["version"], df["srm"], margins=False))
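If you already have a crosstab built with margins=True, one option is to drop the "All" row and column before running the test. The sketch below assumes a DataFrame `df` with "version" and "srm" columns, rebuilt here from the counts in the question purely for illustration; the real data may look different. It passes `correction=False` to match the numbers in the question, whereas the call above leaves `correction` at its default.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical raw data reconstructed from the counts in the question.
counts = {("<V4", "no"): 132, ("<V4", "yes"): 105,
          ("V4", "no"): 29, ("V4", "yes"): 24}
rows = [(version, srm) for (version, srm), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["version", "srm"])

# Crosstab with margins, e.g. for display purposes.
with_margins = pd.crosstab(df["version"], df["srm"], margins=True)

# Remove the "All" row and column so only the observed counts remain.
table = with_margins.drop(index="All", columns="All")

stat, p, dof, expected = chi2_contingency(table, correction=False)
print(table)
print(stat, p, dof)
```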
J-J-J
  • Thank you very much for the detailed answer. I understand now!

    In another [answer](https://stats.stackexchange.com/questions/103876/what-does-conditioning-on-the-margins-of-mean), I read that margins should be used when the margins are fixed. I assumed that is my case, so I am not sure I understand why margins=False would give me the correct result.

    – Chiara Feb 23 '23 at 14:05
  • @Chiara chi2_contingency calculates the margins for you automatically, which is why you have to pass the table without them. The parameter margins=False in the crosstab function makes sure that the table you pass does not contain the margins; they are then calculated under the hood by chi2_contingency(). – J-J-J Feb 23 '23 at 14:24
  • I see! Thank you so much! – Chiara Feb 23 '23 at 14:33
  • @Chiara You can also have a look at the chi2_contingency documentation. The tables they use in the "Examples" section do not contain margins, as margins are internally calculated by the function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html – J-J-J Feb 23 '23 at 14:34