I need help understanding the chi-squared (chi2) test of independence as implemented in scipy.stats.chi2_contingency.
Let's assume I have two samples (of different sizes) of a categorical variable with 3 possible outcomes (1, 2, 3).
The counts of the outcomes for the two samples are as follows:
Sample 1: {1: 1000, 2:1000, 3:2000}
Sample 2: {1: 2000, 2:2000, 3:4000}
They are clearly dependent and come from the same distribution. The sample sizes are fairly large, so we should be able to show that this is statistically significant.
So, I compute the chi2 statistic and p-value for the test of independence of the observed frequencies in the contingency table.
My contingency table is:
[[1000, 1000, 2000],
[2000, 2000, 4000]]
and the output of the chi2 independence test (from scipy) for this case is:
(0.0,
1.0,
2,
array([[1000., 1000., 2000.],
[2000., 2000., 4000.]]))
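For reference, here is a minimal sketch of the call that produces this output. The variable names are only illustrative, and the unpacking assumes the result comes back in the order shown above (statistic, p-value, degrees of freedom, expected frequencies).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows are the two samples, columns are the outcomes 1, 2, 3.
observed = np.array([[1000, 1000, 2000],
                     [2000, 2000, 4000]])

# Unpack in the same order as the tuple printed above.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(chi2, p_value, dof)
print(expected)
```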
According to Wikipedia:
For the test of independence, also known as the test of homogeneity, a chi-squared probability of less than or equal to 0.05 (or the chi-squared statistic being at or larger than the 0.05 critical point) is commonly interpreted by applied workers as justification for rejecting the null hypothesis that the row variable is independent of the column variable.[4] The alternative hypothesis corresponds to the variables having an association or relationship where the structure of this relationship is not specified.
So, the p-value is 1.0, which means that we don't reject the null hypothesis that the occurrence of outcomes for the two samples is independent.
Let's consider two other (slightly different) samples:
Sample 1: {1: 900, 2: 900, 3: 2100}
Sample 2: {1: 2100, 2: 2100, 3: 3900}
For these samples, the contingency table is:
[[900, 900, 2100],
[2100, 2100, 3900]]
And the chi2 independence test result is now:
(34.18803418803419,
3.7684495236693435e-08,
2,
array([[ 975., 975., 1950.],
[2025., 2025., 4050.]]))
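To check where these numbers come from, the expected counts and the statistic can be recomputed by hand. The sketch below uses only the standard library. It assumes the usual Pearson formula (expected count = row total * column total / grand total) and relies on the fact that, for exactly 2 degrees of freedom, the chi-squared tail probability has the closed form exp(-x / 2). The variable names are mine.

```python
import math

observed = [[900, 900, 2100],
            [2100, 2100, 3900]]

row_totals = [sum(row) for row in observed]        # [3900, 8100]
col_totals = [sum(col) for col in zip(*observed)]  # [3000, 3000, 6000]
total = sum(row_totals)                            # 12000

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Pearson statistic: sum of (observed - expected)^2 / expected over all cells.
chi2 = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(observed, expected)
    for obs, exp in zip(obs_row, exp_row)
)

dof = (len(observed) - 1) * (len(observed[0]) - 1)  # (2 - 1) * (3 - 1) = 2

# Closed-form tail probability; this shortcut is valid only when dof == 2.
p_value = math.exp(-chi2 / 2)

print(expected)
print(chi2, p_value, dof)
```

The statistic is built from the gaps between observed and expected counts, so it is exactly 0 when every row has the same proportions (as in the first table) and grows as the rows diverge.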
Now we have a very small p-value, which means that we reject the null hypothesis and conclude that there is a relationship between the occurrence of outcomes for the two samples.
My question is: How is this possible?
It seems I don't understand something about the chi2 independence test. To me, the first two samples are clearly from the same distribution, so their p-value for the chi2 independence test should be close to 0. The second pair of samples is similar, but its p-value should be much higher than for the first pair.
My understanding of the chi2 independence test is that the p-value should be smaller if the counts for the two samples are more similar and higher if there is more mismatch. Where am I wrong?
For more context, I want to apply the chi2 independence test to a random split into test and control groups, to check that the two groups are not statistically different with respect to a categorical variable. I generate splits into test and control groups until I find one that has a p-value < 0.05, which should prove that the two groups are similar (at the 95% confidence level). Maybe the chi2 independence test is not the right statistical tool for this?
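To make the setup concrete, here is a rough sketch of a single split-and-test step, using only the standard library. Everything in it is hypothetical: the population (12000 units with a 1:1:2 outcome mix), the 4000/8000 split sizes, the seed, and the helper name chi2_independence_dof2, which works only for a table with 2 degrees of freedom (2 groups, 3 outcomes).

```python
import math
import random
from collections import Counter

OUTCOMES = (1, 2, 3)

def chi2_independence_dof2(table):
    """Pearson chi-squared statistic and p-value for a 2x3 table (dof = 2)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = sum(
        (obs - r * c / total) ** 2 / (r * c / total)
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )
    # Closed-form tail probability, valid only for 2 degrees of freedom.
    return stat, math.exp(-stat / 2)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population with a 1:1:2 mix of the three outcomes.
population = [1] * 3000 + [2] * 3000 + [3] * 6000
rng.shuffle(population)

# Random split: the first 4000 units form the test group, the rest the control group.
test_counts = Counter(population[:4000])
control_counts = Counter(population[4000:])

table = [
    [test_counts[k] for k in OUTCOMES],
    [control_counts[k] for k in OUTCOMES],
]
stat, p_value = chi2_independence_dof2(table)
print(table, stat, p_value)
```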
