P-value of chi-square test of independence

Question

I need help understanding the chi2 independence test as implemented in scipy.stats.chi2_contingency.

Let's assume I have two samples (of different sizes) of a categorical variable with 3 possible outcomes (1, 2, 3):

Counts of the outcomes for the two samples are as follows:

Sample 1: {1: 1000, 2:1000, 3:2000}
Sample 2: {1: 2000, 2:2000, 3:4000}

They are clearly dependent and come from the same distribution. Sample sizes are fairly large, so we should be able to show statistical significance of this.

So, I compute the chi2 statistic and p-value for the test of independence of the observed frequencies in the contingency table.

My contingency table is:

[[1000, 1000, 2000],
 [2000, 2000, 4000]]

and the output of the chi2 independence test (from scipy) for this case is:

(0.0,
 1.0,
 2,
 array([[1000., 1000., 2000.],
        [2000., 2000., 4000.]]))
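For reference, the result above can be reproduced with a call along these lines. This is only a minimal sketch: the variable names are mine, and it assumes a SciPy version in which the result of chi2_contingency unpacks into four values (statistic, p-value, degrees of freedom, expected counts).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows are the two samples, columns are the outcomes 1, 2, 3.
table1 = np.array([[1000, 1000, 2000],
                   [2000, 2000, 4000]])

# Unpack statistic, p-value, degrees of freedom and expected counts.
stat1, p1, dof1, expected1 = chi2_contingency(table1)

print(stat1, p1, dof1)
print(expected1)
```

Here the expected counts equal the observed counts in every cell, so the statistic is exactly zero.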

According to Wikipedia:

For the test of independence, also known as the test of homogeneity, a chi-squared probability of less than or equal to 0.05 (or the chi-squared statistic being at or larger than the 0.05 critical point) is commonly interpreted by applied workers as justification for rejecting the null hypothesis that the row variable is independent of the column variable.[4] The alternative hypothesis corresponds to the variables having an association or relationship where the structure of this relationship is not specified.

So, the p-value is 1.0, which means that we don't reject the null hypothesis that the occurrence of outcomes for the two samples is independent.

Let's consider two other (slightly different) samples:

Sample 1: {1: 900, 2: 900, 3: 2100}
Sample 2: {1: 2100, 2: 2100, 3: 3900}

For these samples, contingency table is:

[[900, 900, 2100],
 [2100, 2100, 3900]]

And the chi2 independence test result is now:

(34.18803418803419,
 3.7684495236693435e-08,
 2,
 array([[ 975.,  975., 1950.],
        [2025., 2025., 4050.]]))
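To make it easier to see where these numbers come from, here is a sketch that rebuilds the expected counts from the row and column totals and then computes the statistic and p-value by hand. The variable names are mine; the formulas are the usual Pearson ones (expected count = row total × column total / grand total).

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

table2 = np.array([[900, 900, 2100],
                   [2100, 2100, 3900]])

# Expected counts under independence, built from the margins.
row_totals = table2.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = table2.sum(axis=0, keepdims=True)   # shape (1, 3)
expected2 = row_totals * col_totals / table2.sum()

# Pearson statistic: sum over cells of (observed - expected)^2 / expected.
stat2 = ((table2 - expected2) ** 2 / expected2).sum()
dof2 = (table2.shape[0] - 1) * (table2.shape[1] - 1)

# p-value: upper tail area of the chi-squared distribution.
p2 = chi2.sf(stat2, dof2)

# Cross-check against SciPy's own computation.
scipy_stat, scipy_p, scipy_dof, scipy_expected = chi2_contingency(table2)
print(stat2, p2, dof2)
print(expected2)
```

The hand computation should agree with the scipy output shown above.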

Now we have a very small p-value, which means that we reject the null hypothesis and conclude that there is a relationship between the occurrence of outcomes for the two samples.

My question is: How is this possible?

It seems I don't understand something about the chi2 independence test. To me, the first two samples are clearly from the same distribution, so their p-value for the chi2 independence test should be close to 0. The second pair of samples are similar, but their p-value should be much higher than for the first pair.

My understanding of the chi2 independence test is that the p-value should be smaller if the counts for the two samples are more similar and higher if there is more mismatch. Where am I wrong?

For more context, I want to apply chi2 independence test in a random split into test and control groups to test if the two groups are not statistically different with respect to a categorical variable. I generate splits into test and control groups until I find one that has p-value < 0.05 which should prove that the two groups are similar (with 95% significance level). Maybe chi2 independence test is not the right statistical tool for this?

How is what possible, exactly? Your understanding, by the way, is the opposite of what it should be: small p-values indicate lack of independence. Maybe the answers at https://stats.stackexchange.com/questions/31 will help you out here. — whuber, Apr 15 '22 at 15:30
@whuber Does it mean that if I perform a split into test and control groups and then compute chi2 independence test p-value for a contingency table, I should accept the split (with 95% significance level) as having no significant difference in terms of frequencies of tested categorical variable if p-value is larger than 0.95? — kostek, Apr 15 '22 at 15:56
You really need to read the answers in that p-value thread. Most people would accept the hypothesis (of independence) when p exceeds 0.05. But applying the chi-squared test in your application is merely a test of the random number generator; it wouldn't be advisable to use it to determine how to split your dataset. — whuber, Apr 15 '22 at 16:04

score 4 · Accepted Answer · answered Apr 15 '22 at 16:19

You seem to be confused about the meaning of independence. Under your first example table, where the second sample merely doubles the counts of the first, you write "They are clearly dependent and come from the same distribution." This statement is only half right: the samples are drawn from the same distribution, but they are drawn independently of each other. The choice of sample has no bearing on the observed distribution, which is the same in both samples. The distribution does not depend on which sample you observe, so the Sample variable is independent of the Outcome variable. The chi-squared p-value of 1 is consistent with this: the test fails to reject the null hypothesis that the row and column variables are independent. In this case, Sample and Outcome are independent, and you will not observe a different distribution of Outcomes no matter which Sample you pick.
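One way to see this is to convert each row of the table into proportions. The sketch below does that for both tables from the question; the helper name row_proportions is made up for illustration.

```python
import numpy as np

table1 = np.array([[1000, 1000, 2000],
                   [2000, 2000, 4000]])
table2 = np.array([[900, 900, 2100],
                   [2100, 2100, 3900]])


def row_proportions(table):
    """Divide each row by its total, giving the outcome distribution per sample."""
    table = np.asarray(table, dtype=float)
    return table / table.sum(axis=1, keepdims=True)


props1 = row_proportions(table1)  # both rows identical: Outcome does not depend on Sample
props2 = row_proportions(table2)  # rows differ: Outcome distribution varies with Sample

print(props1)
print(props2)
```

In the first table both rows give the same proportions, which is exactly what independence of Sample and Outcome looks like. In the second table the rows differ.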

score 0 · Answer 2 · answered Apr 15 '22 at 16:07

For Pearson's Chi squared test of independence, the null hypothesis is independence. A small p-value would give you evidence to reject this null hypothesis. Traditionally, p-values smaller than 0.05 have been treated as providing sufficient evidence to reject the null. However, a large p-value doesn't give you evidence to accept the null. That's simply not how hypothesis testing works. As an illustration, you could always ensure that you get a large p-value just by taking a very small sample.
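As a rough illustration of the small-sample point, the sketch below takes the second table from the question and divides every count by 100. The scaled-down table is my own construction, not data from the question. The row proportions are unchanged, but the p-value becomes large.

```python
from scipy.stats import chi2_contingency

# Second table from the question, and a hypothetical version with 1/100 of the counts.
large = [[900, 900, 2100],
         [2100, 2100, 3900]]
small = [[9, 9, 21],
         [21, 21, 39]]

stat_large, p_large, _, _ = chi2_contingency(large)
stat_small, p_small, _, _ = chi2_contingency(small)

# Same proportions in both tables; only the sample size differs.
print(stat_large, p_large)
print(stat_small, p_small)
```

With identical proportions the statistic scales in proportion to the total count, so the smaller table gives far weaker evidence against the null.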

With regard to your results, this is precisely what one should expect from statistical tests. For large samples like these, tests are relatively powerful, so even a slight deviation from the expected cell counts will be detected. In your first example, the data match the expected cell counts exactly. The two samples in your second example are different enough to be incompatible with the null hypothesis of independence, i.e. if you actually sampled from a multinomial distribution with these expected values (given by the margins), it is very unlikely that you would end up with data where the observed values are this different.

In terms of your use case, as long as the p-value is relatively large, you lack evidence to reject the hypothesis of independence.
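To get a feel for what the test does in that use case, here is a small simulation sketch. The population counts, split size, number of splits and seed are all assumptions chosen for illustration. Every split is random, so the null hypothesis holds by construction, and yet roughly 5% of the splits should fall below p = 0.05 purely by chance.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

# Hypothetical population: a categorical variable with three outcomes.
labels = np.repeat([0, 1, 2], [3000, 3000, 6000])
n_test = 4000      # size of the test group; the rest is the control group
n_splits = 1000    # number of random splits to try

p_values = np.empty(n_splits)
for i in range(n_splits):
    shuffled = rng.permutation(labels)
    test_counts = np.bincount(shuffled[:n_test], minlength=3)
    control_counts = np.bincount(shuffled[n_test:], minlength=3)
    _, p, _, _ = chi2_contingency([test_counts, control_counts])
    p_values[i] = p

# Fraction of purely random splits that a 5% test would flag as "different".
frac_rejected = np.mean(p_values < 0.05)
print(frac_rejected)
```

So a split with p < 0.05 is not evidence that the groups are similar; it is one of the unlucky random splits that happens to look unbalanced.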

BruceET · Answer 3 · 2022-04-15T17:05:38.560

Maybe it will help to look at some data generated in a known way (using R). Suppose categories I, II, III occur in the proportion 1:2:4 before and after some event that one supposes might have changed the relationship between them.

set.seed(2022)
bef = sample(1:3, 2000, rep=T, p=c(1,2,4))
aft = sample(1:3, 3500, rep=T, p=c(1,2,4))
B = tabulate(bef);  B
[1]  284  586 1130
A = tabulate(aft);  A
[1]  535  973 1992
TAB = rbind(B,A);  TAB
  [,1] [,2] [,3]
B  284  586 1130
A  535  973 1992
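For readers following along in Python, a rough analogue of the R code above might look like this. It is only a sketch: NumPy's random generator differs from R's, so the counts will not match the ones shown here, and NumPy needs the probabilities normalized to sum to 1 (R's sample rescales the weights itself).

```python
import numpy as np

rng = np.random.default_rng(2022)

# Categories 1, 2, 3 in the proportion 1:2:4.
probs = np.array([1, 2, 4]) / 7

bef = rng.choice([1, 2, 3], size=2000, p=probs)
aft = rng.choice([1, 2, 3], size=3500, p=probs)

# Count occurrences of each category (drop the unused bin for 0).
B = np.bincount(bef, minlength=4)[1:]
A = np.bincount(aft, minlength=4)[1:]

TAB = np.vstack([B, A])
print(TAB)
```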

The null hypothesis is homogeneity. That is, rows A and B have the same underlying probabilities for the three categories.

If the agreement is good, the chi-squared statistic will be small, the P-value will be large, and we will not reject the null hypothesis.

Sometimes the chi-squared statistic is called a 'goodness-of-fit' statistic. That can be misleading. The better the fit the smaller the chi-squared statistic. (Maybe it should be called a 'badness-of-fit' statistic: the larger the statistic, the worse the fit.)

In R, here is Pearson's chi-squared test:

chisq.test(TAB)
    Pearson's Chi-squared test


data:  TAB
X-squared = 2.0562, df = 2, p-value = 0.3577
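The same test can be run with the scipy function from the question. This is a minimal sketch that uses the counts from TAB as printed above; the result should agree with the R output up to rounding.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the table TAB above (row B, then row A).
TAB = np.array([[284, 586, 1130],
                [535, 973, 1992]])

stat, p_value, dof, expected = chi2_contingency(TAB)

print(round(stat, 4), dof, round(p_value, 4))
print(expected)
```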

The expected counts are relatively near the observed counts in the matrix TAB.

chisq.test(TAB)$exp
      [,1]     [,2]     [,3]
B 297.8182 566.9091 1135.273
A 521.1818 992.0909 1986.727

Notice that expected count 297.82 is not far from observed count 284, and so on for the remaining five cells of the table.

The first of the six terms making up the chi-squared statistic is $\frac{(284-297.8182)^2}{297.8182} = 0.6411383.$

Added together these six 'components of the statistic' sum to $2.0562.$
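The six components can be computed directly, as in the sketch below. It uses the observed counts from TAB and rebuilds the expected counts from the margins; the variable names are mine.

```python
import numpy as np

observed = np.array([[284, 586, 1130],
                     [535, 973, 1992]], dtype=float)

# Expected counts: (row total * column total) / grand total.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# One component per cell; their sum is the chi-squared statistic.
components = (observed - expected) ** 2 / expected
total = components.sum()

print(components)
print(total)
```

The top-left component matches the first term worked out above (up to rounding of the expected count), and the total matches the statistic reported by chisq.test.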

The degrees of freedom are $(r-1)(c-1) = (2-1)(3-1)=2,$ as noted in the output above. If the null hypothesis is true, we expect the chi-squared statistic to be near $2,$ because the mean of a chi-squared distribution equals its degrees of freedom. Here, it is barely larger than $2.$

Below is a plot of the chi-squared density curve for 2 degrees of freedom, in which the vertical line shows the observed chi-squared statistic $2.0562.$ The area under the curve to the right of the vertical line is the P-value $0.3577.$ Because the P-value exceeds 5%, we do not reject the null hypothesis at the 5% level. That is, we have found no evidence that the two rows of TAB follow different probability models.
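The tail area can also be checked numerically. The following sketch uses scipy's chi-squared distribution with the observed statistic typed in from the output above; the 5% critical value is included only for comparison.

```python
from scipy.stats import chi2

df = 2
observed_stat = 2.0562  # statistic reported by chisq.test above

# P-value: area under the CHISQ(2) density to the right of the statistic.
p_value = chi2.sf(observed_stat, df)

# Value the statistic would have to exceed to reject at the 5% level.
critical_5pct = chi2.ppf(0.95, df)

# Mean of the distribution, i.e. the "typical" statistic under the null.
mean_under_null = chi2.mean(df)

print(p_value, critical_5pct, mean_under_null)
```

The observed statistic is well below the 5% critical value, which is another way of saying that the P-value exceeds 5%.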

R code for figure:

curve(dchisq(x,2), 0, 10, lwd=2, col="blue",
  ylab="Density", xlab="chi-sq stat",
  main = "Density of CHISQ(df=2)")
 abline(v=0, col="green2")
 abline(h=0, col="green2")
 abline(v=2.0562, lwd=2)
