Problems in chisq.test in R

Question

I have two groups of samples, NGHC (n=14) and NHC (n=87). Each sample's CO.05 result should be 0 or 1. For example, the results can be

        0     1
 NGHC  11     3
 NHC   87     0

This is a subset of my data frame,

df <- structure(list(CLASS = c("NGHC", "NGHC", "NGHC", "NGHC", "NGHC", 
"NGHC", "NGHC", "NGHC", "NGHC", "NGHC", "NGHC", "NGHC", "NGHC", 
"NGHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", 
"NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", 
"NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", 
"NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", 
"NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", 
"NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", 
"NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", 
"NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", 
"NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC", 
"NHC", "NHC", "NHC", "NHC", "NHC", "NHC", "NHC"), CO.05 = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, 
-101L))

The cross-tabulation of the df

table(df$CLASS, df$CO.05)
        0
  NGHC 14
  NHC  87

When I try to calculate the chi-square of this data frame,

summary(table(df$CLASS, df$CO.05))

it returns

Number of cases in table: 101 
Number of factors: 2 
Test for independence of all factors:
Chisq = 2.2539e-31, df = 0, p-value = 0

It should be a 2x2 table. Shouldn't the p-value be 1?

(Can I ask an additional question here? If not, I will delete it. Since the samples of these two groups are imbalanced (14 vs 87), is chi-square the correct statistical method to compare the significance between these two groups?)

Since this question seems to be about the value of the p-value, it might be better asked at [stats.se]. It's not exactly clear to me what hypothesis you are testing if you have a 2x1 table — MrFlick, Jun 30 '22 at 13:35
You are testing the null hypothesis that the counts in the two groups are the same, NGHC == NHC. They are not even close so you reject the null hypothesis. — dcarlson, Jun 30 '22 at 13:50
Why do you think the p-value should be 1? Also the title mentions chisq.test() but you're using summary() instead. — Caspar V., Jun 30 '22 at 13:56
If you claim a p-value should be different from what you're getting you should provide an explanation. It looks exactly like I would expect and I don't know why you would assume otherwise. — Dason, Jun 30 '22 at 13:56
Sorry I was trying to create a 2 x 2 table. The output of NGHC and NHC should be 0 or 1. — pill45, Jun 30 '22 at 14:01
Please do not change the question in the comments. Edit the question to include more suitable example data and explain why you expected a different $p$-value than the one you got. — Bernhard, Jun 30 '22 at 14:03
Is this maybe a more appropriate example data? set.seed(2); df2<-data.frame(class=sample(c("NGHC","NHC"),100,T),CO.05=sample(0:1,100,T)) You can do table(df2) and summary(table(df2)) as well as chisq.test(table(df2), correct = FALSE) with it. — Bernhard, Jun 30 '22 at 14:07
My question is: will samples belonging to NGHC have more CO.05 = 1 than NHC? Since in this data NGHC did not even have any CO.05 = 1, I thought that the p-value should be close to 1. — pill45, Jun 30 '22 at 14:21
Please type ?chisq.test and read the documentation. At "Details" it states, "If x is a matrix with one row or column, or if x is a vector and y is not given, then a goodness-of-fit test is performed (x is treated as a one-dimensional contingency table). The entries of x must be non-negative integers. In this case, the hypothesis tested is whether the population probabilities equal those in p, or are all equal if p is not given." — whuber, Jun 30 '22 at 15:25
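The behaviour quoted from the documentation can be checked on the 2x1 table itself. The sketch below rebuilds the table from the counts in the question (the names `tab`, `s` and `gof` are made up for illustration) and shows that `summary()` and `chisq.test()` test different hypotheses here: `summary()` attempts a test of independence with zero degrees of freedom, while `chisq.test()` falls back to a goodness-of-fit test with equal probabilities.

```r
# Rebuild the 2x1 table from the counts shown in the question.
tab <- as.table(matrix(c(14, 87), nrow = 2,
                       dimnames = list(c("NGHC", "NHC"), "0")))

# summary() tests independence: df = (rows - 1) * (cols - 1) = 1 * 0 = 0.
s <- summary(tab)
s$parameter

# A chi-square distribution with 0 df puts all its mass at 0, so any
# statistic above 0 (here, pure rounding error) has an upper tail of 0.
pchisq(2.2539e-31, df = 0, lower.tail = FALSE)

# chisq.test() treats a one-column matrix as a one-dimensional table and
# tests whether both groups are equally likely (p = 1/2 each).
gof <- chisq.test(tab)
gof$method
gof$expected   # 50.5 in each group
gof$statistic  # large, because 14 and 87 are far from 50.5
```

Neither result says anything about whether CO.05 differs between the groups, because the table contains no second outcome to compare.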

dipetkov · Accepted Answer · 2022-06-30T20:55:22.070

First things first. You seem to be new to both statistics and R. In the long run you'll be more effective at learning both if you start with the basics. A free resource I like is Modern Statistics with R.

Now to your question. There are no problems with the chisq.test function; you have data issues and gaps in your knowledge.

You expect a 2✕2 table but get a 2✕1 table because the CO.05 column in your data frame contains only zeros. Where are the three 1s you expect? We don't know and it's up to you to investigate what happened.

# Two ways to create a contingency table from a data frame:
table(df)
#>       CO.05
#> CLASS   0
#>   NGHC 14
#>   NHC  87
xtabs(~ CLASS + CO.05, data = df)
#>       CO.05
#> CLASS   0
#>   NGHC 14
#>   NHC  87
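A quick way to make the missing 1s visible is to declare both possible outcomes as factor levels, so that the table keeps a column for a value that never occurs. This is a minimal sketch on toy data with the same shape as the subset in the question; the name `df_sub` is made up for illustration.

```r
# Toy data shaped like the subset in the question: every CO.05 value is 0.
df_sub <- data.frame(
  CLASS = rep(c("NGHC", "NHC"), times = c(14, 87)),
  CO.05 = 0
)

# Check which values are actually present before tabulating.
unique(df_sub$CO.05)

# Declare both possible outcomes so the unused level keeps its column.
df_sub$CO.05 <- factor(df_sub$CO.05, levels = c(0, 1))
tab <- table(df_sub$CLASS, df_sub$CO.05)
tab
dim(tab)  # 2 2
```

A column of zeros makes every expected count in that column zero, so the chi-square statistic is still undefined (0/0) for this table. The explicit levels only make the data problem easier to spot.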

If you know what the 2✕2 table is, you can create it by hand.

# Create a 2x2 table from a vector of 4 values.
# `byrow = FALSE` (the default) fills the matrix column by column.
table2x2 <- matrix(
  c(11, 87, 3, 0),
  nrow = 2, ncol = 2,
  byrow = FALSE
)
table2x2
#>      [,1] [,2]
#> [1,]   11    3
#> [2,]   87    0

Hard-coding the table is error-prone; it's better to construct the contingency table from data. But in this case you seem to not have the right data.
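If a table does have to be typed in by hand, naming the rows and columns and checking the margins against known totals reduces the risk of a transposed or misplaced count. A minimal sketch, assuming the group sizes stated in the question (14 and 87); the name `table2x2_named` is made up for illustration.

```r
# Same counts as above, with explicit row and column labels.
table2x2_named <- matrix(
  c(11, 87, 3, 0),
  nrow = 2,
  dimnames = list(CLASS = c("NGHC", "NHC"), CO.05 = c("0", "1"))
)
table2x2_named

# Sanity checks against totals known from the question.
rowSums(table2x2_named)  # should be 14 and 87
sum(table2x2_named)      # should be 101
stopifnot(rowSums(table2x2_named) == c(14, 87))
```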

Perform the chi-square test:
chisq.test(table2x2)
#> Warning in chisq.test(table2x2): Chi-squared approximation may be incorrect
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  table2x2
#> X-squared = 12.498, df = 1, p-value = 0.0004074

The p-value is 0.0004, not 1. I'll just point out but not explain three important details:

There is a warning that the chi-squared approximation might not hold. (This warning can come up when there are very small counts; see reference below as well as the comments by @whuber and @NuclearHoagie.)
The reported p-value is approximate, not exact. As @whuber points out, the approximation is not very close.
The test was performed with Yates' continuity correction.

These details are important but more advanced; it's better to learn the basics first. And definitely don't proceed with further analyses if there are warnings in the output.
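For readers who want to look behind the warning, the expected counts under the null hypothesis are stored in the test result. The sketch below also computes an exact p-value with `fisher.test()`. Fisher's exact test is one common choice for small 2✕2 tables; that it is the calculation behind the "over five times" remark in the comments is an assumption.

```r
table2x2 <- matrix(c(11, 87, 3, 0), nrow = 2)

# Expected counts under independence: row total * column total / n.
# suppressWarnings() is used only to keep the output short here.
res <- suppressWarnings(chisq.test(table2x2))
res$expected
min(res$expected)  # 14 * 3 / 101, well below the usual threshold of 5

# Exact p-value from the hypergeometric distribution.
exact <- fisher.test(table2x2)
exact$p.value

# Ratio of the exact to the approximate p-value (above 5 for this table).
exact$p.value / res$p.value
```

For this table the exact p-value equals choose(14, 3) / choose(101, 3), the probability that all three 1s land in the NGHC group by chance.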

It's not clear why you think the p-value should be 1; it's best to avoid hunches about p-values altogether. This and the fact that you ask how to "compare the significance between these two groups" suggest that you don't understand p-values and hypothesis testing. I suggest you learn more about both concepts before trying to draw conclusions from statistical tests. Here are three CV posts you could start with:

What is the meaning of p values and t values in statistical tests?
Find Statistical Significance of Binary Data
Warning in R - Chi-squared approximation may be incorrect

PS. The title of the second question is unfortunate because data is neither significant nor insignificant. It's whatever we observe it to be.

+1. But be a little careful to heed the warning in the output: a correct p-value is over five times the reported one. — whuber, Jun 30 '22 at 20:17
@whuber Thank you for the suggestion. I've updated the answer to point out the approximation and just as important to not ignore the warning. — dipetkov, Jun 30 '22 at 20:36
The reason for the warning is not that there are two small counts, but that there are two small expected counts under the null hypothesis. For instance, with the table $$\pmatrix{20 & 0\\ 3 & 25}$$ you won't get a warning, despite having the same two small counts of $3$ and $0.$ (Nor is any warning needed.) — whuber, Jun 30 '22 at 20:38
One minor note, the chi-squared approximation fails when the expected, rather than the observed counts are small. Having cells with observed values of 0 and 3 isn't inherently a problem, but the fact that they're in the same column makes the expected count for the "1" group too small. If the 0 and 3 values were along the diagonal in different rows and different columns, chi-squared would still be a good approximation. You should usually double check with small observed counts, but they don't always imply small expected counts. — Nuclear Hoagie, Jun 30 '22 at 20:47
@whuber NuclearHoagie Of course you are correct but it also seems like a complex point to explain given where this question started. I compromised by saying that the warning can come up when there are small counts and linking to a relevant CV post. — dipetkov, Jun 30 '22 at 20:53
It's no more complicated to get the reason right than to get it wrong! — whuber, Jun 30 '22 at 21:21
