
I have some data in which some cells contain fewer than 5 elements, which would usually lead to Fisher's exact test being used instead of the chi-squared test. However, I'm unable to run Fisher's test because there isn't enough memory for it.

chi-squared output

> chisq.test(df$x, df$y)

  Pearson's Chi-squared test

data:  df$x and df$y
X-squared = 21.191, df = 7, p-value = 0.003498

Warning message:
In chisq.test(df$x, df$y) : Chi-squared approximation may be incorrect

Fisher's output

> fisher.test(df$x, df$y)
Error in fisher.test(df$x, df$y) : 
  FEXACT error 7(location). LDSTP=18540 is too small for this problem,
  (pastp=50.654, ipn_0:=ipoin[itp=57]=2276, stp[ipn_0]=49.9046).
Increase workspace or consider using 'simulate.p.value=TRUE'

The chi-squared test reports $p<0.01$. Is there any way to tell how incorrect this could be? That is, is there an upper bound (or something similar) on the amount of error that could occur given this data?

Here's a table of the data

    A   B   C   D   E   F   G   H
0  68   8  66  21  10  16   2   1
1 239   7 192  35  18  21   1   1

The expected values (rounded) are:

    A   B   C   D   E   F   G   H
0  83   4  70  15   8  10   1   1
1 224  11 188  41  20  27   2   1

Five of the expected values above are less than 5.

The absolute difference between the observed and expected counts is:

  A  B  C  D  E  F  G  H
0 15  4  4  6  2  6  1  0
1 15  4  4  6  2  6  1  0
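
The tables above can also be turned into per-cell contributions $(O-E)^2/E$, which shows which cells drive the statistic. This is only a sketch: the counts are re-typed by hand from the observed table, and `obs`, `expd` and `contrib` are illustrative names that do not appear in the original code.

```r
# Observed counts copied from the table above (rows: y = 0, 1; columns: x = A..H).
obs <- matrix(c( 68, 8,  66, 21, 10, 16, 2, 1,
                239, 7, 192, 35, 18, 21, 1, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(y = c("0", "1"), x = LETTERS[1:8]))

# Expected counts under independence: row total * column total / grand total.
expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Number of cells with an expected count below 5.
sum(expd < 5)

# Contribution of each cell to the Pearson statistic.
contrib <- (obs - expd)^2 / expd
round(contrib, 2)

# The contributions add up to the X-squared value reported by chisq.test.
sum(contrib)
```

A cell with a small expected count and a large contribution would be the kind of cell to look at more closely.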

So, to summarise: I want to test for association in the given data, but multiple cells have expected values less than 5. Fisher's test isn't an option because of the computation involved, and I'm not sure what's typically done in this situation.

code

Sorry about the dput() mess

df <- structure(list(x = structure(c(1L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 1L, 1L, 3L, 1L, 1L, 3L, 6L, 1L, 1L, 1L, 3L, 1L, 1L, 4L, 3L, 3L, 4L, 1L, 1L, 5L, 5L, 4L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 1L, 1L, 1L, 4L, 1L, 1L, 3L, 1L, 3L, 1L, 1L, 1L, 6L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 8L, 3L, 3L, 4L, 3L, 1L, 1L, 3L, 1L, 3L, 1L, 1L, 1L, 3L, 1L, 4L, 3L, 4L, 3L, 4L, 1L, 3L, 1L, 1L, 4L, 1L, 1L, 3L, 2L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 4L, 1L, 3L, 6L, 3L, 3L, 1L, 1L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 4L, 3L, 1L, 1L, 1L, 6L, 4L, 3L, 1L, 3L, 3L, 3L, 1L, 1L, 1L, 3L, 1L, 3L, 1L, 1L, 5L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 6L, 3L, 1L, 1L, 4L, 3L, 4L, 4L, 3L, 3L, 1L, 1L, 6L, 1L, 1L, 3L, 8L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 5L, 5L, 1L, 5L, 1L, 1L, 1L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 3L, 6L, 3L, 1L, 1L, 3L, 1L, 4L, 1L, 1L, 1L, 1L, 6L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 4L, 4L, 1L, 3L, 1L, 3L, 3L, 4L, 1L, 3L, 5L, 3L, 1L, 3L, 3L, 5L, 1L, 3L, 1L, 1L, 3L, 5L, 4L, 1L, 1L, 1L, 1L, 6L, 3L, 3L, 6L, 1L, 6L, 1L, 1L, 1L, 1L, 1L, 5L, 1L, 1L, 1L, 5L, 3L, 1L, 1L, 1L, 3L, 3L, 3L, 7L, 1L, 4L, 5L, 3L, 3L, 1L, 3L, 1L, 1L, 3L, 1L, 1L, 4L, 1L, 1L, 4L, 1L, 3L, 6L, 3L, 3L, 2L, 1L, 3L, 1L, 2L, 6L, 3L, 4L, 3L, 4L, 1L, 6L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 6L, 3L, 3L, 1L, 1L, 3L, 4L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 1L, 1L, 1L, 6L, 3L, 1L, 1L, 3L, 1L, 1L, 6L, 3L, 3L, 3L, 1L, 3L, 1L, 4L, 6L, 1L, 3L, 1L, 3L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 6L, 1L, 3L, 1L, 3L, 3L, 4L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 6L, 6L, 4L, 4L, 1L, 3L, 2L, 5L, 3L, 2L, 2L, 2L, 3L, 4L, 3L, 6L, 3L, 3L, 3L, 4L, 1L, 3L, 1L, 6L, 4L, 1L, 3L, 3L, 3L, 1L, 5L, 2L, 6L, 3L, 1L, 1L, 3L, 3L, 5L, 3L, 1L, 2L, 1L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 6L, 4L, 3L, 3L, 1L, 1L, 1L, 6L, 1L, 3L, 6L, 1L, 1L, 6L, 5L, 1L, 1L, 1L, 1L, 3L, 3L, 
4L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 4L, 1L, 4L, 1L, 3L, 3L, 1L, 6L, 4L, 4L, 1L, 3L, 4L, 3L, 1L, 3L, 5L, 3L, 1L, 3L, 5L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 1L, 1L, 3L, 3L, 4L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 1L, 3L, 3L, 3L, 1L, 1L, 4L, 1L, 1L, 6L, 1L, 2L, 6L, NA, 6L, 7L, 1L, 3L, 6L, 1L, 3L, 3L, 7L, 4L, 1L, 4L, 1L, 1L, 2L, 3L, 4L, 3L, 1L, 4L, 1L, 3L, 1L, 4L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 1L, 1L, 5L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 5L, 3L, 1L, 1L, 5L, 3L, 3L, 3L, 1L, 3L, 1L, 1L, 1L, 3L, 1L, 3L, 3L, 5L, 3L, 3L, 1L, 3L, 3L, 3L, 1L, 1L, 1L, 5L, 1L, 1L, 1L, 4L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 1L, 1L, 3L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 4L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 6L, 1L, 1L, 3L, 3L, 4L, 4L, 1L, 1L, 1L, 5L, 3L, 3L, 3L, 4L, 1L, 2L, 3L, 2L, 6L, 2L, 3L, 5L, 1L, 3L, 3L, 3L, 3L, 1L, 4L, 3L, 1L, 4L, 1L, 3L, 1L, 5L), .Label = c("A", "B", "C", "D", "E", "F", "G", "H"), class = "factor"), y = c(1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 
1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L)), row.names = c(NA, -707L), class = "data.frame") 

# this gives a warning that the chi-squared approximation may be incorrect
chisq.test(df$x, df$y)
# trying Fisher's exact test (fails with the FEXACT error shown above)
fisher.test(df$x, df$y)

expected values code

This was used to get the table of expected values

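# observed counts: rows are the levels of y, columns the levels of x (NA values are dropped)
m <- as.matrix(table(df$y, df$x))
m2 <- m * 0  # same shape as m, filled in with the expected counts below
for (row in seq_len(nrow(m))) {
  for (col in seq_len(ncol(m))) {
    # expected count under independence: row total * column total / grand total
    expected <- (sum(m[row, ]) * sum(m[, col])) / sum(m)
    m2[row, col] <- expected
  }
}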
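
The same expected counts are also returned by `chisq.test` itself, in the `expected` component of its result, so the loop is not strictly needed. A minimal sketch, with the observed counts re-entered by hand so that it runs on its own (`obs`, `fit` and `manual` are illustrative names):

```r
# Observed counts copied from the table in the question (rows: y = 0, 1).
obs <- matrix(c( 68, 8,  66, 21, 10, 16, 2, 1,
                239, 7, 192, 35, 18, 21, 1, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(y = c("0", "1"), x = LETTERS[1:8]))

# suppressWarnings() only hides the "approximation may be incorrect" warning.
fit <- suppressWarnings(chisq.test(obs))
round(fit$expected, 2)

# Check against the manual formula: row total * column total / grand total.
manual <- outer(rowSums(obs), colSums(obs)) / sum(obs)
all.equal(unname(fit$expected), unname(manual))
```

With the data frame from the question, `chisq.test(df$x, df$y)$expected` should give the transpose of `m2`, since that call puts `x` in the rows.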
baxx
  • The output from fisher.test gave you some hints as to what to do next. What happened when you tried them? Note also that chisq.test only said may be incorrect. – mdewey Mar 02 '19 at 16:08
  • @mdewey thanks, using simulate returns a test without error, which is significant. Re your point about it only saying it may be incorrect, I'm not sure how to establish whether or not that's the case, which I tried to highlight in the question about upper bounds to that. – baxx Mar 02 '19 at 16:13
  • I would suggest that if the Fisher test gives the same substantive conclusion, it implies the $\chi^2$ test was OK. You could always compute $(O-E)^2/E$ for each cell to spot any very large contributions to the total. – mdewey Mar 02 '19 at 16:37
  • This Q&A may also be helpful: https://stats.stackexchange.com/questions/93212/what-to-do-when-i-have-expected-count-5-warning-for-a-chi-squared-test?rq=1 – mdewey Mar 02 '19 at 16:38
  • The chi-squared test will be fine for these data, because the vast majority of cells have expectations of $5$ or more. – whuber Mar 02 '19 at 17:31
  • @whuber thanks, what's the threshold for "vast majority"? I mean in general. If there's a 2x2 table with one cell <5 then this would be 0.25 of the cells <5, I'm not sure if there's a rule of thumb for this though. – baxx Mar 02 '19 at 17:35
  • The threshold is that the chi-squared distribution should be a good approximation to the distribution of the chi-squared statistic under the null hypothesis. The answer depends on how good is "good," but there are general patterns. One is that as the number of cells increases, it starts to matter very little what their expectations might be. Generally, when testing at a 5% or 1% level, you're fine even with 20% low-expectation cells. When in doubt, run a small simulation with your data. R has that built right into its chi-squared test function. – whuber Mar 02 '19 at 17:44
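
The simulation suggested in the comments might look roughly like the sketch below. It assumes the observed counts are re-entered by hand as `obs` (an illustrative name), and the seed and the number of replicates `B = 10000` are arbitrary choices. Comparing the simulated p-values with the asymptotic one gives an empirical sense of how far off the chi-squared approximation is for this table, though not a formal upper bound.

```r
# Observed counts copied from the table in the question (rows: y = 0, 1).
obs <- matrix(c( 68, 8,  66, 21, 10, 16, 2, 1,
                239, 7, 192, 35, 18, 21, 1, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(y = c("0", "1"), x = LETTERS[1:8]))

set.seed(1)  # arbitrary seed, only so that the run is repeatable

# Asymptotic chi-squared p-value (the one that comes with the warning).
p_asym <- suppressWarnings(chisq.test(obs))$p.value

# Monte Carlo p-value: the statistic is recomputed on tables simulated
# with the observed row and column totals held fixed.
p_sim <- chisq.test(obs, simulate.p.value = TRUE, B = 10000)$p.value

# Fisher's test with a simulated p-value, as the FEXACT error message suggests.
p_fisher <- fisher.test(obs, simulate.p.value = TRUE, B = 10000)$p.value

c(asymptotic = p_asym, simulated = p_sim, fisher_simulated = p_fisher)
```

If the three values lead to the same conclusion at the chosen level, the warning from `chisq.test` is probably not a practical concern for this table.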
