I have a 3-sided (fair) die with sides x1, x2, x3. I roll the die 3000 times and tally the results: x1 * 990, x2 * 1050, x3 * 960. A quick chi-square test reveals that the probability of getting this result is ~12%:
chisq.test(x = c(990,1050,960), p = c(1,1,1), rescale.p = TRUE)$p.value
# 0.122456428252982
Now, I want to reproduce this p-value via simulation. Here are my steps:
- Create simulation:
library(tidyverse)
set.seed(123)
x <- replicate(10000, sample(1:3, size=3000, replace = TRUE))
df <- data.frame(x1=colSums(x==1, na.rm=TRUE), x2=colSums(x==2, na.rm=TRUE), x3=colSums(x==3, na.rm=TRUE))
df %>% head(3)
x1 x2 x3
<dbl> <dbl> <dbl>
#1 979 978 1043
#2 974 1042 984
#3 1017 966 1017
- Find the probability that the tally for x1 is as extreme or more as 990:
p_990 = df %>% summarise(count=sum(abs(1000-x1)>=10) / 10000) %>% pull()
p_990
0.711
- Find the probability that the tally for x2 is as extreme or more as 1050:
p_1050 = df %>% summarise(count=sum(abs(1000-x2)>=50)/ 10000) %>% pull()
p_1050
0.0596
- Take the product of both probabilities:
p_990 * p_1050
0.0423756
This is not the ~12% I was expecting to obtain - where am I going wrong?
Ryou can simulate the null distribution just by specifying thesimulate.p.valueargument ofchisq.test. The basic issue with your example is that it doesn't even compute the chi-squared statistic, so perhaps looking that up would be the place to start. – whuber May 02 '22 at 20:39n=90;css=replicate(10000,chisq.test(table(sample(3,n,replace=TRUE)))$statistic);plot(ecdf(css),do.points=FALSE,verticals=TRUE);f=function(x) pchisq(x,2);curve(f,0,15,col=2,add=TRUE)– Glen_b May 03 '22 at 04:47n=12,n=30andn-300for comparison. The warnings atn=12will be because the expected counts are on the low side – Glen_b May 03 '22 at 04:53mean(css>qchisq(.95,2)). For a 5% test it's pretty poor at n=15 (no warnings though!), not too bad at n=30. Calculate the standard error of the estimate along with it though. – Glen_b May 03 '22 at 04:58