3

I'm trying to perform Pearson Chi-square test on goodness of fit. Here is an example of fitting a Poisson distribution:

data <- rpois(200,50)
estimate <- mean(data)
freq.os<-table(data)
yfit <- dpois(as.integer(names(freq.os)), estimate)

chisq.test(x = freq.os, p = yfit)
# Error in chisq.test(x = freq.os, p = yfit) : probabilities must sum to 1.

When I evaluate sum(yfit), I get 0.999839.

So how to produce a set of probability values that add up to 1? Thank you!


EDIT

Actually I found a work around:

chisq.test(freq.os, yfit)

but I am very confused as to how chisq.test() works as it is telling me df = 429. I thought df = n - k - 1, which in this case should be 35, where k = 1 for the lambda and n = number of terms in the chi-square sum. Where am I doing wrong?

Zheyuan Li
  • 62,170
  • 17
  • 162
  • 226
Jerry W
  • 165
  • 1
  • 2
  • 10
  • Really need a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), but in general the way to get a vector / number to sum to one is to divide each term by the sum. Example, `x = c(0.1, 0.2, 0.3) ; y = x/sum(x) ; sum(y)` – user20650 Jul 08 '16 at 22:40
  • It is a vector of empirical frequencies: 4 14 41 73 179 307 516 822 1188 1547 1950 2314 2534 2607 2647 2652 2245 2026 1608 1384 1006 732 518 383 239 157 84 61 29 13 12 4 2 1 1. Ok, I'm definitely doing something wrong here, I misunderstood the use of this function and thought that p= expected frequency given the hypothesized distribution. How do I use this function given that I have a vector of expected and empirical frequencies? – Jerry W Jul 09 '16 at 00:58

1 Answers1

5

Comments above suggested you to manually rescale yfit, before passing it to chisq.test. Well actually you can let chisq.test() do this for you:

chisq.test(x = freq.os, p = yfit, rescale.p = TRUE)

Regarding your Edit

chisq.test(freq.os, yfit)

is incorrect, as it is doing independence test.

chisq.test() can perform two statistical test:

  1. goodness of fit test, using argument x and p;
  2. independence test, using argument x and y.

Please read ?chisq.test carefully. For goodness of fit test, you must use p argument in the function, as you initially did. My answer above, using rescale.p = TRUE will help you get around the error you saw.


How to do Pearson Chi-square test

You said you don't know how the test is done, then read this part.

You should use set.seed(), so that when others run your code, they get the same random numbers as you got. Here is a reproducible example:

N <- 200    ## number of samples
set.seed(0)    ## fix random seed for reproducibility
x <- rpois(N,50)    ## generate sample
lambda <- mean(x)    ## estimate mean

Now we aim to use Pearson Chi-square test to test goodness of fit. We have

Null Hypothesis: x ~ Poisson(lambda)

O <- table(x)    ## contingency table / observed frequency
n <- length(O)    ## number of categories
# 31
x0 <- as.integer(names(O))    ## unique sample values
p <- dpois(x0, lambda); p <- p / sum(p)    ## theoretical probability under Null Hypothesis

Let's first perform Chi-square test ourselves.

E <- p * N    ## expected frequency under Null Hypothesis
R <- (O - E) / sqrt(E)    ## standardised residuals
X <- sum(R * R)    ## Chi-square statistic
# 36.13962
dof <- n - 1    ## degree of freedom
# 30
p.value <- 1 - pchisq(X, df = dof)    ## p-value
# 0.2035416

The p-value is larger than 0.05, hence we can not reject Null Hypothesis. Summary plot:

z <- curve(dchisq(x, df = dof), from = 0, to = 100)
abline(v = X, lty = 3)
polygon(c(X, X, z$x[z$x > X]), c(0, dchisq(X, df = dof), z$y[z$x > X]), col = 2)

enter image description here

Then we use chisq.test()

chisq.test(O, p = p)

#Chi-squared test for given probabilities

#data:  O
#X-squared = 36.14, df = 30, p-value = 0.2035

You can see, the result is the same.

Zheyuan Li
  • 62,170
  • 17
  • 162
  • 226
  • Thank you, your suggestion worked. But I am still confused as to how `chisq.test` works, I have added the reproducible example. I tried to go through the source code, but because I am new to R and programming in general, everything looks very overwhelming – Jerry W Jul 09 '16 at 13:55
  • Thank you very much. I remember reading that dof should be n - k - 1 where k is the number of estimated parameter. Since you estimated the lambda value, shouldn't the degree of freedom by 29 instead of 30? – Jerry W Jul 11 '16 at 16:48