0

I'm essentially trying to reproduce the dhyper function in R using
!n ~ gamma(n+1). Why does this approach not work for large values of the population?

# (n choose k)
fastChoose <- function(parm_n, parm_k) {
   p1 = lgamma(parm_n+1)
   p2 = lgamma(parm_n - parm_k +1)
   p3 = lgamma(parm_k + 1)
   result = p1 - p2 - p3
   return (exp(result))
}

# P(X=k) = (K choose k) * (N-K choose K-k) / (N choose n)
myDHyper <- function(parm_k, parm_K, parm_N, parm_n){
   return (fastChoose(parm_K,parm_k) * fastChoose(parm_N-parm_K,parm_K - parm_k) /     fastChoose(parm_N,parm_n)) 
}

BIGN<-15000
whiteBallsInPop <- 50
iSampled<- 200
allPointMass<-c()
allPointMassMine <- c()
for(i in seq(0,whiteBallsInPop)){

  fromDHyper <- dhyper(i, whiteBallsInPop, BIGN-whiteBallsInPop, iSampled, log = FALSE)
  allPointMass <- c(allPointMass, fromDHyper)

  fromMyDHyper <- myDHyper(i,whiteBallsInPop,BIGN,iSampled)
  allPointMassMine <- c(allPointMassMine,fromMyDHyper )

} 

sum(allPointMass)
sum(allPointMassMine)

Keeping everything on log scale and taking log(p) doesn't seem to work either:

#log( P(X=x))
myDHyperOverflowFixed <- function(k, K, N, n){
  # P(X=k) = (K choose k) * (N-K choose K-k) / (N choose n)
  #        = K!/k!(K-k)! * (N-K)!/(K-k)!(N-K-K+k)! /  N!/n!(N-n)!

   log_p <- lgamma(K + 1) +
        lgamma(N - K + 1) +
        lgamma(n + 1) +
        lgamma(N - n + 1) - 
        lgamma(k + 1) - 
         lgamma(K - k + 1) - 
         lgamma(K - k + 1) - 
         lgamma(N - K - K + k + 1) -
         lgamma(N + 1)

 return (log_p)
}
Glen_b
  • 282,281
SOUser
  • 205
  • 1
    Your new code multiplies $\binom{K}{k}\binom{N-K}{K-k}$ by $\binom{N}{n}$ rather than dividing by it. – whuber Jul 10 '14 at 13:14
  • I don't think this is true. – SOUser Jul 10 '14 at 15:43
  • 1
    You're right; I missed some minus signs. You can make your code much easier to check, though, by structuring it suitably. For instance, it would be attractive to define a log.choose function as log.choose <- function(n,k) lgamma(n+1)-lgamma(k+1)-lgamma(n-k+1) and then invoke it thrice: myDHyperOverflowFixed <- function(k,K,N,n) log.choose(K,k)+log.choose(N-K,K-k)-log.choose(N,n). That kind of programming is more readable, far easier to test and debug, and limits the effects of sly little typographical errors that can plague mathematical code like this. – whuber Jul 10 '14 at 18:05

1 Answers1

2

A side issue: $n!$ is equal to $\Gamma(n+1)$. This is not an approximation but an exact result. The only approximation is arguably in the calculation of $\log\Gamma(n+1)$, but the calculation is quite accurate for positive arguments in the region you're looking at.

Your problem is overflow in fastChoose

With such large numbers the value in the denominator of the calculations in myDHyper is Inf so the results are always 0.

> fastChoose(15000,200)
[1] Inf

You should instead rewrite myDHyper to work entirely on the log scale until the very last step. That should work well enough at these numbers.

(At very large values of the parameters, other strategies may be needed.)

Glen_b
  • 282,281
  • Hi, I've added another function trying to keep everything on log scale 'for the entire P(X=x) but still this returns large numbers and not probs – SOUser Jul 10 '14 at 11:15
  • Where does it get exponentiated? – Glen_b Jul 10 '14 at 11:59
  • Hi,there was a copy-paste error i noted above (log should have been lgamma) but that doesn't fix the problem. I exp() when I return the log_p – SOUser Jul 10 '14 at 12:33
  • Did you try your new code on some simple cases where you can compute log_p by hand? Did you check some cases where your old code worked? – Glen_b Jul 10 '14 at 13:21
  • e.g. compare log(myDHyper(0,1,3,2)) with the equivalent in your new code. – Glen_b Jul 10 '14 at 13:35