To make the question easier to follow, I will use a reproducible example.
I have count data: the number of connections that different groups share with one particular group. There is an upper bound of 7224 possible connections, so the number of trials is FIXED at 7224.
counts = c(140, 267, 102, 261, 112, 168, 73, 163, 84, 206, 158, 92, 142, 178, 176, 133, 174, 140, 51, 318, 242, 59, 84, 123, 157, 239, 142, 65, 110, 96, 118, 171, 104, 121,
           134, 123, 100, 84, 233, 244, 192, 44, 105, 211, 200, 220, 103, 197, 97, 75, 176, 201, 147, 171, 164, 147, 75, 230, 184, 135)
I check for overdispersion using Binomial or Poisson distributions:
dat = data.frame(quad = 1:60, group_size = 7224, count = counts)
binom.tas = glm(count/group_size ~ 1,
                family = binomial,
                weights = group_size,
                data = dat)
The summary for the binomial gives us:
> summary(binom.tas)
Call:
glm(formula = count/group_size ~ 1, family = binomial, weights = group_size, data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-10.2331 -4.0805 -0.6127 2.9193 12.1415
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.85798 0.01067 -361.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1466.8 on 59 degrees of freedom
Residual deviance: 1466.8 on 59 degrees of freedom
AIC: 1873.1
Number of Fisher Scoring iterations: 4
As I understood from reading other questions (e.g. here), if the model fits the data well, the ratio of the residual deviance to its degrees of freedom should be about one. In my case 1466.8/59 ≈ 24.9, which is far greater than 1, so we can consider the data overdispersed.
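The same ratio can be computed directly from the fitted model object:
# deviance/df ratio; values far above 1 indicate overdispersion
binom.tas$deviance / binom.tas$df.residual
# [1] 24.86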
I also tried the same check with a Poisson model, using dispersiontest() from the AER package:
library(AER)
rd <- glm(counts ~ 1, family = poisson)
> dispersiontest(rd, trafo = 1)
Overdispersion test
data: rd
z = 5.6057, p-value = 1.037e-08
alternative hypothesis: true alpha is greater than 0
sample estimates:
alpha
22.8386
Null and alternative hypotheses:
$$H_0: \theta = p \qquad H_a: \theta > p.$$
The test statistic we will use is based on the binomial distribution; i.e., if $X$ is the number of connections observed in $n$ trials, then
$$X \mid H_0 \sim \mathrm{Binomial}(n, \theta = p).$$
I do not know $p$, but I estimate it from many random samples (here I have only 60, but in practice I use 1000), using the fact that
$$E(X) = np \quad\Longrightarrow\quad \hat{p} = \bar{X}/n.$$
This moment estimate and the glm intercept give the same estimated $p$.
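As a quick check, both routes give $\hat{p} \approx 0.0207$ here:
mean(counts) / 7224           # moment estimate of p
plogis(coef(binom.tas))       # glm intercept back-transformed from the logit scale
binom.tas$fitted.values[1]    # the same value, as a fitted probability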
Then, if we observe $X = 300$ connections in $n = 7224$ trials, the p-value corresponding to this test statistic is straightforward to calculate:
probability = sum(dbinom(300:7224, 7224, binom.tas$fitted.values[1]))
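Equivalently, and avoiding the long sum, the upper tail can be taken directly from the binomial CDF:
# P(X >= 300) = P(X > 299)
pbinom(299, 7224, binom.tas$fitted.values[1], lower.tail = FALSE)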
However, I know that the binomial distribution does not really fit my data, since it cannot accommodate the overdispersion; the same holds for the Poisson. I cannot use the negative binomial because my counts have an upper bound. Therefore, the best option I could find was the beta-binomial distribution.
The problem for me arises here. As far as I understand, the Poisson and binomial models are frequentist approaches. In the binomial case we assume the same success probability $p$ for all the Bernoulli trials. With the beta-binomial distribution, on the other hand, the success probability at each trial is not fixed but random, following a beta distribution.
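Concretely, integrating the binomial likelihood over the beta density gives the beta-binomial PMF on which the log-likelihood below is based:
$$P(X = r \mid n, \alpha, \beta) = \binom{n}{r}\,\frac{B(r+\alpha,\; n-r+\beta)}{B(\alpha,\beta)},$$
where $B(\cdot,\cdot)$ is the beta function, and the mean is $E(X) = n\alpha/(\alpha+\beta)$.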
To obtain the maximum-likelihood estimates of the beta-binomial parameters, I did the following (based on this):
# To obtain the estimates for alpha and beta, the
# log-likelihood function needs to be defined.
lbetabin <- function(data, inits) {
  n <- data[, 1]      # group_size (number of trials)
  r <- data[, 2]      # incidence counts
  alpha <- inits[1]   # current value of alpha
  beta <- inits[2]    # current value of beta
  # Because optim() minimizes, the NEGATIVE log-likelihood is returned.
  # The binomial coefficient is dropped, as it does not depend
  # on the parameters of interest.
  sum(-lgamma(alpha + r) - lgamma(beta + n - r) + lgamma(alpha + beta + n) +
        lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta))
}
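A quick smoke test with arbitrary parameter values (my own check, not from the linked source) confirms the function returns a finite value for optim to minimize:
lbetabin(dat[, 2:3], c(1, 50))   # finite, and larger than the optimum found below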
I set the starting values for the beta parameters (these are just initial values for the optimizer, not a prior in the Bayesian sense) as:
# For alpha, use the estimated p from the binomial model,
# and for beta, use 1 - p
inits = c(binom.tas$fitted.values[1], 1 - binom.tas$fitted.values[1])
optim.tas = optim(inits, lbetabin,
                  hessian = TRUE, data = dat[, 2:3])
> optim.tas
$`par`
1 1
6.096481 288.555568
$value
[1] 43019.25
$counts
function gradient
77 NA
$convergence
[1] 0
$message
NULL
$hessian
1 1
1 10.0387151 -0.195681423
1 -0.1956814 0.004256435
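To sanity-check the fit, the estimates can be converted to the implied mean success probability and overdispersion, and approximate standard errors can be taken from the Hessian (again my own check):
a <- optim.tas$par[1]; b <- optim.tas$par[2]
a / (a + b)                            # implied mean probability, ~0.0207, matches p-hat
1 / (a + b + 1)                        # implied intra-class correlation (overdispersion)
sqrt(diag(solve(optim.tas$hessian)))   # approximate standard errors of alpha and beta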
Once I have the beta-binomial parameters, I use the dbb function from the R package TailRank to obtain the PMF of the theoretical beta-binomial distribution that best fits the data.
Therefore, the probability of observing at least 300 connections is:
library(TailRank)
probability = sum(dbb(300:7224, 7224, optim.tas$par[1], optim.tas$par[2]))
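If I read the TailRank documentation correctly, the same tail probability can also be obtained from the package's CDF function pbb:
# P(X >= 300) under the fitted beta-binomial; should match the sum above
1 - pbb(299, 7224, optim.tas$par[1], optim.tas$par[2])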
- Is this doable?
- Could this still be considered a p-value?
- Is there perhaps another frequentist approach that can deal with the overdispersion?
- As suggested by sega_sai, could a truncated negative binomial work, even though I have a fixed number of trials (i.e., proportion/count data)?