To make the question easier to follow, I will use a reproducible example.
I have count data: the number of connections that different groups share with one particular group. There is an upper bound of 7224 possible connections, so the number of trials is FIXED at 7224.
counts = c(140, 267, 102, 261, 112, 168, 73, 163, 84, 206, 158, 92, 142, 178, 176, 133, 174, 140, 51, 318, 242, 59, 84, 123, 157, 239, 142, 65, 110, 96, 118, 171, 104, 121,
           134, 123, 100, 84, 233, 244, 192, 44, 105, 211, 200, 220, 103, 197, 97, 75, 176, 201, 147, 171, 164, 147, 75, 230, 184, 135)
I check for overdispersion using Binomial or Poisson distributions:
dat = data.frame(quad = 1:60, group_size = 7224, count = counts)
binom.tas = glm(count/group_size ~ 1,
                family = binomial,
                weights = group_size,
                data = dat)
The summary for the binomial gives us:
> summary(binom.tas)
Call:
glm(formula = count/group_size ~ 1, family = binomial, weights = group_size, data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-10.2331 -4.0805 -0.6127 2.9193 12.1415
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.85798 0.01067 -361.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1466.8 on 59 degrees of freedom
Residual deviance: 1466.8 on 59 degrees of freedom
AIC: 1873.1
Number of Fisher Scoring iterations: 4
As I understood from reading other questions (e.g. here), if the model fits the data well, the ratio of the residual deviance to its degrees of freedom should be about one. In my case 1466.8/59 ≈ 24.9, which is far greater than 1, so we can consider the data overdispersed.
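The same ratio can be computed directly from the fitted model object:
# deviance/df ratio; values far above 1 indicate overdispersion
binom.tas$deviance / binom.tas$df.residual
# [1] 24.86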
I also tried the same check with a Poisson model, using dispersiontest() from the AER package:
library(AER)
rd <- glm(counts ~ 1, family = poisson)
> dispersiontest(rd, trafo = 1)
Overdispersion test
data: rd
z = 5.6057, p-value = 1.037e-08
alternative hypothesis: true alpha is greater than 0
sample estimates:
alpha
22.8386
Null and alternative hypotheses:
$$H_0: \theta = p \qquad H_a: \theta > p.$$
The test statistic we will use is based on the binomial distribution; i.e., if $X$ is the number of connections observed in $n$ trials, then
$$X \mid H_0 \sim \mathrm{Binomial}(n, \theta = p).$$
I do not know $p$, but I estimate it from many random samples (here I have only 60, but in practice I use 1000), using the fact that
$$E(X) = np \quad\Longrightarrow\quad \hat{p} = \bar{X}/n.$$
This moment estimate and the glm intercept give the same estimated $p$.
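As a quick check, both routes give $\hat{p} \approx 0.0207$ here:
mean(counts) / 7224           # moment estimate of p
plogis(coef(binom.tas))       # glm intercept back-transformed from the logit scale
binom.tas$fitted.values[1]    # the same value, as a fitted probability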
Then, if we observe $X = 300$ connections in $n = 7224$ trials, the p-value corresponding to this test statistic is straightforward to calculate:
probability = sum(dbinom(300:7224, 7224, binom.tas$fitted.values[1]))
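Equivalently, and avoiding the long sum, the upper tail can be taken directly from the binomial CDF:
# P(X >= 300) = P(X > 299)
pbinom(299, 7224, binom.tas$fitted.values[1], lower.tail = FALSE)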
However, I know that the binomial distribution does not really fit my data, since it cannot accommodate the overdispersion; the same holds for the Poisson. I cannot use the negative binomial because my counts have an upper bound. Therefore, the best option I could find was the beta-binomial distribution.
The problem for me arises here. As far as I understand, the Poisson and binomial models are frequentist approaches. In the binomial case we assume the same success probability $p$ for all the Bernoulli trials. With the beta-binomial distribution, on the other hand, the success probability at each trial is not fixed but random, following a beta distribution.
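Concretely, integrating the binomial likelihood over the beta density gives the beta-binomial PMF on which the log-likelihood below is based:
$$P(X = r \mid n, \alpha, \beta) = \binom{n}{r}\,\frac{B(r+\alpha,\; n-r+\beta)}{B(\alpha,\beta)},$$
where $B(\cdot,\cdot)$ is the beta function, and the mean is $E(X) = n\alpha/(\alpha+\beta)$.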
To obtain the maximum-likelihood estimates of the beta-binomial parameters, I did the following (based on this):
# To obtain the estimates for alpha and beta, the
# log-likelihood function needs to be defined.
lbetabin <- function(data, inits) {
  n <- data[, 1]      # group_size (number of trials)
  r <- data[, 2]      # incidence counts
  alpha <- inits[1]   # current value of alpha
  beta <- inits[2]    # current value of beta
  # Because optim() minimizes, the NEGATIVE log-likelihood is returned.
  # The binomial coefficient is dropped, as it does not depend
  # on the parameters of interest.
  sum(-lgamma(alpha + r) - lgamma(beta + n - r) + lgamma(alpha + beta + n) +
        lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta))
}
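A quick smoke test with arbitrary parameter values (my own check, not from the linked source) confirms the function returns a finite value for optim to minimize:
lbetabin(dat[, 2:3], c(1, 50))   # finite, and larger than the optimum found below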
I set the starting values for the beta parameters (these are just initial values for the optimizer, not a prior in the Bayesian sense) as:
# For alpha, use the estimated p from the binomial model,
# and for beta, use 1 - p
inits = c(binom.tas$fitted.values[1], 1 - binom.tas$fitted.values[1])
optim.tas = optim(inits, lbetabin,
                  hessian = TRUE, data = dat[, 2:3])
> optim.tas
$`par`
1 1
6.096481 288.555568
$value
[1] 43019.25
$counts
function gradient
77 NA
$convergence
[1] 0
$message
NULL
$hessian
1 1
1 10.0387151 -0.195681423
1 -0.1956814 0.004256435
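To sanity-check the fit, the estimates can be converted to the implied mean success probability and overdispersion, and approximate standard errors can be taken from the Hessian (again my own check):
a <- optim.tas$par[1]; b <- optim.tas$par[2]
a / (a + b)                            # implied mean probability, ~0.0207, matches p-hat
1 / (a + b + 1)                        # implied intra-class correlation (overdispersion)
sqrt(diag(solve(optim.tas$hessian)))   # approximate standard errors of alpha and beta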
Once I have the beta-binomial parameters, I use the dbb function from the R package TailRank to obtain the PMF of the theoretical beta-binomial distribution that best fits the data.
Therefore, the probability of observing at least 300 connections is:
library(TailRank)
probability = sum(dbb(300:7224, 7224, optim.tas$par[1], optim.tas$par[2]))
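If I read the TailRank documentation correctly, the same tail probability can also be obtained from the package's CDF function pbb:
# P(X >= 300) under the fitted beta-binomial; should match the sum above
1 - pbb(299, 7224, optim.tas$par[1], optim.tas$par[2])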
- Is this doable?
- Could this still be considered a p-value?
- Is there perhaps another frequentist approach that can deal with the overdispersion?
- As suggested by sega_sai, could a truncated negative binomial work, even though I have a fixed number of trials (i.e., proportion/count data)?