TL;DR: Which $N$ is correct for BIC in logistic regression, the aggregated binomial or Bernoulli $N$?
UPDATES AT BOTTOM
Suppose I have a data set to which I'd like to apply logistic regression. For the sake of example, suppose there are $j=5$ groups with $m=100$ participants each, for a total $n=500$. The outcome is 0 or 1. For example, the following data set (R code):
library(dplyr)
library(tidyr)
set.seed(45)
d <- tibble(y = rbinom(500, 1, .5),
x = factor(rep(LETTERS[1:5], each = 100)))
There are two ways I can represent this: as is, above, treating every observation as a Bernoulli random variable, or aggregating observations within groups and treating each observation as Binomial. The number of rows in the data set will be 500 in the first instance, and 5 in the second.
I can construct the aggregated data set:
d %>%
group_by(x, y) %>%
summarise(n = n()) %>%
spread(y, n) %>%
rename(f = `0`, s = `1`) %>%
mutate(n = s + f) -> d_agg
I can then fit the logistic regression using both data sets in R:
g_bern <- glm(y ~ x, data=d, family=binomial)
g_binom <- glm(cbind(s,f) ~ x, data=d_agg, family=binomial)
UPDATE 2: We now fit the intercept only models:
g_bern0 <- glm(y ~ 1, data=d, family=binomial)
g_binom0 <- glm(cbind(s,f) ~ 1, data=d_agg, family=binomial)
and compute the AIC:
> AIC(g_bern)
# [1] 694.6011
> AIC(g_binom)
# [1] 35.22172
which of course differ by a constant
2*sum(lchoose(d_agg$n, d_agg$s)) # [1] 659.3794
as expected (see: Logistic Regression: Bernoulli vs. Binomial Response Variables).
However, the BICs differ by that constant AND a factor that depends on the "number of observations", and the number of observations differ in each:
> BIC(g_bern)
# [1] 715.6742
> BIC(g_binom)
# [1] 33.26891
> nobs(g_bern)
# [1] 500
> nobs(g_binom)
# [1] 5
Just to confirm, we can recalculate BIC for both:
> -2*logLik(g_bern) + attr(logLik(g_bern),"df")*log(nobs(g_bern))
# 'log Lik.' 715.6742 (df=5)
> -2*logLik(g_binom) + attr(logLik(g_binom),"df")*log(nobs(g_binom))
# 'log Lik.' 33.26891 (df=5)
and indeed the only place these two numbers differ is $N$.
UPDATE 2: When we try to assess the factor x, we see a disagreement that is ONLY attributable to the number of observations:
> BIC(g_bern0) - BIC(g_bern)
# [1] -17.66498
> BIC(g_binom0) - BIC(g_binom)
# [1] 0.7556999
UPDATE 2: As expected, the AICs are consistent:
> AIC(g_bern0) - AIC(g_bern)
# [1] -0.8065485
> AIC(g_binom0) - AIC(g_binom)
# [1] -0.8065485
This surprises me, since I would think that R would "know" which of the two to use to prevent ambiguity. It has the same information in both cases.
Which one is "right"? Or is BIC really this arbitrary?
UPDATE: I am not trying to compare the Bernoulli to the Binomial model. This is just a toy example. I have a set of comparisons where it matters which setup I use, because the penalties for $N$ are different. I have two sets of model comparisons and the winning model changes based on the $N$ penalty, even though these appear to me to be the same sets of models.
UPDATES 2 and 3: Added the comparisons to the intercept-only model and changed random seed to get a sign difference in the BIC.
g_bern0, eg, defined? If the idea is that it was supposed to be an intercept-only model, I don't see a problem. The actual numbers of the BIC don't matter. It's just a relative measure. In both cases (ie,g_bern0, &g_binom0), the difference is negative, implying the same preference (for the intercept-only model). There is no inconsistency. – gung - Reinstate Monica Mar 11 '19 at 02:01I cannot see how "the two formats are ultimately equivalent—they contain the same information and mostly just look different on the outside" could possibly be true given that one BIC is -18, and the other is 1. This is a staggering difference.
– Salad dressing Mar 11 '19 at 14:43