Is estimating confidence intervals (CI) with different sample sizes in each bootstrap valid?

Question

I am trying to estimate a confidence interval using bootstrapping. As an R data.table, my data looks like this:

library(data.table)
df <- data.table(compound= c(rep("ala", 5), rep("beta", 3), rep("phe", 8)),
             obs = c(rep(FALSE, 7), rep(TRUE, 9)))

The statistic I am interested in is the percentage of TRUE values relative to the number of observations (9/16*100 = 56.25, so about 56% for my example data). In my confidence interval I would like to account for the fact that my compounds were selected at random from a large number of compounds. Hence I would intuitively have done something like this (written in R):

boot::boot.ci(boot::boot(data.frame(var = df$compound),
                         function(data, indices, stat_tab = df){
                       comp_samp <- data[indices,]

                       fin_tab <- 
                       lapply(as.list(comp_samp), function(x, stat_tab_l = stat_tab ){
                         stat_tab_l[x == compound]
                         })

                       fin_tab <- rbindlist(fin_tab )

                       round(nrow(fin_tab[obs == TRUE])/nrow(fin_tab )*100,1)
                     },
                     R = 1000),
          index=1,
          type='basic')$basic

Is that a valid thing to do? I am a bit confused since my compounds can lead to different numbers of observations (rows in df) which means that in the different bootstrap samples I will have different numbers of observations when sampling by compound. In case it is not valid, why is that and is there a better way to estimate the CI in my scenario? Thank you

score 1 · Accepted Answer · answered Aug 06 '20 at 15:00

For a TRUE/FALSE outcome variable you should use logistic regression instead, and evaluate all the compounds at once in a single model. If you are primarily interested in the set of compounds you evaluated the model could be something like the following fixed-effects model (in R):

glm(obs ~ compound, family = binomial, data = df)

Here compound would be a multi-level categorical variable. One of the compounds would be specified as the reference; the intercept would be the log-odds of obs=TRUE for that compound. The regression coefficients for the other compounds would be the differences from that reference in log-odds. The standard errors reported for the intercept and regression coefficients provide (with some calculation) confidence intervals for the individual compounds. You would use standard post-hoc tests based on those coefficients and standard errors to examine differences among compounds.
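As a rough sketch of the "some calculation" step, one option is to take predictions with standard errors on the log-odds scale and back-transform them to probabilities. The `toy` data frame below is hypothetical (it is not the data from the question); it was chosen so that every compound has both TRUE and FALSE outcomes, because a compound with only one outcome gives an unstable fit. The 1.96 multiplier assumes a Wald-type 95% interval, which is only one of several possible interval types.

```r
# Hypothetical data: 10 observations per compound, both outcomes present
toy <- data.frame(
  compound = rep(c("ala", "beta", "phe"), times = c(10, 10, 10)),
  obs = c(rep(c(TRUE, FALSE), c(3, 7)),
          rep(c(TRUE, FALSE), c(5, 5)),
          rep(c(TRUE, FALSE), c(8, 2)))
)

# Fixed-effects logistic regression, one coefficient per compound
fit <- glm(obs ~ compound, family = binomial, data = toy)

# Predicted log-odds and their standard errors for each compound
newdat <- data.frame(compound = c("ala", "beta", "phe"))
pred <- predict(fit, newdata = newdat, type = "link", se.fit = TRUE)

# Wald-type 95% intervals on the log-odds scale, mapped back to probabilities
ci <- data.frame(
  compound = newdat$compound,
  estimate = unname(plogis(pred$fit)),
  lower    = unname(plogis(pred$fit - 1.96 * pred$se.fit)),
  upper    = unname(plogis(pred$fit + 1.96 * pred$se.fit))
)
print(ci)
```

Building the interval on the log-odds scale and then applying plogis() keeps the limits inside [0, 1], which an interval computed directly on the probability scale would not guarantee.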

If you instead want to model sampling of these specific compounds from a larger universe of compounds, you could consider instead a random-effects model. In R:

lme4::glmer(obs ~ (1 | compound), family = binomial, data = df)

Then the intercept is an overall intercept for all the compounds, and the individual compounds in your sample are modeled with a Gaussian distribution of intercepts around that value. The model will report the variance among compounds around the intercept. But you won't get confidence intervals for the individual compounds this way. The results will of course depend on the particular compounds in your sample, and the quality of generalization would depend on the representativeness of your sample.
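A minimal sketch of where these quantities sit in the fitted object, assuming the lme4 package is installed. The data are simulated and purely hypothetical (20 made-up compounds with 3 to 10 observations each, outcome coded 0/1), since a random-effects variance cannot be estimated sensibly from only three compounds.

```r
library(lme4)

# Simulated, hypothetical data: each compound gets its own log-odds of obs = 1
set.seed(42)
n_comp <- 20
n_obs <- sample(3:10, n_comp, replace = TRUE)
comp_logit <- rnorm(n_comp, mean = 0.3, sd = 1)
toy <- data.frame(
  compound = factor(rep(paste0("c", seq_len(n_comp)), times = n_obs))
)
toy$obs <- rbinom(nrow(toy), size = 1,
                  prob = plogis(rep(comp_logit, times = n_obs)))

# Random-intercept logistic model
m <- glmer(obs ~ (1 | compound), family = binomial, data = toy)

fixef(m)                    # overall intercept on the log-odds scale
as.data.frame(VarCorr(m))   # variance (vcov) and SD (sdcor) among compounds
head(ranef(m)$compound)     # per-compound deviations from the overall intercept
head(coef(m)$compound)      # overall intercept plus each compound's deviation
```

On a small data set the estimated variance can come out at or near zero, in which case lme4 reports a singular fit.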

In either case you could certainly use bootstrapping in addition, which can be a good check on the quality of the model. Bootstrap from all of the cases. The compounds will be represented differently among the bootstrap samples, but the total sample size (which is what matters) will be the same for all of them. For the fixed-effects model, try fitting the model on a large number of bootstrap samples and see how well the bootstrap-derived models perform on the full original data set. For random-effects modeling you could see how stable the reported individual random effects are from bootstrap sample to bootstrap sample.
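A minimal sketch of bootstrapping from all of the cases, applied here to the simple percentage statistic from the question rather than to a fitted model. It uses the question's example data rebuilt as a plain data.frame; the helper name `pct_true`, the seed, and R = 1000 are illustrative choices.

```r
library(boot)

# Example data from the question, as a plain data.frame
df <- data.frame(
  compound = c(rep("ala", 5), rep("beta", 3), rep("phe", 8)),
  obs = c(rep(FALSE, 7), rep(TRUE, 9))
)

# Statistic: percentage of TRUE among the resampled rows
pct_true <- function(data, indices) {
  mean(data$obs[indices]) * 100
}

set.seed(1)
b <- boot(df, pct_true, R = 1000)  # every replicate resamples all 16 rows
b$t0                               # statistic on the original data: 56.25
boot.ci(b, type = "basic")
```

Because rows rather than compound labels are resampled, every replicate contains exactly nrow(df) observations, even though the mix of compounds differs from replicate to replicate.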

thank you for this very detailed answer! so if I understood you correctly I could use the variance as calculated via glmer() as a measure of uncertainty which is also considering the random sample of compounds. If that is true, where can I find the variance in the resulting glmerMod object? — yasel, Aug 06 '20 at 15:36
@yasel the variance for the random-effects model is the variance among the compounds around the shared intercept. You don't get further variance estimates for the individual compounds. The summary() function applied to the model object reports the variance for each random effect. For a simple model like this the coef() function reports coefficients for each compound while ranef() reports the deviations around the overall intercept. It gets trickier with more complicated models; see this page for an example. — EdM, Aug 06 '20 at 15:56
@yasel it's possible that you could get some type of confidence intervals (CI) for the individual random effects if you wish by repeating the model on multiple bootstrap samples of the data and examining the distribution of coef values. My reluctance to say for sure has to do with my ignorance. Bootstrapping to get CI isn't as straightforward as it initially seems, and I'm not sure how appropriate such CI analysis is for random effects. — EdM, Aug 06 '20 at 16:06
hmmm ok thank you. Since I am far away from being a statistician I won't dare to use the random effects model on individual bootstraps to get my uncertainties then. However, the information that the different compound representations are no problem was already a big help for me thanks! — yasel, Aug 06 '20 at 16:46
@yasel see this answer and other discussion on that page for how to do bootstrapping properly with random-effect or mixed models. — EdM, Aug 06 '20 at 16:52
