p-values and confidence intervals are not matching

Question

My model is here. I'm running a bootstrapping mixed-effects model with "case" type with lmersmapler. The thing is, now I'm using bootstrap_pvals()to obtain p-values and the results are not matching. Based on the 95% CIs, I should get all p < 0.05 except for the interaction, but the CIs and p-values are not matching. I need some help.

the model:

mod1 <- lmer(CONT_Y ~ YEAR * MY_GROUP + (1|PARTICIPANTS), data = data)
mod1_boot <- bootstrap(mod1, .f = fixef, type = "case", B = 1000,
                       resample = c(TRUE, FALSE))
> confint(mod1_boot, type = "norm")
A tibble: 4 x 6
term                estimate  lower  upper type  level
  <chr>                  <dbl>  <dbl>  <dbl> <chr> <dbl>
1 (Intercept)           17.6   16.7   18.5   norm   0.95 ##DOESN'T CONTAIN 0
2 YEAR2                 1.14   0.186  2.13  norm   0.95 ## DOESN'T CONTAIN 0
3 GROUPB                0.915  0.155  1.68  norm   0.95 ## DOESN'T CONTAIN 0
4 YEAR2:GROUPB         -0.602 -1.80   0.577 norm   0.95 ## CONTAINS 0!!!
>

obtaining p-values:


bootstrap_pvals(mod1, type = "case", B = B, resample = c(TRUE, F),
                         aux.dist = norm)
$coefficients
A tibble: 4 x 7
term                Estimate Std. Error    df t value Pr(&gt;|t|) p.value
  <chr>                <dbl>        <dbl> <dbl>     <dbl>      <dbl>   <dbl>
1 (Intercept)          17.6          0.393  79.7    44.8     2.67e-58   0.389
2 YEAR2                1.14         0.517  63.0     2.21    3.06e- 2   0.565
3 GROUPB               0.915        0.517  63.0     1.77    8.17e- 2   0.589
4 YEAR2:GROUPB        -0.602        0.731  63.0    -0.824   4.13e- 1   0.568
$B
[1] 1000
ps: conf.int doesn't work on this object

It doesn't make sense to me, the interaction should be ns and the other coefficient should be sig. Any ideas?
how can I get the 95 % CI for a bootstrap_pvals object ?
Edit 1: Trying to convert the factor variables into binary numerical ones:

num_data <- data %>% 
  mutate(YEAR_num = case_when( 
           YEAR == "A" ~ 0, YEAR == "B" ~ 1),
         GROUP_num = case_when(
           GROUP == "G1" ~ 0, GROUP == "G2" ~ 1))
check
> class(num_data$YEAR_num)
[1] "numeric"
> class(num_data$GROUP_num)
[1] "numeric"
refit:
mod1 <- lmer(CONT_Y ~  YEAR_num * GROUP_num + (1|PARTICIPANTS), data = num_data, REML = FALSE)
bootstrap it:
mod_p <- bootstrap_pvals(mod1, type = "case", B = 1000, resample = c(TRUE, F), aux.dist = norm)
check:
mod_p
> mod_p
$coefficients
A tibble: 4 x 7
term                  Estimate Std. Error    df t value Pr(&gt;|t|) p.value
  <chr>                    <dbl>        <dbl> <dbl>     <dbl>      <dbl>   <dbl>
1 (Intercept)             17.6          0.393  79.7    44.8     2.67e-58   0.395
2 YEAR_num                 1.14         0.517  63.0     2.21    3.06e- 2   0.559
3 GROUP_num               0.915        0.517  63.0     1.77    8.17e- 2   0.582
4 YEAR_num:GROUP_num     -0.602        0.731  63.0    -0.824   4.13e- 1   0.574

DATA:

data <- structure(list(PARTICIPANTS = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
                                        3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 
                                        7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 
                                        10L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 
                                        14L, 14L, 14L, 14L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 17L, 
                                        17L, 17L, 17L, 18L, 18L, 18L, 18L, 19L, 19L, 19L, 19L, 20L, 20L, 
                                        20L, 20L, 21L, 21L, 21L, 21L), CONT_Y = c(19.44, 20.07, 19.21, 
                                                                                  16.35, 11.37, 12.82, 19.42, 18.94, 19.59, 20.01, 19.7, 17.92, 
                                                                                  18.78, 19.21, 19.27, 18.46, 19.52, 20.02, 16.19, 19.97, 13.83, 
                                                                                  15.93, 14.79, 21.55, 18.8, 19.42, 19.27, 19.37, 17.14, 14.45, 
                                                                                  17.63, 20.01, 20.28, 17.93, 19.36, 20.15, 16.06, 17.04, 19.16, 
                                                                                  20.1, 16.44, 18.39, 18.01, 19.05, 18.04, 19.69, 19.61, 16.88, 
                                                                                  19.02, 20.42, 18.27, 18.43, 18.08, 17.1, 19.98, 19.43, 19.71, 
                                                                                  19.93, 20.11, 18.41, 20.31, 20.1, 20.38, 20.29, 13.6, 18.92, 
                                                                                  19.05, 19.13, 17.75, 19.15, 20.19, 18.3, 19.43, 19.8, 19.83, 
                                                                                  19.53, 16.14, 21.14, 17.37, 18.73, 16.51, 17.51, 17.06, 19.42
                                        ), CATEGORIES = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
                                                                    1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
                                                                    1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
                                                                    1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
                                                                    1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
                                                                    1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("A", 
                                                                                                                            "B"), class = "factor"), MY_GROUP = structure(c(1L, 2L, 1L, 2L, 
                                                                                                                                                                            1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
                                                                                                                                                                            1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
                                                                                                                                                                            1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
                                                                                                                                                                            1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
                                                                                                                                                                            1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L
                                                                                                                            ), .Label = c("G1", "G2"), class = "factor")), row.names = c(NA, 
                                                                                                                                                                                         -84L), class = c("tbl_df", "tbl", "data.frame"))
rename collumn:
data <- data %>%  rename(., YEAR = CATEGORIES)

Why are there two columns p-values in the lower table? Please improve the formating of the tables. The value Pr(>|t|) is $0.413$ for the interaction so quite large and not significant at the 5% level. — COOLSerdash, Jan 28 '23 at 20:29
@COOLSerdash , I honestly don't know. I've never used this bootstrap_pvals() before, I've read the doc, but it didn't quite answer that. However, it was the only function that gave me pvalues for my lmeresampler model — Larissa Cury, Jan 28 '23 at 20:33
Ok, the documentation clearly states that the last column are bootstrap p-values. But as I've said: Both the "traditional" p-value ($0.413$) as well as the bootstrap p-value ($0.568$) are consistent with the confidence interval. — COOLSerdash, Jan 28 '23 at 20:35
@COOLSerdash , still, tho, then YEAR2, GROUPB and the intercept should be significant. The last column says that they aren't, tho the CI shows otherwise. Any ideas? — Larissa Cury, Jan 28 '23 at 20:37
I've had a look at the source code of bootstrap_pvals and especially bootstrap_pvals.merMod. I think it's a bug stemming from how the code updates the model. Try converting the group and year variables to binary numeric variables and refit the following model: lmer(CONT_Y ~ year_num + group_num + year_num:group_num + (1|PARTICIPANTS), data = data). Then run the bootstrap again. Now the p-values will be consistent (up to random error). I also strongly suggest submitting a bug report here: https://github.com/aloy/lmeresampler/issues — COOLSerdash, Jan 28 '23 at 21:08
@COOLSerdash , thank you for the effort, I really appreciate it, I really really need to solve this! You mean changing it into zeros and ones, for example? Essentially, will it still make sense, since the variables are categorical ? (year and group of exam) ? — Larissa Cury, Jan 28 '23 at 21:18
Yes, change YEAR into 0/1 (A = 0, B = 1) and the same with MY_GROUP (G1 = 0, G2 = 1). Just compare the models to see that they're identical. But please report the bug to improve the package! This will help you and others in the future. — COOLSerdash, Jan 28 '23 at 21:22
Thank you, @COOLSerdash , I tried that, but it didn't work (see the edit above, please). Yes, if course, If I can help others, I'll do that. — Larissa Cury, Jan 28 '23 at 21:37
ps: in addition to that, do you know why confi.int won't work for that? > confint(mod_p, type = "norm") Error in UseMethod("vcov") : no applicable method for 'vcov' applied to an object of class "coef_tbl . Even if I get the right p-values, I'm not getting the CIs and I need to report both — Larissa Cury, Jan 28 '23 at 21:38
I'll report the bug. The confidence intervals are calculated with the other function. Report those. — COOLSerdash, Jan 28 '23 at 21:39

score 3 · Accepted Answer · answered Jan 29 '23 at 22:41

I suspect that this question is written, in part, from an expectation that there is a one-to-one correspondence between confidence intervals and p-values. While some confidence intervals can be derived by inverting a hypothesis test (and then the one-to-one correspondence is exact), this is not the case for all confidence intervals.

With the (nonparametric) bootstrap we get resamples from the bootstrap distribution $\hat{F}$, so that we can calculate the statistic $T$ on each resample: $$ \begin{aligned} \hat{F} \longrightarrow \mathbf{x}^* \longrightarrow t^* \end{aligned} $$ Notice that no hypothesis, either null or alternative, makes an appearance.

On the other hand, a p-value is defined as a long-run frequency under the null hypothesis. Let's say that the null hypothesis is $T = 0$ and the p-value is two-sided. Then $$ \begin{aligned} p = \operatorname{Pr}\left\{|T| \geq |t| ~\big|~ H_0\right\} \end{aligned} $$

The "under null hypothesis" part is crucial to the meaning of p-values. So to bootstrap p-values, we would need to generate sample from the null hypothesis sampling distribution: $$ \begin{aligned} p = \operatorname{Pr}\left\{|T| \geq |t| ~\big|~ \hat{F}_0\right\} \end{aligned} $$ And this is challenging; we'll need to make assumptions in order to sample from $\hat{F}_0$.

An obvious approach is to assume that the observed data are generated by the linear mixed model we have in mind: $\mathcal{F} = \left\{f_{\boldsymbol{\beta},\mathbf{u},\sigma^2}\right\}$ where $\boldsymbol{\beta}$ are the fixed effects, $\mathbf{u}$ are the random effects and $\sigma^2$ is the error variance. And we use the parametric rather than the nonparametric bootstrap.

$$ \begin{aligned} f_{\hat{\boldsymbol{\beta}},\hat{\mathbf{u}},\hat{\sigma}^2} \longrightarrow \mathbf{x}^* \longrightarrow t^* \end{aligned} $$

Now that we have an explicit model for the data, we can write down the model under the null hypothesis that one specific coefficient, say $\beta_k$, is equal to 0. We can fit this restricted model to the data and then use it to generate resamples $\left\{\mathbf{x}^* | H_0\right\}$. Finally, we fit the full model to those resamples to simulate the distribution of $T_k = \beta_k/\operatorname{se}(\beta_k)$ under the null hypothesis that $\beta_k = 0$.

$$ \begin{aligned} p = \operatorname{Pr}\left\{|T_k| \geq |t_k| ~\big|~ f_{\hat{\boldsymbol{\beta}}_{-k},\beta_k=0,\hat{\mathbf{u}},\hat{\sigma}^2}\right\} \end{aligned} $$

From looking at the code, this is the theory behind the implementation of lmeresampler::bootstrap_pvals.

However, there is an issue with this approach when applied to a model with interactions or polynomial terms. A model that has no main effect for either Year or Group but includes the Year×Group interaction doesn't make sense. Similarly, a model that drops the linear term but leaves in the higher order polynomials (of the same continuous predictor) won't make sense. As a result, the p-values computed from those weird bootstrapping resamples might not be not interpretable either.

As a demonstration, I fit the model Y = Group + Year + (1 | ID) and then use your code to compute confidence intervals and p-values; note that now the CIs and the p-values agree because it's reasonable to propose a model that has only a main term for Year or a main term for Group.

mod0 <- lmer(
  Y ~ GROUP + YEAR + (1 | PARTICIPANT),
  data = data1,
  REML = FALSE
)
mod0_boot <- bootstrap(
  mod0, B = 2000,
  type = "case", resample = c(TRUE, FALSE),
  .f = fixef
)
confint(mod0_boot, type = "norm")
#> # A tibble: 3 × 6
#>   term        estimate   lower upper type  level
#> 1 (Intercept)   17.8   16.9    18.7  norm   0.95
#> 2 GROUP2         0.614 -0.0146  1.22 norm   0.95
#> 3 YEARB          0.843 -0.0552  1.74 norm   0.95
mod0_pvals <- bootstrap_pvals(
  mod0, B = 2000,
  type = "case", resample = c(TRUE, FALSE),
  aux.dist = norm
)
mod0_pvals
#>   term        Estimate Std. Error t value  p.value
#> 1 (Intercept)   17.8          0.349     50.9  0.000500
#> 2 GROUP2         0.614        0.368      1.67 0.0625

#> 3 YEARB          0.843        0.368      2.29 0.0600

I take two lessons from this exercise.

Lesson 1: A regression which keeps an interaction but drops a main effect is not meaningful.

When you bootstrap the p-value of interaction from the model Y ~ Year * Group, you get a meaningful p-value only for the interaction because the model which "zeros out" the interaction but leaves in the intercept and the two main effects is a meaningful model. You don't get sensible p-values for the interaction and the main effects because zeroing out these coefficients while keeping the interaction doesn't give a meaningful model.

Lesson 2: Estimation is (often) more meaningful than hypothesis testing.

It seems to me that bootstrapping p-values from a linear mixed model is a strange thing to do as it requires making more assumptions that computing (nonparametric) bootstrap confidence intervals. Most importantly, we have to decide how to we generate bootstrap resamples under $H_0$. So I suggest to stick with estimation (this is what you get with the bootstrap confidence intervals) than hypothesis testing (this is what you get with the bootstrap p-values).

In short, what you describe is missing in the implementation is the adherence to the principle of marginality. — COOLSerdash, Jan 30 '23 at 16:59
@COOLSerdash I didn't know the term, only the advice not to include interactions without the corresponding main effects. So thank you for pointing out the term to me. — dipetkov, Jan 30 '23 at 17:34

p-values and confidence intervals are not matching

A tibble: 4 x 6

A tibble: 4 x 7

check

refit:

bootstrap it:

check:

A tibble: 4 x 7

rename collumn:

1 Answers1

Linked