6

Below you'll find the data and the corresponding R code used to perform a bootstrap hypothesis test comparing the ratios of the means of two pairs of samples and, additionally, to estimate a 95% confidence interval for each of the two ratios, again via bootstrapping.

X1 <- c(9947.2, 3978.9, 37799, 755.99, 6883.5, 61673, 79577, 15915, 49736, 41800, 31800)
Y1 <- c(5172500, 3163200, 6366200, 915140, 3023900, 1909900, 4894000, 4854200, 3561100, 5829100, 3959000, 2407200, 3779900, 1651200, 3779900)
X2 <- c(216850, 4854.2, 5968.3)
Y2 <- c(65651, 63662, 39789)

D <- mean(X1)/mean(Y1) - mean(X2)/mean(Y2)

boot_D <- rep(0, 10000)
for (i in seq_along(boot_D)) {
  boot_D[i] <- mean(sample(X1, replace=TRUE)) / mean(sample(Y1, replace=TRUE)) -
    mean(sample(X2, replace=TRUE)) / mean(sample(Y2, replace=TRUE))
}

observed_D <- mean(X1)/mean(Y1) - mean(X2)/mean(Y2)
boot_D.underH0 <- boot_D - mean(boot_D)
mean(abs(boot_D.underH0) > abs(observed_D))

The calculated p-value is:

[1] 0.0928

And the corresponding confidence intervals:

boot_CL1 <- rep(0, 10000)
boot_CL2 <- rep(0, 10000)
for (i in seq_along(boot_CL1)) {
  boot_CL1[i] <- mean(sample(X1, replace=TRUE)) / mean(sample(Y1, replace=TRUE))
  boot_CL2[i] <- mean(sample(X2, replace=TRUE)) / mean(sample(Y2, replace=TRUE))
}

quantile(boot_CL1, probs = c(.025, .975))
quantile(boot_CL2, probs = c(.025, .975))

       2.5%       97.5% 
0.004434466 0.013348657

      2.5%      97.5% 
0.08040818 3.80236249 

The confidence intervals do not overlap, which makes me wonder why there is a discrepancy between the calculated p-value and the intervals. Shouldn't the p-value in that case be below 0.05? If so, is there an error somewhere in the implementation of either procedure?

Treex
  • 144
  • 2
    There are just three observations in X2 and Y2. I don't see how bootstrapping those samples is a good idea. – dipetkov Jan 30 '23 at 00:01
  • Do you mean four samples? X1 and Y1 have different lengths. Also agree that X2/Y2 is too small. In any case, your p-value in the first part is wrong - you aren't doing a permutation there, but a bootstrap, so of course it won't agree with the CIs. – David B Jan 30 '23 at 00:06
  • @dipetkov what would be a suitable alternative? I could do a permutation test instead, but I doubt that the observations are truly exchangeable. – Treex Jan 30 '23 at 08:18
  • @DavidB could you please clarify this a bit more? So if both the CIs and the p-value are produced via bootstrapping, they don't necessarily have to agree? How so? – Treex Jan 30 '23 at 08:20
  • 1
    Why do you think that "observations are not truly exchangeable" would be okay for the nonparametric bootstrap? Gathering more data seems the only reasonable alternative. – dipetkov Jan 30 '23 at 08:51
  • @dipetkov That's the only alternative I could come up with short of simply gathering more data. What you suggest could indeed be a solution in general, but I'm also interested in the specific issue here. Do you perhaps have some idea why there's a discrepancy between the p-value and the CIs? – Treex Jan 30 '23 at 09:31
  • @Treex you misunderstand how p-values are calculated with the bootstrap. Where did this approach even come from? I would close this as a duplicate of https://stats.stackexchange.com/questions/20701/computing-p-value-using-bootstrap-with-r but the bounty prevents me from doing so. – AdamO Jan 31 '23 at 18:28
  • @dipetkov there are small-sample corrections for the bootstrap, plus there should be methods to calculate CIs and p-values that agree in all sample sizes; for instance, the agreement of the Wald test and normal CIs is not an asymptotic result, it's an exact one. – AdamO Jan 31 '23 at 18:28
  • @AdamO Have you actually looked at the data in this question? – dipetkov Jan 31 '23 at 18:29
  • @dipetkov do you not think statistical inference is possible based on N=3? – AdamO Jan 31 '23 at 18:30
  • @dipetkov what is a p-value? – AdamO Jan 31 '23 at 18:33
  • While the theory of bootstrap CIs and bootstrap p-values is interesting, the bootstrap can fail. See What are examples where a "naive bootstrap" fails? You'll learn a lot from plotting histograms of the bootstrap sample of E(X1) / E(Y1), the bootstrap sample of E(X2) / E(Y2), and finally the bootstrap sample of the difference in ratios. – dipetkov Jan 31 '23 at 18:57

4 Answers

4

The bootstrap does not generate the distribution under the null hypothesis. Note that your null is that the ratios are equal:

$$ \mathcal{H}_0: E(X_1)/E(Y_1) = E(X_2)/E(Y_2) $$

And yet, when you supposedly generate data "under the null" by calculating:

mean(sample(X1, replace=TRUE)) / mean(sample(Y1, replace=TRUE)) -
    mean(sample(X2, replace=TRUE)) / mean(sample(Y2, replace=TRUE))

you do not, in fact, generate data with equal ratios. Suppose X1 has mean 4 and Y1 has mean 5, whereas X2 has mean 9 and Y2 has mean 10: your supposed "null" then has a ratio of 4/5 in the first pair and 9/10 in the second - the resampling preserves the observed inequality rather than removing it. Not good!

If you search for "bootstrap p value" you will find many posts, the most active here: Computing p-value using bootstrap with R. It does make one wonder why you can't in fact use a permutation test in this case (swap labels between X1 and X2 and between Y1 and Y2... though the quantile CI may still disagree!). Because the result is highly skewed, it would be preferable to use the log transform; the arithmetic of the problem becomes much easier.
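As a rough sketch of what working on the log scale could look like (an illustration only; the number of replicates is chosen arbitrarily):

boot_logD <- replicate(10000,
  log(mean(sample(X1, replace = TRUE)) / mean(sample(Y1, replace = TRUE))) -
  log(mean(sample(X2, replace = TRUE)) / mean(sample(Y2, replace = TRUE))))
# under H0 the log ratio-of-ratios is 0; exp() of these bounds gives a
# CI for the ratio of the two ratios
quantile(boot_logD, probs = c(.025, .975))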

AdamO
  • 62,637
  • 1
    The OP wants to compare the ratio of the means, not the mean of the ratios. And the samples are not paired. Does the log transform make sense for comparing the ratios of the means? – dipetkov Jan 31 '23 at 19:44
  • 1
    @dipetkov good point, edited the statement of null. I still think a log transform makes sense. It's one of the strategies discussed in Kronmal's fallacy of the ratio, which overall provides a good reference for handling these kinds of problems. For instance, if the data are normal, then the sample mean is normal, but the ratio of normals may not have finite mean. There's clearly an unstated constraint on the support of the values, and looking at the histogram in these data shows a skewness that's ameliorated with log transforms - part of the "non-pivotal" issue affecting the proposed approach. – AdamO Jan 31 '23 at 20:15
  • 1
  • Thank you for the clarification. My concern remains that we have one question and four answers and not a single plot of the bootstrap distributions. I've made those histograms and I know that the second ratio is far from normal. So while the theory is elegant, if it were me, I wouldn't make any decisions based on bootstrapping the data provided in the question. – dipetkov Jan 31 '23 at 20:19
  • @AdamO I'm actually using the approach you suggested in a comment here: https://stats.stackexchange.com/questions/20701/computing-p-value-using-bootstrap-with-r

    mean(abs(b3$t0) < abs(b3$t-mean(b3$t)))

    which is basically what I do with

    boot_D.underH0 <- boot_D - mean(boot_D)
    mean(abs(boot_D.underH0) > abs(observed_D))

    So that approach is not appropriate in my case?

    – Treex Feb 01 '23 at 12:38
  • @AdamO And I was thinking of using a permutation test, but it seems to me that the observations are not exchangeable. Hence, I couldn't really compare the means, but perhaps only the equality of distributions - which is not what I'm interested in.

    I can also understand that due to small sample size the bootstrap test is not really appropriate, but that's the only alternative I could come up with.

    – Treex Feb 01 '23 at 12:46
  • 1
    @Treex how are you defining observations to be "exchangeable"? In other words, if the null were true how could you re-arrange the data labels? I contend that under the null, the X_1 and X_2 and Y_1 and Y_2 labels could be permuted respectively. – AdamO Feb 01 '23 at 17:50
  • @AdamO I'm referring to Glen_b's answer and yours, and additionally the discussion you had with him in comments here: https://stats.stackexchange.com/questions/87215/does-a-big-difference-in-sample-sizes-together-with-a-difference-in-variances-ma Furthermore, this comes to mind https://academic.oup.com/bioinformatics/article/22/18/2244/317881 – Treex Feb 01 '23 at 19:56
  • (ctd) Also, I'm assuming that the distributions of X1 and X2 or Y1 and Y2 are quite different in shape and hence labels do not seem exchangeable to me. – Treex Feb 01 '23 at 20:04
3

I agree with @David B that your p-value is incorrectly calculated. Here is another way to do the estimation:

First, it helps to specify your null hypothesis. I will assume that the hypothesis of interest is:

$$H_0: \frac{E(X_1)}{E(Y_1)} - \frac{E(X_2)}{E(Y_2)} = 0 $$

X1 <- c(9947.2, 3978.9, 37799, 755.99, 6883.5, 61673, 79577, 15915, 49736, 41800, 31800)
Y1 <- c(5172500, 3163200, 6366200, 915140, 3023900, 1909900, 4894000, 4854200, 3561100, 5829100, 3959000, 2407200, 3779900, 1651200, 3779900)
X2 <- c(216850, 4854.2, 5968.3)
Y2 <- c(65651, 63662, 39789)

set.seed(193)
N <- 100000
D_vec <- sapply(1:N, function(x) {
  mean(sample(X1, size = length(X1), replace = TRUE)) /
    mean(sample(Y1, size = length(Y1), replace = TRUE)) -
    mean(sample(X2, size = length(X2), replace = TRUE)) /
      mean(sample(Y2, size = length(Y2), replace = TRUE))
}, USE.NAMES = FALSE)

D <- mean(X1)/mean(Y1) - mean(X2)/mean(Y2)
D
[1] -1.337976
mean(D_vec)  # slight bias in the sample
[1] -1.357764
abs(D - mean(D_vec))
[1] 0.01978761
quantile(D_vec, probs = c(0.025, 0.975))  # 0 lies outside the confidence interval
       2.5%       97.5% 
-3.79242323 -0.07188302 

The ratio of X1 to Y1 is significantly different from the ratio of X2 to Y2 at the $\alpha = 0.05$ level.

hist(D_vec, breaks = 100) # not normally distributed and not continuous due to the small number of X2, Y2 samples

any(D_vec > 0)  # no estimates are greater than zero
[1] FALSE

Conclusion: There is enough evidence to suggest that the ratio of X1 to Y1 is different from the ratio of X2 to Y2 at the $\alpha = 0.05$ level.

R Carnell
  • 5,323
1

You are computing very different things

boot_D[i] <- mean(sample(X1, replace=TRUE)) / mean(sample(Y1, replace=TRUE)) -
mean(sample(X2, replace=TRUE)) / mean(sample(Y2, replace=TRUE))

That simulates the sampling distribution of the difference between the ratios.

boot_CL1[i] <- mean(sample(X1, replace=TRUE)) / mean(sample(Y1, replace=TRUE))
boot_CL2[i] <- mean(sample(X2, replace=TRUE)) / mean(sample(Y2, replace=TRUE))

That simulates the sampling distributions of the ratios themselves.

Why are these different

Overlap

An overlap of confidence intervals is not equivalent to a hypothesis test. See this discussion about overlapping error bars: Why is mean ± 2*SEM (95% confidence interval) overlapping, but the p-value is 0.05?

But this is not entirely your problem. You can also add a line

quantile(boot_CL1-boot_CL2, probs = c(.025, .975))

which computes the confidence interval of the difference instead of the difference of confidence intervals. In this case you still get a different result from what the p-value suggests.

Symmetry

The confidence interval is often estimated based on the idea that the distribution is symmetric.

This symmetry means that the probability of observing a value $X_a$ or higher/lower, given that the true value is $X_0$, is the same as the probability of observing the value $X_0$ or lower/higher, given that the true value is $X_a$.

But that assumption of symmetry may be false, or may hold only approximately.

The estimation of the confidence interval may use that symmetry assumption, but to be exact it should compute the p-value for every value inside the interval separately. When one computes a p-value for a single specific hypothesis (instead of a confidence interval, which corresponds to a whole range of p-values and is more difficult to compute), the more exact approach is often used.

As an example, compare the difference between the Wald interval and the Wilson score interval for the confidence interval of a binomial proportion.
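For instance, a small sketch with 3 successes out of 10 trials (numbers chosen purely for illustration):

x <- 3; n <- 10; p_hat <- x/n; z <- qnorm(0.975)
# Wald: forced to be symmetric around p_hat
p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)
# Wilson score: inverts the score test, asymmetric around p_hat
(p_hat + z^2/(2*n) + c(-1, 1) * z * sqrt(p_hat*(1-p_hat)/n + z^2/(4*n^2))) /
  (1 + z^2/n)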

Let's look at a graph of the sampling distribution based on a bootstrap simulation:

[figure: histogram of the bootstrapped distribution of D, with asymmetric lower and upper confidence bounds]
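(A rough way to reproduce such a plot from the question's own boot_D; I am assuming the figure showed its histogram with the percentile bounds marked:)

hist(boot_D, breaks = 100, main = "Bootstrap distribution of D")
abline(v = quantile(boot_D, c(.025, .975)), lty = 2)  # percentile bounds
abline(v = 0, col = "red")                            # the hypothesized null value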

The confidence interval bounds are not distributed symmetrically. With the construction of the confidence bounds you (probably falsely) assume that the probability of observing a zero, given that the true value is the observed value, is the same as the probability of observing the observed value, given that the true value is zero.

With your computation of the p-value you do it differently and shift the entire curve; you then make the comparison with the longer left tail.
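A quick sketch of that asymmetry, using the question's boot_D and observed_D:

boot_D0 <- boot_D - mean(boot_D)   # the shifted "null" distribution
mean(boot_D0 <= -abs(observed_D))  # left-tail exceedances
mean(boot_D0 >=  abs(observed_D))  # right-tail exceedances
# with a skewed boot_D these two tail contributions differ markedly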

  • Another problem is that the estimated cumulative distribution is very coarse, and that relates to the small sample size of the 2nd sample. Due to its large values it dominates the computed $D$ statistic, and there are only very few combinations in the bootstrap (basically the 27 ways to resample $X_2$). So with this sample you cannot compute a reliable 95% confidence interval based on bootstrapping, as your estimate of the distribution of $X_2$ is very poor. – Sextus Empiricus Feb 01 '23 at 09:34
0

Your p-value is incorrectly calculated. To get a p-value from resampling, you simulate a dataset that is as close to the true data as possible but in which the null hypothesis is true. The bootstrap resampling procedure you use does not accomplish this. Instead, you want to permute group assignments - whether a given observation is assigned to X1 or X2, or to Y1 or Y2, should be randomized, as sketched below.
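A minimal sketch of such a permutation test (an assumption on my part: it treats the pooled X values and the pooled Y values as exchangeable under the null, which the comments dispute):

set.seed(1)
perm_D <- replicate(10000, {
  Xp <- sample(c(X1, X2))  # randomly reassign the pooled X values
  Yp <- sample(c(Y1, Y2))  # randomly reassign the pooled Y values
  mean(Xp[seq_along(X1)]) / mean(Yp[seq_along(Y1)]) -
    mean(Xp[-seq_along(X1)]) / mean(Yp[-seq_along(Y1)])
})
mean(abs(perm_D) >= abs(observed_D))  # two-sided permutation p-value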

Edit: Your code treats boot_D as if it were a null distribution, so I assumed that was what you were trying to construct. The problem is that it is not a null distribution, so what you computed is not a p-value. What it is, is the bootstrapped distribution of the difference in ratios. Looking at the 95% CI of that bootstrap:

> quantile(boot_D,probs = c(.025, .975))
       2.5%       97.5% 
-3.79239918 -0.07143669 

You can see that it doesn't overlap zero - i.e., the difference in ratios is significantly different from zero, p < 0.05.

David B
  • 1,532
  • I think that would be the case if the test in question were a permutation test. However, the test in question is a bootstrap test - no randomization of the labels between groups. – Treex Jan 30 '23 at 19:43
  • See my edit, your p-value isn't a p-value. – David B Jan 31 '23 at 16:43
  • This is what the last part of the code is for: comparing the observed statistic to its distribution under H0, so I presumably correct for this.

    Based on this question and the answer by jan s.

    boot_D.underH0 <- boot_D - mean(boot_D)
    mean(abs(boot_D.underH0) > abs(observed_D))

    – Treex Jan 31 '23 at 16:58
  • 1
    mean-centering an empirical distribution doesn't automatically make it null. In general, it only works on normal distributions. – David B Jan 31 '23 at 17:28
  • You can use permutations to test the equality of distributions. But the distributions of X1, X2, Y1 and Y2 are not considered the same. – Sextus Empiricus Feb 01 '23 at 09:40