
Many sources suggest that there is a duality between confidence intervals and hypothesis testing.(*) But I'm having trouble making sense of this philosophically. The frequentist interpretation of a confidence interval is something like (per Wikipedia):

Were this procedure to be repeated on multiple samples, the calculated [90%] confidence interval (which would differ for each sample) would encompass the true population parameter 90% of the time.

Yet the p-value is defined in terms of values the sample mean might take on if the null hypothesis is true. (I.e. in the one-tailed case: $p = P(\bar x\ge \bar x_{observed}\mid\mu = H_0)$).

How is it possible to manipulate a statement about a procedure that is likely to correctly bound the true population mean into a statement about the probability of the observed sample mean?

If we understand the confidence interval as characterizing the distribution of the means of samples from a population (a view the bootstrap procedure invites), then there's no problem. There is an obvious symmetry between two cases: the case in which there is a < 5% chance of the sample mean being more extreme than $H_0$, given the actual population (i.e. $H_0$ falls outside the 95% CI), and the case in which there is a < 5% chance of getting a sample mean as extreme as the one observed, given that the population is really centered at $H_0$ (i.e. $p < 0.05$).
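For a normal sampling distribution, that symmetry can be checked directly: the chance of the sample mean landing at or beyond $H_0$ when the population is centered at $\mu$ equals the chance of it landing at or beyond $\mu$ when the population is centered at $H_0$. A minimal sketch (the particular means, $\sigma$, and $n$ below are made-up illustrative values):

```python
import numpy as np
from scipy import stats

# Hypothetical numbers: true mean mu, null value h0, known sigma, sample size n
mu, h0, sigma, n = 5.0, 4.0, 2.0, 16
se = sigma / np.sqrt(n)  # standard error of the sample mean

# P(sample mean <= h0 | population centered at mu)
p1 = stats.norm.cdf(h0, loc=mu, scale=se)

# P(sample mean >= mu | population centered at h0)
p2 = stats.norm.sf(mu, loc=h0, scale=se)

print(p1, p2)  # equal, by the symmetry of the normal distribution
```

The two probabilities coincide because the normal density depends only on the distance $|\mu - H_0|$ in standard-error units, regardless of which point is taken as the center.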

But this interpretation of CIs seems to be disfavored! In particular, the Wikipedia article admonishes: "A confidence interval is not a range of plausible values for the sample mean, though it may be understood as an estimate of plausible values for the population parameter."

Even if the CI is in fact a range of plausible values of the sample mean, a question remains. How precisely is such a definition equivalent to the frequentist procedure definition above?

(*) A good example is this Minitab blog post:

The confidence level is equivalent to 1 – the alpha level. So, if your significance level is 0.05, the corresponding confidence level is 95%.

  • If the P value is less than your significance (alpha) level, the hypothesis test is statistically significant.
  • If the confidence interval does not contain the null hypothesis value, the results are statistically significant.
  • If the P value is less than alpha, the confidence interval will not contain the null hypothesis value.
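These three statements can be verified numerically. A sketch using a one-sample t-test (the data are simulated, and the effect size, sample size, and seed are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.4, scale=1.0, size=30)  # simulated sample

# Two-sided one-sample t-test of H0: mu = 0
t, p = stats.ttest_1samp(x, popmean=0.0)

# 95% t-based confidence interval for the mean
n = len(x)
se = x.std(ddof=1) / np.sqrt(n)
lo, hi = stats.t.interval(0.95, df=n - 1, loc=x.mean(), scale=se)

# Duality: p < 0.05 exactly when 0 lies outside the 95% interval
print(p < 0.05, not (lo <= 0.0 <= hi))
```

Both tests of significance agree because the t-test and the t-based interval are built from the same statistic, standard error, and critical value.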

1 Answer


You use different null hypotheses in each situation.

When performing a hypothesis test, you set the null hypothesis to some value you are attempting to test the implausibility of. Let's consider the following model:

$$ Y = \beta X + \epsilon $$

You will collect some data and with it, compute an estimate of $\beta$, which we will call $\hat{\beta}$. Then, you will generally set up a hypothesis test as follows:

$$ H_0 : \beta = 0 $$ $$ H_1 : \beta \neq 0 $$

The p-value is computed in the ordinary way for whatever test you are using. To compute a confidence interval, you use the following null hypothesis, which tests whether or not the true value for $\beta$ is equal to the estimated value you observed.

$$ H_0 : \beta = \hat{\beta} $$ $$ H_1 : \beta \neq \hat{\beta} $$

Say you are trying to compute a 95% confidence interval. You would find the bounds of the rejection region of the null distribution (that is, the points at which each one-tailed test gives p = 0.025) and, after converting your test statistic back to the units of $\beta$, you have your confidence interval.
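As a sketch of that inversion, assume a z-test with a known standard error (the values of $\hat{\beta}$ and its standard error below are hypothetical; a t-test would use the t distribution with the appropriate degrees of freedom):

```python
import numpy as np
from scipy import stats

# Hypothetical estimate and standard error
beta_hat, se = 1.8, 0.6

# Under H0: beta = beta_hat, the statistic (beta_hat - beta) / se is N(0, 1).
# Find the bound where each one-tailed p-value equals 0.025.
z_crit = stats.norm.ppf(0.975)

# Converting the rejection bounds back to the units of beta gives the CI
ci = (beta_hat - z_crit * se, beta_hat + z_crit * se)
print(ci)
```

The familiar "estimate ± critical value × standard error" formula is exactly this inversion written out in closed form.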

This is where the duality of hypothesis testing and confidence interval computation comes in: this confidence interval contains the true value for $\beta$ 95% of the time for the same reason that setting $\alpha$ to 0.05 gives you a 5% Type I error rate. Of course, this depends on your test of choice actually being able to maintain the nominal Type I error rate for your dataset, but that's a separate issue entirely.
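The coverage claim can be checked by simulating the repeated-sampling procedure from the Wikipedia definition. A sketch with a known-$\sigma$ z-interval (the true mean, $\sigma$, sample size, and replication count are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu_true, sigma, n, reps = 2.0, 1.0, 25, 2000
z = stats.norm.ppf(0.975)
half = z * sigma / np.sqrt(n)  # half-width of the 95% z-interval

# Repeat the procedure on many samples and count how often the
# (differing) interval encompasses the true population mean.
covered = 0
for _ in range(reps):
    x = rng.normal(mu_true, sigma, n)
    if x.mean() - half <= mu_true <= x.mean() + half:
        covered += 1

print(covered / reps)  # close to 0.95
```

The miss rate (about 5%) is exactly the Type I error rate of the corresponding z-test at $\alpha = 0.05$: the interval misses $\mu$ precisely when the test would reject the true value.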

rishi-k
  • +1 A way I like to think about it is that $p$ is the value such that a $(1 - p)$% confidence interval will have $\hat{\beta}$ as an endpoint. – Dave Jun 29 '21 at 20:52