I'm (re)studying hypothesis testing from the book All of Statistics by Larry Wasserman (2005, Chapter 10, link), because when I studied the concept in one of my university courses I didn't understand it very well.
I still don't understand some points that many may find simple. Here I'll review, step by step, the main concepts underlying hypothesis testing and p-values as presented in the book, and then I'll point out my doubts.
Context
Wasserman begins by saying that first we choose a null hypothesis and an alternative hypothesis:
$H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_1$
where $\theta$ is the parameter we want to test. The goal of a hypothesis test is to find a rejection region $R$ such that:
- if $X \in R$ then we reject $H_0$
- if $X \notin R$ then we don't reject $H_0$
Then the book talks about the type I and type II errors and introduces the concept of power function and size:
- Power function: $\beta(\theta)=P_{\theta}(X \in R)$
- Size: $\alpha=\sup_{\theta \in \Theta_0} \beta(\theta)$
And it says:
A test is said to have level $\alpha$ if its size is less than or equal to $\alpha$.
Next it introduces one-sided and two-sided tests, and then gives a one-sided example:
Let $X_1, ..., X_n \sim N(\mu, \sigma)$ where $\sigma$ is known. We want to test $H_0: \mu \le0$ versus $H_1: \mu>0$. [...] reject $H_0$ if $T>c$, where $T=\bar{X}$.
Then it calculates the power function for the mean $\mu$:
$\beta(\mu)=P_{\mu}(\bar{X}>c)=P_{\mu}(\frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}>\frac{\sqrt{n}(c-\mu)}{\sigma})=P(Z>\frac{\sqrt{n}(c-\mu)}{\sigma})=1-\Phi(\frac{\sqrt{n}(c-\mu)}{\sigma})$
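To make this concrete, here is a minimal numeric sketch of the power function (the values $n=25$, $\sigma=1$, $c=0.33$ are illustrative choices of mine, not from the book); note how $\beta(\mu)$ increases with $\mu$:

```python
import numpy as np
from scipy.stats import norm

n, sigma, c = 25, 1.0, 0.33   # illustrative values, not from the book

def power(mu):
    """beta(mu) = P_mu(Xbar > c) = 1 - Phi(sqrt(n) * (c - mu) / sigma)."""
    return 1 - norm.cdf(np.sqrt(n) * (c - mu) / sigma)

for mu in (-0.5, -0.1, 0.0, 0.1, 0.5):
    print(f"beta({mu:+.1f}) = {power(mu):.4f}")
```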
Then it calculates the size; since $\beta(\mu)$ is increasing in $\mu$, the supremum over $\mu \le 0$ is attained at $\mu = 0$:
$\text{size}=\sup_{\mu\le0}\beta(\mu)=1-\Phi(\frac{\sqrt{n}c}{\sigma})$
And then:
For a size $\alpha$ test, we set this equal to $\alpha$ and solve for $c$ to get $c=\frac{\sigma\Phi^{-1}(1-\alpha)}{\sqrt{n}}$. We reject when $\bar{X}>\frac{\sigma\Phi^{-1}(1-\alpha)}{\sqrt{n}}$. Equivalently, we reject when $\frac{\sqrt{n}(\bar{X} - 0)}{\sigma}>z_{\alpha}$.
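As a sanity check on that formula, here is a short sketch ($\alpha=0.05$, $\sigma=1$, $n=25$ are again my own illustrative values, not from the book):

```python
import numpy as np
from scipy.stats import norm

alpha, sigma, n = 0.05, 1.0, 25   # illustrative values

# c = sigma * Phi^{-1}(1 - alpha) / sqrt(n)
c = sigma * norm.ppf(1 - alpha) / np.sqrt(n)
print(f"c = {c:.4f}")             # ~0.3290: reject H0 whenever Xbar > c

# The size beta(0) = 1 - Phi(sqrt(n) * c / sigma) recovers alpha
print(f"size = {1 - norm.cdf(np.sqrt(n) * c / sigma):.4f}")   # 0.0500
```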
Then after some pages Wasserman introduces the concept of p-value:
$\text{p-value}=\inf\{\alpha: T(X^n) \in R_{\alpha}\}$
That is, the p-value is the smallest level at which we can reject $H_0$. Informally, the p-value is a measure of the evidence against $H_0$: the smaller the p-value, the stronger the evidence against $H_0$.
The p-value is the probability (under $H_0$) of observing a value of the test statistic the same as or more extreme than what was actually observed.
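To tie the two characterizations together, here is a sketch using the same illustrative $\sigma$ and $n$ as above and a made-up observed mean $\bar{x}=0.40$: the one-sided p-value is the probability, under $\mu=0$, of a sample mean at least as large as the one observed:

```python
import numpy as np
from scipy.stats import norm

sigma, n = 1.0, 25      # illustrative values
xbar_obs = 0.40         # hypothetical observed sample mean

# p-value = P_0(Xbar > xbar_obs) = 1 - Phi(sqrt(n) * xbar_obs / sigma)
p_value = 1 - norm.cdf(np.sqrt(n) * xbar_obs / sigma)
print(f"p-value = {p_value:.4f}")   # ~0.0228
```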
Questions
As I understand it, the steps in the example are:
- Define the null and alternative hypotheses in terms of the parameter we want to test.
- Choose the test statistic. If the test statistic (from now on t.s.) is the sample mean, as in Wasserman's example, then its standardization $\frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}$ is a standard normal random variable (exactly here, since the data are normal; approximately, by the central limit theorem, in general).
- Choose the level $\alpha$ at which we perform the test. By definition, this number is the supremum of the probability that $X \in R$, where the supremum is taken over the parameter values in $\Theta_0$. So can we say that $\alpha$ is the maximum probability of rejecting the null hypothesis when it is true? Notice that this is not the same as saying that $\alpha$ is the probability that the null is true.
- After choosing $\alpha$ we compute $R$. In the example $R$ is determined by $c$, which is obtained from the standard normal quantile $\Phi^{-1}(1-\alpha)$. In this sense $\alpha$ is the area under the normal curve to the right of that quantile, so the smaller $\alpha$ is, the smaller the area, and the farther to the right the quantile must lie.
- Then we check whether $\bar{X}>c$, that is, whether we reject the null. If $\bar{X}>c$, this means we chose a priori an $\alpha$ big enough that we can say a posteriori that the t.s. is in the rejection region (i.e. we were lucky, assuming the null is true); that is, we chose a probability threshold big enough that $\bar{X} \in R$. But if we choose $\alpha$ too low a priori, it can happen that a posteriori the t.s. lies to the left of the quantile, hence $\bar{X} \notin R$ (i.e. we weren't lucky, assuming the null is true). But then, why is it that the smaller the p-value, the stronger the evidence against the null? From the previous reasoning it seems that choosing a higher $\alpha$ would result in a higher probability of finding the t.s. in $R$, hence a higher probability that the data suggest rejecting the null.
Also, if the p-value is computed from the observed data, and the p-value is the lowest $\alpha$ that puts $\bar{X}$ in $R$, why do I need the p-value at all if I already know the value of the t.s., and hence know with certainty whether $\bar{X} \in R$? You could say that the p-value is needed in order to find $R$, but since $R$ is then chosen arbitrarily, why can't I choose an $\alpha$ so big that I'm sure $\bar{X} \in R$? (The sketch below illustrates numerically the relationship between $\alpha$, $R_\alpha$, and the p-value that I have in mind.)
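For concreteness, here is that sketch (same made-up numbers as above): it checks, over a grid of levels, that the observed mean lands in $R_\alpha$ exactly for those $\alpha$ at or above the p-value:

```python
import numpy as np
from scipy.stats import norm

sigma, n, xbar_obs = 1.0, 25, 0.40   # same hypothetical values as above

def c_of_alpha(alpha):
    """Cutoff c of the level-alpha rejection region R_alpha = (c, infinity)."""
    return sigma * norm.ppf(1 - alpha) / np.sqrt(n)

p_value = 1 - norm.cdf(np.sqrt(n) * xbar_obs / sigma)
print(f"p-value = {p_value:.4f}")    # ~0.0228

# xbar_obs lands in R_alpha precisely when alpha >= p-value
for alpha in (0.01, 0.02, 0.03, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}: xbar in R_alpha -> {xbar_obs > c_of_alpha(alpha)}")
```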