
In hypothesis testing, the definition of the p-value is the probability of obtaining a test result at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct.

My question is: why the "at least as extreme" part? Why not just the probability of the actually observed result occurring?

For example, testing if a coin is fair:

H0: P(Heads) = 0.5

HA: P(Heads) > 0.5

We run a test on the coin resulting in 8 Heads out of 10 coin flips.

The p-value is P(8 Heads | H0 is True) + P(9 Heads | H0 is True) + P(10 Heads | H0 is True).

Again, my question is why isn't the p-value just P(8 Heads | H0 is True)?
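To make the two candidate quantities concrete, here is a short Python sketch (variable names are mine) computing both $P(8 \text{ Heads} \mid H_0)$ on its own and the tail sum from the question:

```python
from math import comb

n, k = 10, 8  # 10 flips, 8 heads observed

def pmf(j):
    # P(exactly j heads | H0: P(Heads) = 0.5)
    return comb(n, j) * 0.5 ** n

p_exact = pmf(k)                               # P(8 Heads | H0)
p_value = sum(pmf(j) for j in range(k, n + 1)) # P(8, 9, or 10 Heads | H0)
print(p_exact)  # 45/1024 ≈ 0.0439
print(p_value)  # 56/1024 ≈ 0.0547
```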

AJP
  • So if there are 8 heads you say the coin is biased but if it is 9 or 10 you say it is fair? That seems strange to me. – mdewey Oct 15 '23 at 15:55
  • The null and alternative should partition the space, so your null could be rejected if you saw small numbers too. – dimitriy Oct 15 '23 at 16:18
  • The question was closed as "duplicate" but the linked question doesn't address the specific issue raised here => vote reopen. – Christian Hennig Oct 15 '23 at 17:41
    @Christian A quick search of the duplicate for the word "extreme" indicates virtually all of the 15 answers in that thread address this question. There are, as one would expect, other duplicate threads -- they're just harder to find. This site search turns up many promising hits. – whuber Oct 15 '23 at 19:24

1 Answer


Well, if there are 100,000 coin tosses, the probability of any specific result such as 50,039 is very low, so in that case a low probability of whatever we observe doesn't tell us anything about the null hypothesis. By the way, for continuous random variables the probability of any specific observation is 0. That wouldn't be an appropriate "p-value".
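To put numbers on this (a Python sketch using log-gamma to avoid floating-point underflow; the function name is mine): even the single most likely outcome of 100,000 fair tosses has only a small probability, so "the observed result had low probability" is uninformative on its own.

```python
from math import lgamma, log, exp

def binom_log_pmf(n, k, p=0.5):
    # log P(X = k) for X ~ Binomial(n, p), via log-gamma to avoid underflow
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

n = 100_000
print(exp(binom_log_pmf(n, 50_039)))  # ≈ 0.0024, tiny although close to n/2
print(exp(binom_log_pmf(n, 50_000)))  # ≈ 0.0025, the most likely single outcome
```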

The general logic of hypothesis tests is that we reject the null hypothesis if the result falls in a pre-specified rejection region that has low probability under the null, say 5%. These rejection regions are chosen so that observing an event in the rejection region speaks against the null hypothesis and in favour of the alternative hypothesis. So if, as in your example, we test $H_0:\ q\le 0.5$ against $H_1:\ q>0.5$ with $q=P(\text{Heads})$, a reasonable rejection region has the form $R_\alpha=\{X\ge c\}$, with $X$ the random variable giving the observed number of heads (or equivalently the observed proportion), where $c$ is chosen so that $P(R_\alpha)=\alpha$ under $H_0$ (or potentially just smaller if exact equality is impossible due to discreteness). This defines a test with a guaranteed performance characteristic: if $H_0$ is true and the test is applied often, in the long run it will reject in only a proportion $\alpha$ of cases.
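This construction can be sketched in a few lines of Python for the coin example (names are mine): for $n=10$ fair tosses, search for the smallest cutoff $c$ with $P(X\ge c)\le\alpha$, which because of discreteness gives an actual level below $\alpha$.

```python
from math import comb

n, alpha = 10, 0.05

def tail(c):
    # P(X >= c | H0), X ~ Binomial(10, 0.5)
    return sum(comb(n, j) for j in range(c, n + 1)) / 2 ** n

# smallest c with P(X >= c) <= alpha defines the rejection region {X >= c}
c = next(c for c in range(n + 1) if tail(c) <= alpha)
print(c, tail(c))  # c = 9, P(X >= 9) = 11/1024 ≈ 0.0107
```

Note that $\{X\ge 8\}$ would already have probability $56/1024\approx 0.0547 > 0.05$, so 8 heads alone does not reach the 5% level here.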

Now the p-value $p$ is the probability, under $H_0$, of the rejection region that would only just lead to rejection given our observed data, i.e., $R_p=\{X\ge x\}$, where $x$ is the number of heads we actually observed. Thus $p$ is the probability of a proper rejection region determined by what we observe (in other words, of a result as far away from or farther away from what is expected under $H_0$), rather than just the probability of the exact observation.

  • Minor quibble, agree with rest of answer. Wouldn’t fairness require a two-sided test, so that too few and too many heads both count against the null? – dimitriy Oct 15 '23 at 16:36
  • @dimitriy Such a test can be run in a one-sided or two-sided way, depending on what the interest in reality is. For example, there may be some reason to suspect that somebody biased the coin in one specific direction. The one-sided case makes the notation a bit easier, so I'm fine with it as an illustrative example. – Christian Hennig Oct 15 '23 at 17:30