
I teach introductory statistics to undergraduates, and they are often confused by hypothesis testing. In particular, while the rule is

we reject the null hypothesis at significance level $\alpha$ when the p value is less than $\alpha$

they often interpret it the opposite way. Say, if the p value is 0.04, they say "we reject at 1% but not at 5%".
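Written out, the rule applied to this example gives the opposite of what they say:

$$p = 0.04 \le 0.05 \;\Rightarrow\; \text{reject } H_0 \text{ at } \alpha = 5\%, \qquad p = 0.04 > 0.01 \;\Rightarrow\; \text{do not reject } H_0 \text{ at } \alpha = 1\%.$$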

On one level, this is about deeper understanding, which might be my fault as a teacher. But on another level (given that we are not always engaging with the deeper side of things), perhaps a cool mnemonic would help them with the correct interpretation.

Do you have a cool, undergraduate-level tip about how to correctly interpret p values vs significance level $\alpha$? I haven't come across any such tip.

luchonacho
  • Interesting, students & academics are often confused about p-values, but this is a new one to me – innisfree Feb 24 '21 at 05:37
  • The more you emphasise P-values, the less attention is needed for whether the hypothesis is rejected. – Nick Cox May 29 '23 at 11:10
  • I see exactly this one too often. I still have the hope that most such answers come from students who say that they are not interested in statistics – Ute May 29 '23 at 12:03

4 Answers

4

This surely will not top the list of possible "cool undergraduate-level tips", but simply recalling the definition of a p-value might be helpful (quoted from Wikipedia):

The probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.

So the smaller the probability, the smaller the significance level at which we are willing to reject.
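To make the definition concrete, here is a minimal Monte Carlo sketch in Python; the standard-normal null model and the observed value 1.75 are assumptions made purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative null model: the test statistic T is standard normal under H0.
    t_null = rng.standard_normal(100_000)   # simulated values of T under H0
    t_obs = 1.75                            # a hypothetical observed value (one-sided test)

    # p-value: probability, under H0, of a result at least as extreme as the one observed
    p_value = np.mean(t_null >= t_obs)

    for alpha in (0.10, 0.05, 0.01):
        decision = "reject H0" if p_value <= alpha else "do not reject H0"
        print(f"alpha = {alpha:.2f}: p = {p_value:.3f} -> {decision}")

With these numbers the p-value comes out at roughly 0.04, so the same result is rejectable at the 10% and 5% levels but not at the 1% level.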

  • +1 recalling and applying definitions for terms is an excellent skill to build in undergraduates and students in science. – AdamO May 31 '23 at 13:11
2

The standard mnemonic for remembering how to make a conclusion in a hypothesis test is:

If p is low, the null must go!

As to why this is the case, the best explanation of a classical hypothesis test is that it is the inductive analogue of a proof by contradiction. In a proof by contradiction we begin with a null hypothesis, show that this leads logically to a contradiction, and therefore reject the initial premise that the null is true. In a classical hypothesis test, we begin with a null hypothesis, show that this leads to a highly implausible result in favour of the alternative (so not quite a deductive contradiction, but close), and therefore reject the initial premise that the null is true. The p-value in this test is the probability of a result at least as conducive to the alternative hypothesis, assuming the null is true (see formal explanation here). If this is low then it means that something very implausible happened (under the assumption that the null is true), which gives the "contradiction" in the "inductive proof by contradiction".
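A small Python sketch of this logic; the simulated sample and the hypothesised mean of zero are made up for the example, and scipy's one-sample t-test simply stands in for whatever test is actually being run:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.normal(loc=0.5, scale=1.0, size=20)   # hypothetical sample; H0: population mean = 0

    res = stats.ttest_1samp(x, popmean=0.0)
    print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")

    # "If p is low, the null must go": reject H0 at level alpha when p <= alpha.
    for alpha in (0.05, 0.01):
        print(f"reject H0 at alpha = {alpha}?", res.pvalue <= alpha)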

Ben
  • Thanks for the mnemonic - if anyone knows a Danish one, please post here, too :-) – Ute May 29 '23 at 10:57
  • It's not an answer but I sometimes emphasise that there is nothing unusual about a figure of merit where low means good, as witness unemployment, inflation, goals let in by your favourite team, number of criminal convictions. – Nick Cox May 29 '23 at 11:12
  • "good" is subjective. If you want to defend the null hypothesis for whatever reason, then large p-values are good... – Ute May 29 '23 at 12:17
1

Fisher is said to have given the interpretation of $p$-values as a "measure of surprise", given that you believe in the null hypothesis. This may actually be confusing, since a low $p$-value then indicates strong surprise.

Instead, we can introduce $p$-values as a "measure of compatibility with the null" (suggested by Christian Hennig). Then: low p = low compatibility.

In Cox and Hinkley's 1974 text, the p-value is used "as a measure of the consistency of the data with the null hypothesis" (p. 66). Earlier, Cox (1958) described a significance test as "concerned with the extent to which the data are consistent with the null hypothesis" (p. 362).

Ute
  • Should "trust" be replaced by "belief" or something else? (non-native speaker) – Ute May 29 '23 at 10:59
  • Terms with psychological overtones already hinder more than they help. Confidence and significance have already done much harm, because confidence in a statistical sense is nothing to do with what the researcher feels or should feel, and significance is just a vague word that could mean almost anything. There is no easy solution here because completely new words or new technical jargon, e.g. calling a confidence interval a Neyman interval, would both be obnoxious in different ways. – Nick Cox May 29 '23 at 11:09
  • I agree with @NickCox. On another note, for a historical take, refer to the source in my old post here. – User1865345 May 29 '23 at 11:14
  • @NickCox Thank you! The problem that I see for my students is that they read "measure of surprise" everywhere, in textbooks, internet... - how can I most effectively counteract the intuition "low p-value = no surprise = null seems ok" ? – Ute May 29 '23 at 11:19
  • I don't see much use of "surprise" in what I read and I certainly don't recommend it. For a start a null hypothesis could just be a sceptical or cautious hypothesis that you expect to overturn, and there is not so much surprise but more disappointment if that doesn't happen. No; P-value is used precisely because it alludes to calculating a probability. – Nick Cox May 29 '23 at 11:54
  • @NickCox "Disappointment", yeah. but it is as psychological and as misleading as "surprise" (depending on the standpoint of course). – Ute May 29 '23 at 11:59
  • Indeed; but I am not recommending the use of "disappointment" for any teaching or reporting, just underlining that "surprise" is not the only dimension even psychologically. – Nick Cox May 29 '23 at 12:11
  • Never ever should we trust in a null hypothesis (which implies all model assumptions), regardless of the p-value. Neither should we believe it. Re "measure of agreement" - what about "compatibility" (as used in the ASA recommendations about p-values)? – Christian Hennig May 29 '23 at 14:27
  • +1 and thanks @ChristianHennig for compatibility. In a Danish-taught class, I called the p value a measure of how much the observed value "stemmer overens" (is consistent) with the null hypothesis. Perhaps it helps to rename this into "er kompatibel" (is compatible) :-) – Ute May 29 '23 at 14:43
  • I think a safe way is to view the p-value as measuring the "(statistical) consistency of the data with the null hypothesis". A lower value means less consistency. With this view it is hard to make the common misinterpretations. In particular, it makes clear that the p-value is about the absolute plausibility of the null, and is not a measure of relative plausibility of the null. – Graham Bornholt May 29 '23 at 20:58
  • @GrahamBornholt thank you for "consistency with null" :-). I think we could collect good expressions that help against understanding $p$-values the wrong way (low = do not reject $H_0$), therefore I made this answer a wiki – Ute May 29 '23 at 21:24
  • The consistency interpretation goes back a long way. It's in the Cox and Hinkley (1974) text, for example, and in other Cox papers so it is probably due to him. – Graham Bornholt May 29 '23 at 22:11
  • The students should be exposed to the much more intuitive and actionable Bayesian approach IMHO. But if only teaching the classical frequentist approach, I'd do everything possible to not teach hypothesis tests but rather teach interpretations such as the following: "The ... test resulted in p=0.04, representing modest evidence against the supposition that there is no effect... The test resulted in p=0.11, indicating little evidence against the supposition that the treatment is ignorable...". For both of these examples the phrase "assuming model assumptions hold" needs to be appended. – Frank Harrell May 31 '23 at 12:41
  • @FrankHarrell Something that appears to be difficult about p-values is the apparent change of direction: (p-value = evidence against $H_0$): small = strong, large = little. Therefore it would be easier to have something like "support for", but this is un(frequentist)orthodox; (p-value = support for $H_0$): small = little, large = strong.

    Here it is still frequentist-land, at least for first-year students. I remember, though, that the frequentist approach felt unnatural to me at my first encounter (in high school) with statistics.

    – Ute May 31 '23 at 13:47
  • Good point. It is often difficult for students to learn this but it must be emphasized. You have to think backwards to make sense of null hypothesis tests, as frequentist methods never give you evidence in favor of something. Contrast that with Bayes, where you get direct evidence in favor of every possible assertion, whether that assertion is a null-type or a big-effect-type. – Frank Harrell May 31 '23 at 13:58
1

I've found that some of my students are helped by thinking of the p-value as a percentile. They are familiar with the concept of being in the top 10% of a class by GPA, or being "among the 1%" in terms of wealth.

So for your example, a p-value of 0.04 means "Our observed value of the test statistic $T$ was among the top 4% possible values of $T$ under $H_0$ that are least like $H_0$ and most like $H_A$."

In other words, "Our observed test statistic was among the top 5% most un-$H_0$-like values, but not among the top 1%."
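If helpful, here is a rough Python sketch of this percentile view; the standard normal null distribution for $T$ and the particular observed value are assumptions made only for illustration:

    import numpy as np

    rng = np.random.default_rng(2)
    t_null = rng.standard_normal(100_000)     # values of T under H0 (illustrative null model)
    t_obs = np.quantile(t_null, 0.96)         # an observed value sitting at the 96th percentile

    p_value = np.mean(t_null >= t_obs)        # upper-tail p-value, roughly 0.04
    rank = 100 * np.mean(t_null < t_obs)      # percentile rank of the observed statistic

    print(f"p = {p_value:.3f}; t_obs is at about the {rank:.0f}th percentile under H0,")
    print("i.e. among the top 5% most un-H0-like values, but not among the top 1%.")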

civilstat
  • +1 You seem to be tapping into the psychological literature that suggests people don't think accurately about probabilities but can do very well when the same thing is cast in concrete terms of "out of every 100, ...". To take this to its logical conclusion, then, you might consider stating "This value of $T$ is evidence against $H_0$ in the sense that when $H_0$ is the case, we expect only four out of every 100 experiments to exhibit a value of $T$ that counts against $H_0$ as strongly as the observed value does, while there is some $H_A$ more likely to produce such a strong value of $T.$" – whuber May 31 '23 at 19:40
  • @whuber Thanks! I'm somewhat aware of the psych literature you mention (work by Tversky & Kahneman and others), but I don't know it very deeply. I suspect that my suggested wording ("among the top 4%...") taps into slightly different intuition than your suggested wording ("only 4 out of every 100..."), but I would be really curious to know if they have been compared in the psych literature. – civilstat May 31 '23 at 20:27
  • I believe Gerd Gigerenzer has written about this. – whuber May 31 '23 at 22:01