19

When we fail to reject the null hypothesis in hypothesis testing, which of the below is the best interpretation?

  • We have no evidence at our significance level $\alpha$ to reject $H_0$.
  • We have insufficient evidence at our significance level $\alpha$ to reject $H_0$.

I have seen "no evidence" used frequently, but "insufficient evidence" seems much better to me. Say we get a $p$-value of $p$ in a hypothesis test. If we'd happened to have chosen an $\alpha$ greater than $p$, we'd have evidence to reject $H_0$ at that $\alpha$, but if we'd happened to have chosen an $\alpha$ less than $p$, we'd fail to reject at that $\alpha$. In both cases we have the same amount of evidence against $H_0$ (since we have the same $p$-value); we'd just have used a different threshold in the two cases. So to me it makes much more sense to say we have sufficient or insufficient evidence at our $\alpha$, rather than no evidence, since our interpretations are $\alpha$-dependent and not entirely $p$-value-dependent.
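To make the $\alpha$-dependence concrete, here is a minimal Python sketch (the $z$-statistic of $2.1$ is just an illustrative value, not from any particular study): the $p$-value is fixed by the data, but the decision flips depending on which threshold was chosen beforehand.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

p = two_sided_p(2.1)  # p is fixed by the data (here about 0.036)
for alpha in (0.05, 0.01):  # the threshold is the researcher's choice
    decision = "reject H0" if p <= alpha else "fail to reject H0"
    print(f"alpha = {alpha}: p = {p:.3f} -> {decision}")
```

With $\alpha=0.05$ we reject; with $\alpha=0.01$ we do not, even though the evidence (the $p$-value) is identical in both cases.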

  • 2
  • 2
    If you look at Alexis's linked answer, please also read the one below it! – Michael Lew Apr 30 '23 at 21:28
  • 1
    What is your null hypothesis? If your null hypothesis is that some population parameter $\theta$ is less than 0, and you take an unbiased random sample and measure $\hat\theta$ to be greater than 0, then I would say that you have some evidence that $\theta > 0$. OTOH, if your null hypothesis is that the distribution of $\theta$ is normal, and you take an unbiased random sample and find your sample distribution to be non-normal, well.... of course it's non-normal: it's a discrete distribution! Is that some evidence? Maybe that's a subjective call. – Him May 01 '23 at 01:34
  • 1
    The key question is: what does evidence mean to you? Is it a belief? Is it a probability? Is it even well defined? Does "evidence" make sense when the space of alternative hypotheses include values that are arbitrarily close to the null, yet the totality of decisions is either accept vs. reject the null? You can skip the word altogether such as "...a p-value of 0.053 which was not statistically significant at the 0.05 level." – AdamO May 01 '23 at 16:35
  • 1
    The ASA statement on $p$-values points out that $p$-values are weakly connected to "evidence" (of a hypothesis being true). The problem is that both statements attempt to quantify evidence from the $p$-value: no evidence = 0 evidence; little evidence = an $\epsilon$ of evidence. And what would you say about your "evidence" if you conducted a one-tailed test and the finding were in the opposite direction of your expectation? Is that negative evidence :) ? – AdamO May 01 '23 at 16:39
  • @AdamO Good points. What do we mean by evidence? It's not really a probability, because the p-value is the probability of the data if the null were true, and we're making statements about the evidence of the null or alternative being true in hypothesis testing – Alex Michael May 02 '23 at 00:40
  • @AdamO You make a good point: the word 'evidence' is not enough without specifying whether the evidence favours or disfavours the hypothesis under question. "Negative evidence" would be a very unfortunate way to describe evidence against something, I think. – Michael Lew May 02 '23 at 00:47
  • 1
    @AlexMichael Whoops! The p-value is NOT "the probability of the data if the null was true". However, the likelihood function would include a point that is proportional to that, and a comparison (ratio) of that likelihood with others on the same likelihood function would give you a straightforward and technically valid impression of the evidence in the data concerning the parameter values of the statistical model. The relationship between p-values and evidence is complicated, but likelihoods express evidence directly. – Michael Lew May 02 '23 at 00:51
  • @AdamO Ah yes, sorry, loose language. "The p-value is the probability of observing our data or data further from the null value, if the null was true" would be better – Alex Michael May 02 '23 at 01:09

8 Answers

14

In my experience, "insufficient evidence" is the least ambiguous and most often used way to describe a failure to reject $H_0$. The reasoning in my mind is that in statistics we hardly ever deal with absolutes. That said, this is more an interpretation of language. We can think of a test that fails to reject $H_0$ as having no evidence at its current state (given the current data, specific test, and set thresholds). The problem with this, though, is that at first glance (to someone not too familiar with hypothesis testing, for instance) it glosses over the fact that our test is only as precise or correct as our data/test/threshold allows it to be.

That is why I agree with you that "insufficient" is a better way of communicating a failure to reject. That said, this may be a difference in language between different fields.

One thing to note: I do feel that your reasoning about switching $\alpha$ with regard to evidence is not entirely correct. A significance level is set before the test occurs and stays set; otherwise, the test's conclusions become muddled. One way to gain more evidence is to collect more data related to what is being tested.

392781
  • 160
  • 2
    In terms of the comment around choosing $\alpha$ after the hypothesis test, thanks for pointing that out. I was more meaning if we'd happened to have chosen an $\alpha$ before the test that turned out to be less than our $p$-value, rather than deliberately choosing $\alpha$ to suit the $p$-value we get after the test. I've edited the post to reflect this. – Alex Michael Apr 30 '23 at 06:23
  • 2
    No worries. I figured it wouldn't hurt to point out just in case. – 392781 Apr 30 '23 at 18:49
14

The sentence "... evidence to reject $H_0$" does not make much sense to me because you either reject $H_0$ when $p\leq\alpha$ or you don't. It's your decision to reject or not reject. "Rejection" is not an inherent property of the $p$-value because it requires an additional criterion set by the researcher.

What makes more sense is to talk about the evidence against the null hypothesis provided by the $p$-value. If we adopt the view$^{[1,2]}$ that the $p$-value is a continuous measure of compatibility between our data and the model (including the null hypothesis), it makes sense to talk about various degrees of evidence against $H_0$. Personally, I like the approach of Rafi & Greenland$^{[1]}$ of transforming the $p$-value into (Shannon) surprise as $s=-\log_2(p)$ (aka Shannon information). For an extensive discussion of the distinction between $p$-values for decisions and $p$-values as compatibility measures, see the recent paper by Greenland$^{[2]}$. The surprise transformation provides an absolute scale on which to view the information that a specific $p$-value provides. If a single coin toss provides $1$ bit of information, a $p$-value of, say, $0.05$ provides $s=-\log_2(0.05)=4.32$ bits of information against the null hypothesis. In other words: a $p$-value of $0.05$ is roughly as surprising as seeing all heads in four tosses of a fair coin.

This approach makes it very clear that the evidence provided by a $p$-value is nonlinear. For example: a $p$-value of $0.10$ provides $3.32$ bits of information, whereas a $p$-value of $0.15$ provides $2.74$ bits. The first $p$-value thus provides roughly $21$% more evidence against $H_0$ than the second. In a second example, a $p$-value of $0.001$ provides roughly $132$% more evidence than a $p$-value of $0.051$, despite the absolute difference between them being the same as in the first example ($0.05$). Here is an illustration from paper $[1]$:
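The numbers above can be reproduced with a few lines of Python (a sketch illustrating the $s=-\log_2(p)$ transformation, not code from the cited papers):

```python
import math

def s_value(p):
    """Shannon surprise (s-value) in bits: s = -log2(p)."""
    return -math.log2(p)

print(f"{s_value(0.10):.2f}")  # 3.32 bits
print(f"{s_value(0.15):.2f}")  # 2.74 bits
print(f"{s_value(0.10) / s_value(0.15):.2f}")    # 1.21 -> roughly 21% more evidence
print(f"{s_value(0.001) / s_value(0.051):.2f}")  # 2.32 -> roughly 132% more evidence
```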

[Figure from Rafi & Greenland$^{[1]}$ illustrating the relationship between $p$-values and $s$-values.]

To answer the question: As long as the $p$-value is smaller than $1$, it provides some evidence against the null hypothesis because it shows some incompatibility between the data and the model. To say "no evidence" would therefore not be entirely accurate.

References

$[1]$: Rafi, Z., Greenland, S. Semantic and cognitive tools to aid statistical science: replace confidence and significance by compatibility and surprise. BMC Med Res Methodol 20, 244 (2020). https://doi.org/10.1186/s12874-020-01105-9

$[2]$: Greenland, S. (2023). Divergence versus decision P-values: A distinction worth making in theory and keeping in practice: Or, how divergence P-values measure evidence even when decision P-values do not. Scand J Statist, 50(1), 54–88. https://doi.org/10.1111/sjos.12625

COOLSerdash
  • 30,198
8

It might be helpful to distinguish between the "objective" and "subjective" parts of statistical testing. You assume a null hypothesis $H_0$, observe data, compute a statistic, and obtain a $p$-value. You might not have used the "optimal" statistic, obtained the sharpest probabilistic bounds, etc., but there is a fixed process that transforms the data into a $p$-value based on $H_0$. At this point, the $p$-value is your "evidence," and its strength is inversely related to its magnitude: the smaller the $p$-value, the stronger the evidence.

Now, "rejecting" the null hypothesis based on a pre-chosen value of $\alpha$ is somewhat subjective, as it is based on your intuition about "how much evidence is enough evidence". Picking $\alpha$ after seeing the $p$-value is problematic because you willingly influence the outcome by varying $\alpha$, i.e. you are able to "move the goalposts".

Ultimately, I'd agree with 392781's answer, that there is "insufficient evidence," provided you have defined in advance what "sufficient evidence" would look like, in the form of picking $\alpha$. Still, it's helpful to remember that "evidence" is not a perfect word here, because it is often used to refer to discrete, objective reasoning, rather than probabilistic heuristics.

7

This is to some extent similar to some other answers, however I feel still worth saying.

What I teach (and have seen elsewhere) is to either test at a fixed level $\alpha$, or to use more graded "evidence language". If we fix a level, I'd just say "We do not reject at level $\alpha$" (or we do, of course). Maybe (if you want to bring the term evidence in), "there is no significant evidence" (at level $\alpha$; unless there is).

Alternatively, I'd interpret test results in a non-binary way, saying "there is very strong / strong / modest / weak / no evidence" for $p<0.001$ / $p<0.01$ / $p<0.05$ / $p<0.1$ / $p>0.1$, respectively.
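That graded wording is easy to encode. A minimal sketch of the mapping described above (the function name is mine, and the cut-offs are this answer's convention, not a universal standard):

```python
def evidence_label(p):
    """Graded 'evidence language' for a p-value, using the cut-offs above."""
    if p < 0.001:
        return "very strong evidence"
    if p < 0.01:
        return "strong evidence"
    if p < 0.05:
        return "modest evidence"
    if p < 0.1:
        return "weak evidence"
    return "no evidence"

print(evidence_label(0.03))  # modest evidence
print(evidence_label(0.2))   # no evidence
```

Note that the label depends only on the $p$-value, not on any pre-chosen $\alpha$.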

I don't like the term "insufficient", as it seems to suggest that we wanted to reject but failed to do so (same with the wording in the question "fail to reject"), whereas I think that a scientist should be open to any result rather than hoping for significance (even though in many cases it may be arguably more honest to say something like "I wanted significance so much but didn't get it, boohoo", in which case the researcher probably better says it this way so that people know what to think of the researcher's neutrality...).

  • Thanks for the post. So you'd say the wording "no evidence" rather than "insufficient evidence" is better because "insufficient evidence" encourages the bad practice of setting out to reject the null/get a positive result? – Alex Michael May 02 '23 at 01:14
  • 1
    @AlexMichael Yes. – Christian Hennig May 02 '23 at 18:10
  • 1
    @AlexMichael However I would not use "no evidence" whenever a $H_0$ is not rejected. For example if p=0.03 I'd use "moderate evidence" regardless of at what $\alpha$-level you might want to test. The strength of evidence implied by $p=0.03$ does not depend on the specified level of the test. – Christian Hennig May 02 '23 at 20:38
  • Thanks @ChristianHennig. Agree with you on that, and disagree with Ben's answer below on that point. – Alex Michael May 02 '23 at 22:15
4

The two sentences have nearly the same meaning.

The phrase with 'insufficient' is just placing more stress on the idea that there is a gradual range of evidence, and that there is a 'boundary for the amount of evidence' that has not been passed.

The other phrase can be regarded as a shortened/abbreviated sentence saying more or less the same thing: "We have no evidence (that is sufficient)." It has the same meaning, just stated in a different way.

4

Unless the experiment or study result showed a parameter exactly equal to the null-hypothesis value, you do have some evidence against the null. If you have established a threshold for what $p$-value constitutes "sufficient" evidence and the observed $p$-value is greater than your threshold, then you have "insufficient" evidence. The $p$-value is really a feature of the data.

It was Neyman and Pearson who formulated hypothesis testing as an accept-reject paradigm. Up to that point (the 1940s?), the Fisherian formalism was to report the $p$-value and let it speak for itself. Fisher attacked the N-P formalism vociferously. And that didn't end the argument, because the Bayesians were yet to be heard.

I think it’s interesting that this is generating some discrepant responses. So far, 6 upvotes and 7 downvotes. I thought it was a well-established principle that the null was almost never “true”.

DWin
  • 7,726
  • 1
    "Unless the experiment or study result was exactly equal to the Null Hypothesis" - what does this mean? An "experiment or study result" is not a hypothesis. – Christian Hennig May 01 '23 at 00:04
  • The null is usually of the form $\mu = 0$, where $\mu$ is a mean, a difference in means, or some other parameter. The result is usually not zero but rather something non-zero. So the hypothesis is usually not correct, and there is some evidence, possibly weak, that the null is incorrect. The same holds for one-sided tests. – DWin May 01 '23 at 01:55
  • 1
    @ChristianHennig The null hypothesis typically says that a parameter of the statistical model takes a specified value. The phrase in question implies that the experimentally observed estimate of that parameter value is equal to the hypothesised value. – Michael Lew May 01 '23 at 04:33
3

The accept/reject procedure of a hypothesis test is only designed to yield the long-run error rate properties of the test. It deals with 'evidence' in the data only vaguely, and only to the extent that it gives a decision that the evidence is strong enough (according to the pre-data specified level of $\alpha$) to require the null hypothesis to be discarded, or not strong enough. It does not, by itself or by design, provide for any evidential assessment beyond that. However...

The hypothesis test method published by Neyman & Pearson did not depend on a p-value (and did not provide one), but modern usage of the hypothesis tests almost always involves comparing a p-value to a threshold rather than looking to see if the test statistic falls in a "critical region". It is the p-value that lets you make statements about the strength of the evidence in the data against the null hypothesis, according to the statistical model.

The p-value is best understood as a product of a (neo-) Fisherian significance test rather than a hypothesis test or the hybrid thing often called 'NHST'.

To some the distinction seems subtle and rather pointless, but if you want to know what the testing procedures allow you to know and the types of inferences that they support I think the distinction is essential. See here for my extended take on the topic: https://link.springer.com/chapter/10.1007/164_2019_286

If you want to talk of evidence and to persist with the conventional accept/reject approach then you need to know that, depending on the alpha that you choose and the experimental design, you may be rejecting the null hypothesis with fairly weak evidence or with very strong evidence.

Michael Lew
  • 15,102
3

My preference is to use "no evidence"

The testing in a classical hypothesis test is a binary decision, so in this context I prefer to use "no evidence" vs "evidence". It is best not to conflate the decision to reject the null hypothesis (which is fixed by the data and has no uncertainty) with the underlying truth or falsity of the hypotheses (which is uncertain). For that reason I would recommend you avoid talking about "evidence to reject" and instead use wording that either refers to evidence in favour of the alternative hypothesis, or the actual rejection decision that was made:

  • We found no evidence in favour of $H_A$ at the significance level $\alpha$.

  • We reject $H_0$ in favour of $H_A$ at the significance level $\alpha$.

  • We found evidence in favour of $H_A$ at the significance level $\alpha$.

  • We do not reject $H_0$ in favour of $H_A$ at the significance level $\alpha$.

Alternatively, you can build in the "statistically significant" description:

  • We found no statistically significant evidence in favour of $H_A$ (at the $\alpha$ level).

  • We found statistically significant evidence in favour of $H_A$ (at the $\alpha$ level).

Alternatively, in many contexts it is more sensible to just state the relevant p-value and characterise the evidence without use of a specific significant level:$^\dagger$

  • We found no evidence in favour of $H_A$ ($p=0.3255$).

  • We found weak evidence in favour of $H_A$ ($p=0.0341$).

  • We found strong evidence in favour of $H_A$ ($p=0.0076$).

  • We found very strong evidence in favour of $H_A$ ($p=0.0008$).

The main reason I prefer not to use "insufficient evidence" is that it suggests some evidence in favour of the alternative hypothesis when that may not be the case. For example, if you have a p-value of $p=0.3255$, that means that if the null hypothesis is true, almost one-third of the time you would see a result that is at least that conducive to the alternative hypothesis. My view is that this is accurately characterised as "no evidence", not "insufficient evidence to reject".
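A quick simulation illustrates this reading of $p=0.3255$ (a sketch assuming a two-sided $z$-test; the seed and sample size are arbitrary): under a true null, roughly a third of replications produce a result at least this conducive to the alternative.

```python
import math
import random

random.seed(1)
trials = 100_000
# Draw test statistics under H0 (standard normal) and count how often the
# two-sided p-value is at most 0.3255.
hits = sum(
    math.erfc(abs(random.gauss(0, 1)) / math.sqrt(2)) <= 0.3255
    for _ in range(trials)
)
print(f"{hits / trials:.3f}")  # close to 0.3255, i.e. nearly one-third
```

Since $p$-values are uniformly distributed under the null, the fraction converges to $0.3255$ itself.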


$^\dagger$ Here I use my own assessments of the strength of evidence, to wit: "weak" for p-value between 0.01-0.05, "strong" for p-value between 0.001-0.01, "very strong" for p-value of 0.001 or lower. Others may take a different view of the appropriate correspondence, but so long as you state the p-value, it should be fine.

Ben
  • 124,856
  • 3
    Saying "no evidence" in favor of the alternative without qualification is a bad idea, especially when the point estimate favors the alternative. Better to say "no statistically significant evidence ..." – Graham Bornholt May 01 '23 at 03:27
  • 1
    Bad idea, as @GrahamBornholt says. There is evidence if there is data, but that evidence may speak in favour or against, strongly or weakly. If you do an experiment and end up with no evidence relevant to the null hypothesis then you've done the wrong experiment! – Michael Lew May 01 '23 at 04:36
  • 3
    Saying "no evidence ... at significance level $\alpha$" is the qualification. – Ben May 01 '23 at 04:46
  • 1
    @MichaelLew: That seems quite a tendentious interpretation of what I have written, given that I consistently refer to "evidence in favour of $H_A$". – Ben May 01 '23 at 04:47
  • 1
    My comment is aimed at statements of 'no evidence' and I'm not sure that the hypothesis being the null or an alternative makes any difference. – Michael Lew May 01 '23 at 07:39
  • When you refer to $H_{A}$, do you mean the complement of the null hypothesis or the point 'alternative hypothesis' used for pre-data sample size evaluations? I assume that it is the former, but I am not sure that it is particularly helpful to think of evidence being in favour of a composite hypothesis. Surely the evidence will favour some regions within the composite and disfavour others. – Michael Lew May 01 '23 at 07:44
  • @Ben Thanks for your reply. How does using "no evidence" rather than "insufficient evidence" help us make the distinction between the decision to reject the null and whether it is true or not? And in terms of saying "we have no evidence at that $\alpha$" when the p-value is greater than alpha, isn't "insufficient evidence" better because we would have evidence at a higher $\alpha$ ("no evidence", even if qualified with "no evidence at $\alpha$", seems to be an objective statement of there being no evidence fullstop) – Alex Michael May 02 '23 at 01:41
  • Further to @MichaelLew's, surely a hypothesis test gives you no information about any alternative hypothesis, only about the null. The only time we really gain anything about the alternative is if it is nothing more than "not the null". But that could include all sorts of things, such as the distributional family of the test statistic being different from the null, or nuisance parameters being different, as well as the parameter of interest being different. – Ian Sudbery May 02 '23 at 12:24
  • Saying ”no evidence” is misleading the audience when there is a large effect but an insufficient sample size to allow the p-value to reach significance. Better to say “we did not find evidence for the Ha at the preset alpha”. – DWin May 02 '23 at 20:45
  • @DWin: That locution is almost identical to what is said in the answer. – Ben May 02 '23 at 22:25
  • Perhaps it should have been “We did not find sufficient evidence to meet the preset criterion for statistical significance.” – DWin May 03 '23 at 20:06