To see if the test is "good" you need to analyse the properties of the power function
As you point out, it is possible to derive a test statistic ---and thereby derive the "evidentiary ordering" defining what is "extreme" in the test--- using formal methods like the likelihood-ratio method. It is also possible to formulate a test statistic on a more intuitive basis. Both of these are allowed, but ultimately the quality of the test is checked by looking at its properties in relation to correct inference under different parameter values. In particular, this generally involves an analysis of the frequentist properties of the power function.
For context, this explanation builds on my explanation of a hypothesis test in this related answer.
Suppose your test has an unknown parameter $\theta$ and disjoint hypothesis spaces $\Theta_0$ and $\Theta_1$ corresponding to the two hypotheses. Given a stipulated evidentiary ordering $\succeq$, there is a resulting p-value function $p$, and the corresponding power function (allowing for a variable significance level and sample size) is:
$$\text{Power}(\theta, \alpha, n)
\equiv \mathbb{P}(\text{Reject } H_0 | \theta)
= \mathbb{P}(p(\mathbf{X}_n) \leqslant \alpha | \theta).$$
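As a rough sketch of how you would compute this in practice (using a one-sided z-test for a normal mean with known $\sigma$ purely as a stand-in example), you can estimate the power function by simulation: generate many samples under a given $\theta$, compute the p-value for each, and record the proportion that fall at or below $\alpha$.

```python
import numpy as np
from scipy import stats

def simulated_power(theta, alpha, n, n_sims=10_000, theta0=0.0, sigma=1.0, seed=0):
    """Monte Carlo estimate of Power(theta, alpha, n) = P(p(X_n) <= alpha | theta)
    for a one-sided z-test of H0: theta <= theta0 against H1: theta > theta0."""
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, sigma, size=(n_sims, n))        # n_sims samples of size n under theta
    z = (x.mean(axis=1) - theta0) / (sigma / np.sqrt(n))  # test statistic for each sample
    p_values = stats.norm.sf(z)                           # upper-tail p-values
    return np.mean(p_values <= alpha)                     # proportion of rejections

# Near the null boundary the power should sit near alpha; under the alternative it should be higher.
print(simulated_power(theta=0.0, alpha=0.05, n=50))   # roughly 0.05
print(simulated_power(theta=0.3, alpha=0.05, n=50))   # clearly above 0.05
```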
The power function fully determines the probabilities of Type I/Type II error in the test under each possible parameter value. Consequently, the way we find out whether the hypothesis test is any good is by analysing the properties of the power function to see if it gives us suitably low probabilities of error. In particular, we examine the frequentist properties of the test by seeing what happens to the power function over all possible values of $\theta$. One of the most important things we would look at is the consistency of the test, which is a property that depends on what happens to the power as $n \rightarrow \infty$. At a minimum, we want to see the following property:$^\dagger$
$$\begin{align}
\lim_{n \rightarrow \infty} \text{Power}(\theta, \alpha, n) &\leqslant \alpha
\quad \quad \quad \text{for } \theta \in \Theta_0, \\[12pt]
\lim_{n \rightarrow \infty} \text{Power}(\theta, \alpha, n) &= 1
\quad \quad \quad \text{for } \theta \in \Theta_1.
\end{align}$$
This property says that, with enough data, the test becomes very good at rejecting a false null hypothesis (the probability of Type II error goes to zero in the limit) while still respecting the significance level as the limiting size of the test. There are other valuable properties of the power function that we would also examine. We may even compare two hypothesis tests that use different definitions of "extreme" and find that one has much better properties than the other (e.g., one test "dominates" the other in terms of having lower probabilities of Type I and Type II error).
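To see these limits in a concrete case, consider the familiar one-sided z-test for a normal mean with known $\sigma$ (used here just as a simple illustration), where the power function has a closed form:

$$\text{Power}(\theta, \alpha, n) = 1 - \Phi \bigg( z_{1-\alpha} - \frac{\sqrt{n} (\theta - \theta_0)}{\sigma} \bigg),$$

where $\Phi$ is the standard normal distribution function and $z_{1-\alpha}$ is its $1-\alpha$ quantile. At $\theta = \theta_0$ this equals $\alpha$ for every $n$, for $\theta < \theta_0$ it converges to zero, and for $\theta > \theta_0$ the argument of $\Phi$ diverges to $-\infty$, so the power converges to one, which is exactly the limiting behaviour required above.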
Example of power analysis for a bespoke test: In case it interests you, you can find an example of power analysis for a bespoke test in O'Neill (2020). This paper puts forward a new kind of hypothesis test (to test for periodic signals in data/residuals) where the test statistic is formulated "intuitively" (but based on some related work) and the p-value function is approximated using permutation sampling. This test is sufficiently complex that its power function is difficult to compute exactly, and so it is computed using simulation methods over a set of points of interest.
Section 3 of this paper (pp. 9-13) shows a power analysis of the test to check that the test actually "works" --- i.e., that it does indeed detect periodic signals in data (with a high enough sample size) and it doesn't say that they're there when they're not (at least, not beyond the expected rates of Type I error). As you will see from that section, the analysis involves showing the power of the test over a range of sample sizes and parameter values in both the null and alternative regions (under one or more stipulated significance levels) to see if it is doing what it should be doing. There is also some deeper simulation analysis showing the distribution of the p-value in these cases. The probability of Type I error in the test is held at its appropriate rate by construction, and the probability of Type II error is analysed by simulation, by computing the power function at a set of points of interest. What we are looking for in the latter case is to confirm that the power of the test tends towards one under every parameter value in the alternative space as we get more and more data.
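To give a rough idea of what that kind of tabulation looks like in code (again using the simple z-test stand-in from earlier rather than the actual test in the paper), you can compute the simulated power over a grid of parameter values and sample sizes:

```python
import numpy as np
from scipy import stats

def power_grid(thetas, ns, alpha=0.05, n_sims=5_000, theta0=0.0, sigma=1.0, seed=0):
    """Simulated power of the one-sided z-test stand-in over a grid of
    parameter values (rows) and sample sizes (columns)."""
    rng = np.random.default_rng(seed)
    grid = np.empty((len(thetas), len(ns)))
    for i, theta in enumerate(thetas):
        for j, n in enumerate(ns):
            x = rng.normal(theta, sigma, size=(n_sims, n))
            z = (x.mean(axis=1) - theta0) / (sigma / np.sqrt(n))
            grid[i, j] = np.mean(stats.norm.sf(z) <= alpha)
    return grid

# Parameter values spanning the null point (0.0) and the alternative region,
# against increasing sample sizes; power should stay near alpha in the first
# row and rise towards one along the other rows as n grows.
print(np.round(power_grid(thetas=[0.0, 0.1, 0.2, 0.4], ns=[20, 50, 100, 500]), 3))
```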
In this particular paper, the power function exhibits the type of properties you would want it to have, which gives us confidence that the test is "good". In particular, in Figure 5 (p. 13) you can see that the power of the test increases towards one as the sample size increases under each parameter value in the alternative space. Moreover, as should be expected, the rate of increase of the power is much higher when the parameter is far away from the null value. Now, that gives us a basic "sense check" of the test, but it doesn't guarantee that there isn't some other test that will dominate the present test. If someone else were to formulate an alternative test for periodic signals in data, it would be possible to compare the power functions of the two tests to see if one of them is unambiguously better than the other (or if they are each better/worse at certain parameter values in the alternative space).
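As a simple sketch of what such a comparison can look like (using a one-sample t-test and a sign test as stand-in competitors for a location shift under normal data, not tests from the paper), you can simulate both power functions at the same settings and see whether one of them dominates:

```python
import numpy as np
from scipy import stats

def compare_power(theta, n, alpha=0.05, n_sims=5_000, theta0=0.0, seed=0):
    """Simulated power of a one-sided one-sample t-test versus a one-sided
    sign test of H0: location <= theta0, under normal data with location theta."""
    rng = np.random.default_rng(seed)
    reject_t = reject_sign = 0
    for _ in range(n_sims):
        x = rng.normal(theta, 1.0, size=n)
        p_t = stats.ttest_1samp(x, popmean=theta0, alternative="greater").pvalue
        p_sign = stats.binomtest(int(np.sum(x > theta0)), n, p=0.5,
                                 alternative="greater").pvalue
        reject_t += (p_t <= alpha)
        reject_sign += (p_sign <= alpha)
    return reject_t / n_sims, reject_sign / n_sims

# Under normal data the t-test typically achieves higher power at each point in
# the alternative space, so it is the better of these two tests in this setting.
print(compare_power(theta=0.3, n=40))
```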
(Another simulated power analysis for a different bespoke test can be found in O'Neill (2023) (or arXiv version, pp. 23-28). I know I'm bombarding you with my own papers, so excuse the self-indulgence; these are just the examples that I'm most familiar with.)
$^\dagger$ It is also worth noting that the power of a test can be computed using a generating mechanism for the data that is outside the stipulated model form for the test. In this case there might not be a clear parameter $\theta$, but there should still be some way to decide whether the null hypothesis is true or false under the alternative model. Here you would formulate some alternative model $\mathscr{M}$ and compute the power $\text{Power}(\mathscr{M}, \alpha, n)$ by simulating data from this model. You can then compare the computed power to what you would want it to be (depending on whether the null hypothesis is true or false under your new model) to see if the test has good properties when the true model $\mathscr{M}$ falls outside the scope of what was stipulated as the model form when you created the test. This broader kind of power analysis provides information about the "robustness" of the hypothesis test against a failure of its model assumptions. It is just as easy to do as regular power analysis within the model; the only difference is that we simulate the data from a different model instead.
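As a rough sketch of this broader kind of power analysis (using a one-sample t-test that assumes normality, with data actually generated from a heavy-tailed t-distribution as the alternative model $\mathscr{M}$), the only change from the earlier simulation sketches is the model the data are drawn from:

```python
import numpy as np
from scipy import stats

def power_under_model(sample_model, n, alpha=0.05, n_sims=10_000, theta0=0.0, seed=0):
    """Simulated rejection rate of a one-sided one-sample t-test of
    H0: mean <= theta0 when the data come from an arbitrary generating model."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        x = sample_model(rng, n)   # data simulated from the (possibly misspecified) model M
        p = stats.ttest_1samp(x, popmean=theta0, alternative="greater").pvalue
        rejections += (p <= alpha)
    return rejections / n_sims

# Heavy-tailed model with mean zero: the null hypothesis is true under this model,
# so the rejection rate should stay near alpha if the test is robust here.
heavy_tailed_null = lambda rng, n: rng.standard_t(df=3, size=n)
print(power_under_model(heavy_tailed_null, n=50))

# The same model shifted into the alternative region: the power should be well above alpha.
heavy_tailed_alt = lambda rng, n: rng.standard_t(df=3, size=n) + 0.5
print(power_under_model(heavy_tailed_alt, n=50))
```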