3

I am trying to understand more clearly in which cases an adjustment of p-values is necessary. At the moment my reasoning about this can be summarized as follows:

Benjamini and Hochberg (1995) list three problems/scenarios where adjustment of the type I error rate is called for:

  1. multiple outcomes ("multiple end points problem"),
  2. multiple comparisons ("multiple-subgroups problem") and
  3. "screening problem" (e.g. screenin of multiple predictors).

Some, for instance Perneger (1998), have argued that such adjustment is necessary only in cases in which one wants to make an inference about a "universal hypothesis" from multiple tests.

To me, all three of the above scenarios can be understood as such a case: In scenario 1) the overall (= universal) hypothesis is that the treatment has an effect (on at least one of the outcomes). In scenario 2) the overall hypothesis is that there are (any) differences between conditions. In scenario 3) the overall hypothesis might be that any of the predictors affects the outcome (although that logic is less clear to me in this case).

So does that mean that drawing an inference about such an overall hypothesis (using several separate tests) is the main/real/only reason for using p-value adjustment methods?

And does that mean that if I strictly avoid inferences about overall hypotheses of any kind (whether directly, implicitly, or otherwise), then there is no reason to use such adjustment methods?

Or am I misunderstanding something here (and if so, what)?

MrMax
  • 226
  • How would you strictly avoid inferences about overall hypotheses of any kind? What is the alternative way of interpreting the results of, e.g., hypothesis testing with multiple comparisons? – Ryan Volpi Jun 24 '22 at 12:16
  • @RyanVolpi: If you are implying that avoiding inferences about an overall hypothesis is not always possible and really difficult in other cases, then I agree. With regards to scenario 1) I guess it means to report for each outcome whether the null could be rejected or not, thereby treating each test as a separate inferential target. I guess the crucial point is to avoid any interpretation of the tests as an answer to any overall hypothesis. Maybe it would even help to explicitly state in the report that the unadjusted separate tests do not allow any overall conclusion. – MrMax Jun 29 '22 at 10:30

3 Answers

2

Think of it like this. In a one-way ANOVA setup, you are testing whether the population means of more than two independent groups are equal. If there are only two independent groups, the testing problem reduces to an independent-samples t-test. Otherwise, if you find significance in the ANOVA, you have to follow up with pairwise t-tests using p-value adjustment methods. That means that if you don't have to test a universal hypothesis using multiple tests, the p-value adjustment methods won't be necessary.
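
Here is a minimal sketch of that workflow in Python (the simulated group data and the choice of the Holm adjustment are illustrative assumptions, not prescriptions):

```python
# Sketch: omnibus ANOVA first, then adjusted pairwise t-tests.
# The group data here is simulated; in practice use your own samples.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (0.0, 0.0, 0.8)]

# Omnibus test of the universal hypothesis "all group means are equal".
f_stat, p_omnibus = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_omnibus:.4f}")

# Only if the omnibus test is significant, follow up with pairwise
# t-tests and adjust their p-values for multiplicity.
if p_omnibus < 0.05:
    pairs = [(0, 1), (0, 2), (1, 2)]
    raw_p = [stats.ttest_ind(groups[i], groups[j]).pvalue for i, j in pairs]
    reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
    for (i, j), p_adj, rej in zip(pairs, adj_p, reject):
        print(f"group {i} vs {j}: adjusted p = {p_adj:.4f}, reject = {rej}")
```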

Hope it helps!

EDIT: For further reference, I think the following link will be helpful. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6099145/

EDIT no. 2: As pointed out by the author of this question, the comments under my response explain the concept more clearly than the original response itself, so I'll add my comments here.

If I go a little technical, the main reason for adjusting p-values is to control the family-wise error rate (or experiment-wise error rate). And that becomes an issue when multiple tests or multiple comparisons are performed. That's why you don't need it when you are testing a universal hypothesis without employing any multiple comparison or testing procedure.
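
To make the inflation concrete, here is a small sketch (assuming m independent tests, each run at per-test level alpha) that computes the unadjusted family-wise error rate 1 - (1 - alpha)^m and the rate after a Bonferroni adjustment:

```python
# Sketch: how the family-wise error rate (FWER) grows with the number of
# independent tests m at per-test level alpha, and how a Bonferroni
# adjustment (testing each at alpha / m) keeps it at or below alpha.
alpha = 0.05
for m in (1, 3, 5, 10, 20):
    fwer_unadjusted = 1 - (1 - alpha) ** m          # P(at least one false rejection)
    fwer_bonferroni = 1 - (1 - alpha / m) ** m      # after Bonferroni adjustment
    print(f"m = {m:2d}: unadjusted FWER = {fwer_unadjusted:.3f}, "
          f"Bonferroni FWER = {fwer_bonferroni:.3f}")
```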

Suppose you want to check whether an aeroplane functions properly. Surely there are methods where you can check whether the whole plane is in good condition to fly. But if you go for a thorough procedure, you would want to check it part by part. Now when you are checking each part of the aeroplane separately, and trying to infer from them about the whole aeroplane, you'll have to account for some adjustments keeping all the parts in mind. Otherwise you might conclude the aeroplane is ready to fly when in reality its wing is not in balance with its empennage, or something like that.

DevD
  • 115
  • Thank you for your response. I am not sure I understand it correctly. Are you saying that because I should follow up a significant ANOVA with pairwise comparisons with adjustment, I don't need adjustment in cases where I don't have to test a universal hypothesis? Isn't that a weird conclusion? Why should the need to use adjustment in one case allow me to conclude that it is unnecessary in other cases? – MrMax Jun 29 '22 at 08:59
  • I looked at the article you have linked to. In the title and abstract the authors claim to present "why and when" to adjust p-values. However, I was a bit disappointed when reading the text, as I feel like they have not really done that. Instead I think they have presented an example analysis (about gene expression) in which multiple tests are undertaken, which means that the overall type I error rate is inflated and therefore adjustment is called for. They do not specify the target of inference/overall hypothesis for which the correction is necessary. [continued in the next comment] – MrMax Jun 29 '22 at 09:12
  • I also think that aside from the mentioned example they do not really describe/specify when (i.e. in which cases) adjustment is necessary and in which cases it is not. That is exactly the problem I have with a lot of the literature on this (and the reason for me to post this question): I do not want a potentially incomplete list of examples where adjustment is called for. Something like that is only helpful if my analysis is like one of the examples on the list. Instead I want to understand why and when to use it, so I can make the decision myself even if my analysis differs from the examples. – MrMax Jun 29 '22 at 09:14
  • @MrMax If I go a little technical, the main reason for adjusting p-values is to control the family-wise error rate or experiment-wise error rate. And it happens when multiple tests or multiple comparisons are performed. That's why you don't need it when you are testing a universal hypothesis without employing any multiple comparison or testing procedure. – DevD Jun 29 '22 at 10:27
  • @MrMax Suppose you want to check whether an aeroplane functions properly. Surely there are methods where you can check whether the whole plane is in good condition to fly. But if you go for a thorough procedure, you would want to check it part by part. Now when you are checking each part of the aeroplane separately, and trying to infer from them about the whole aeroplane, you'll have to account for some adjustments keeping all the parts in mind. Otherwise you might conclude the aeroplane is ready to fly when in reality its wing is not in balance with its empennage, or something like that. – DevD Jun 29 '22 at 10:37
  • Thank you for further explaining your answer. So, I now understand that you are saying that I do not need adjustment if I test the overall hypothesis using a single test (an omnibus test like an ANOVA) instead of doing multiple comparisons, because only in the presence of multiplicity is the (family-wise) type I error rate inflated, meaning it becomes higher than the intended significance level. Right? – MrMax Jun 29 '22 at 10:40
  • Precisely what I'm trying to say! – DevD Jun 29 '22 at 10:41
  • To me this makes your response a helpful one as it points out an alternative method (using an omnibus test) which avoids the problem/necessity of multiple comparisons. It does not directly answer the questions that I posted, though. Also, as I only fully understand the point of your original response through these comments (and others might feel the same), it might be helpful to adapt your original response? – MrMax Jun 29 '22 at 10:51
  • @MrMax I will add my comments in the original response – DevD Jul 01 '22 at 05:20
2

I think you are essentially correct. As you can see, the common thread in the three scenarios you describe is that multiple comparisons are made in order to draw some overall inference. When you perform hypothesis testing with multiple comparisons, you generally end up looking at the test with the lowest p-value, because this is the one with the greatest evidence for a deviation from the null hypothesis. This means that you are effectively "optimising" over multiple tests. Under the null, the lowest p-value is not uniformly distributed; its distribution puts heavy density on lower values. Adjustment is then required to get an "overall p-value" that accounts for this and moves the null distribution back to a uniform distribution (hopefully).
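
Here is a quick simulation sketch of that claim (assuming m = 10 independent tests under the null; the Šidák transform is one way to map the minimum p-value back to uniform, and is exact only under independence):

```python
# Under the null, each individual p-value is Uniform(0, 1), but the
# *minimum* of m p-values is not: its CDF is 1 - (1 - p)^m. The Sidak
# adjustment 1 - (1 - p_min)^m maps it back to Uniform(0, 1).
import numpy as np

rng = np.random.default_rng(0)
m, n_sims = 10, 100_000
p = rng.uniform(size=(n_sims, m))   # m null p-values per simulated "study"
p_min = p.min(axis=1)
p_adj = 1 - (1 - p_min) ** m        # Sidak-adjusted "overall" p-value

# Rejection rates at the 5% level: inflated without adjustment, ~5% with it.
print("P(min p < 0.05)      =", (p_min < 0.05).mean())   # ~0.40 for m = 10
print("P(adjusted p < 0.05) =", (p_adj < 0.05).mean())   # ~0.05
```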

Ben
  • 124,856
  • That sounds like an interesting perspective from which to look at this. However, I am struggling to understand it. I have a couple of questions (I numbered them for more convenient referencing). 1. You are saying that for a decision about the overall inference, I look at the lowest p-value. Is it not more in line with the logic of significance testing to look for whether there is (at least) one p-value below the significance threshold or not? So is it not more intuitive to think about a binary (Bernoulli?) distribution? – MrMax Jun 29 '22 at 09:50
  • 2. I guess you are implying that multiple testing can be viewed from an optimization perspective. Do you mean that the parameter I am optimizing over is the choice of the comparison? And the function I am optimizing is the p-value associated with each choice? Let's say I have a case where the first of 10 comparisons, which has a p-value of 0.3, is the one with the lowest p-value. How is the choice of that comparison the relevant parameter with regards to the overall inference? Does it still make sense to see this as an optimization problem? If so, how? – MrMax Jun 29 '22 at 10:02
  • 3. Can you explain why the sampling distribution of the (lowest) p-value under the null hypothesis should be uniform? That idea/perspective is somehow new to me. I ran a quick simulation and it seems you are right. Still, I feel that I might not be the only one who does not understand why this should be the case. I feel like some explanation/justification for why the overall inference is "proper" if (and only if) the distribution of the lowest p under the null is uniform would help a lot to understand the problem via this argumentation. – MrMax Jun 29 '22 at 10:15
  • 1. Those two things are equivalent, i.e., looking at the lowest p-value compared to a threshold is equivalent to determining if at least one p-value is below the threshold. 2. If you are looking at the minimum of a set of things, that is a type of optimisation, which is all I mean here. 3. Why are p-values uniformly distributed under the null hypothesis? – Ben Jun 29 '22 at 14:13