
As someone relatively new to applied statistics, I have been trying to better understand best practices for applying statistical methods. Recently, I have been trying to work out when to use different multiple-comparison procedures.

The texts I read suggest using the method most appropriate for the data and the goals of the testing. In these texts, I frequently come across procedures such as Bonferroni and Tukey's HSD (for controlling the FWER) and Benjamini-Hochberg (for controlling the FDR), with different recommendations for each. They describe Bonferroni as the most conservative and Tukey's HSD as "moderately" conservative. In cases where there are "many" comparisons, techniques that control the FWER become overly conservative, and Benjamini-Hochberg is often recommended as a way out.
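For concreteness, my (possibly imperfect) understanding of the mechanics is this: with $m$ comparisons, Bonferroni rejects any hypothesis whose p-value satisfies $p_i \le \alpha/m$, while Benjamini-Hochberg orders the p-values $p_{(1)} \le \dots \le p_{(m)}$ and rejects the hypotheses corresponding to $p_{(1)}, \dots, p_{(k)}$ with
$$k = \max\left\{ i : p_{(i)} \le \frac{i}{m}\,\alpha \right\},$$
so its per-hypothesis threshold is never stricter than Bonferroni's.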

I am uncomfortable with these vague recommendations. What does "many treatments" mean? What is a low or high sample size, especially when the sample sizes vary by treatment? Ideally, I would like to know not only $\alpha$ but also $\beta$, so that I could understand precisely how conservative these tests are for my treatments. To my surprise, this matter is rarely discussed in the literature. What is the state of the art for deciding which test is appropriate for a given data set?

Argon
  • Often the more important question is whether to use such a procedure. Frequently the answer to that is no. See here for a brief answer to that: https://stats.stackexchange.com/questions/630316/how-many-p-value-observations-do-you-think-are-required-before-doing-fdr-correct/630324#630324 – Michael Lew Feb 06 '24 at 06:02
  • @MichaelLew I would consider this to be the least conservative situation, and as you describe, the advantage is a lowered type II error. This leaves me with the same problem: how can I understand this tradeoff concretely so I can make an informed decision? – Argon Feb 06 '24 at 16:24

1 Answer


Values such as $\alpha$ and $\beta$ (if they mean what I think they mean; it is better to explain all notation, because these letters do not have the same meaning everywhere in the literature) cannot be known; they have to be chosen, ideally in a case-dependent manner, which means that there cannot be a general answer on how such choices should be made. You always need to take into account the consequences of any kind of "error" in the given situation in order to decide how the error probabilities should be limited. This may also have implications for whether the FWER or the FDR is the more relevant quantity to control.

One thing you can do in order to understand the trade-off is to simulate artificial data sets from models that seem relevant in the given situation (the null hypothesis, specific alternatives under which you want to reject H0 with large probability, mixed situations) and then see how the methods and the choices of constants such as the type I and type II error rates (family-wise or otherwise) play out. In other words, you start by making up a generative model that you think could produce "realistic" data and in which you control the truth, and you then see what happens and whether it seems appropriate. What counts as realistic will always depend on the specific situation. You can also vary the "number of treatments" and any other aspects you are interested in.
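Here, for illustration, is a minimal Python sketch of what I mean; the setup (20 one-sample t-tests, 5 of them with a true effect of 0.5 SD, n = 30 per treatment, $\alpha = 0.05$) and the hand-rolled Benjamini-Hochberg function are arbitrary placeholders to be replaced with whatever is realistic for your own situation:

```python
# Minimal fake-data simulation (illustrative sketch, not a recommendation of
# these particular settings): m one-sample t-tests, a few of which have a
# true effect, compared without correction, with Bonferroni, and with
# Benjamini-Hochberg. All constants are arbitrary placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

m = 20          # number of treatments / hypotheses
m_false = 5     # hypotheses for which H0 is actually false
n = 30          # observations per treatment
effect = 0.5    # true mean under the alternative, in SD units
alpha = 0.05    # nominal error rate (FWER or FDR, depending on procedure)
n_sim = 2000    # number of simulated data sets

def bh_reject(p, q):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    n_tests = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, n_tests + 1) / n_tests
    below = p[order] <= thresholds
    reject = np.zeros(n_tests, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest i with p_(i) <= i*q/m
        reject[order[:k + 1]] = True
    return reject

results = {"uncorrected": [], "Bonferroni": [], "Benjamini-Hochberg": []}
for _ in range(n_sim):
    true_mean = np.zeros(m)
    true_mean[:m_false] = effect                       # first m_false hypotheses are non-null
    data = rng.normal(loc=true_mean, scale=1.0, size=(n, m))
    p = stats.ttest_1samp(data, popmean=0.0).pvalue    # one p-value per treatment

    rejections = {
        "uncorrected": p <= alpha,
        "Bonferroni": p <= alpha / m,
        "Benjamini-Hochberg": bh_reject(p, alpha),
    }
    for name, rej in rejections.items():
        false_pos = rej[m_false:].sum()        # rejected true nulls
        true_pos = rej[:m_false].sum()         # rejected false nulls
        fdp = false_pos / max(rej.sum(), 1)    # false discovery proportion
        results[name].append((false_pos > 0, fdp, true_pos / m_false))

for name, rows in results.items():
    fwer, fdr, power = np.mean(rows, axis=0)
    print(f"{name:20s} FWER={fwer:.3f}  FDR={fdr:.3f}  average power={power:.3f}")
```

The printout gives Monte Carlo estimates of the FWER, the FDR, and the average per-hypothesis power for each procedure, so you can see directly how conservative each one is under the model you specified; rerunning it with a different number of treatments, unequal sample sizes per treatment, or other effect sizes shows how the trade-off shifts.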

(This is not meant to be an exhaustive answer, just to mention one somewhat underused tool in the toolbox.)

  • Thanks for the interesting suggestion. I have never seen this practice recommended in textbooks or used in papers. Why would that be? Do researchers just rely on intuition and rules of thumb to decide which test has an acceptable type II error? – Argon Feb 09 '24 at 21:57
  • @Argon Ultimately I can't tell you, but I think many textbooks and papers are eager to present statistics (and science generally) as something "objective" and do not like to dwell on how to make decisions that are not objective, in the sense that they rely on background information and the aims of the researcher. Andrew Gelman likes to go on about "fake-data simulation": https://statmodeling.stat.columbia.edu/2023/02/28/whats-so-funny-bout-fake-data-simulation/ – Christian Hennig Feb 09 '24 at 22:05