32

I've been trying to wrap my head around how the False Discovery Rate (FDR) should inform the conclusions of the individual researcher. For example, if your study is underpowered, should you discount your results even if they're significant at $\alpha = .05$? Note: I'm talking about the FDR in the context of examining the results of multiple studies in aggregate, not as a method for multiple test corrections.

Making the (maybe generous) assumption that $\sim.5$ of hypotheses tested are actually true, the FDR is a function of both the type I and type II error rates as follows:

$$\text{FDR} = \frac{\alpha}{\alpha+1-\beta}.$$

It stands to reason that if a study is sufficiently underpowered, we should not trust the results, even if they are significant, as much as we would those of an adequately powered study. So, as some statisticians would say, there are circumstances under which, "in the long run", we might publish many significant results that are false if we follow the traditional guidelines. If a body of research is characterized by consistently underpowered studies (e.g., the candidate gene $\times$ environment interaction literature of the previous decade), even replicated significant findings can be suspect.
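To make the dependence on power concrete, here is a small R sketch (illustration only) that evaluates the formula above at $\alpha = .05$ for several power levels:

```r
# FDR as a function of alpha and power, assuming a given prior probability
# that a tested hypothesis is true (0.5 by default, as assumed above)
fdr <- function(alpha, power, prior = 0.5) {
  alpha * (1 - prior) / (alpha * (1 - prior) + power * prior)
}

round(fdr(alpha = 0.05, power = c(0.1, 0.2, 0.5, 0.8)), 3)
# 0.333 0.200 0.091 0.059 : the weaker the study, the less a "significant" result means
```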

Applying the R packages extrafont, ggplot2, and xkcd, I think this might be usefully conceptualized as an issue of perspective:

[xkcd-style plots omitted; their captions read "A significant result..." and "Not so sure..."]

Given this information, what should an individual researcher do next? If I have a guess of what the size of the effect I'm studying should be (and therefore an estimate of $1 - \beta$, given my sample size), should I adjust my $\alpha$ level until the FDR = .05? Should I publish results at the $\alpha = .05$ level even if my studies are underpowered and leave consideration of the FDR to consumers of the literature?

I know this is a topic that has been discussed frequently, both on this site and in the statistics literature, but I can't seem to find a consensus of opinion on this issue.


EDIT: In response to @amoeba's comment, the FDR can be derived from the standard type I/type II error rate contingency table (pardon its ugliness):

|                            |Finding is significant |Finding is insignificant |
|:---------------------------|:----------------------|:------------------------|
|Finding is false in reality |$\alpha$               |$1 - \alpha$             |
|Finding is true in reality  |$1 - \beta$            |$\beta$                  |

So, if we are presented with a significant finding (column 1), the chance that it is false in reality is $\alpha$ over the sum of that column, weighting the two rows equally per the 50/50 assumption above.
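Spelling that step out:

$$\text{FDR} = \Pr(\text{false} \mid \text{significant}) = \frac{0.5\,\alpha}{0.5\,\alpha + 0.5\,(1-\beta)} = \frac{\alpha}{\alpha + 1 - \beta}.$$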

But yes, we can modify our definition of the FDR to reflect the (prior) probability that a given hypothesis is true, though study power $(1 - \beta)$ still plays a role:

$$\text{FDR} = \frac{\alpha \cdot (1- \text{prior})}{\alpha \cdot (1- \text{prior}) + (1-\beta) \cdot \text{prior}}$$

  • It might not give you a definite answer to your question, but you may find inspiration in this text. – JohnRos Apr 28 '15 at 06:07
  • David Colquhoun's paper that you link to has very recently been discussed here (with @DavidColquhoun joining the discussion himself); you might be interested to take a look. – amoeba Apr 28 '15 at 16:29
  • Where does the formula for FDR in terms of $\alpha$ and $\beta$ come from? Perhaps I am being stupid, but I cannot see why it should be true. I would expect FDR to depend on the prevalence of the nulls in the population of studies, which does not seem to enter your formula. I am confused. – amoeba Apr 28 '15 at 16:34
  • @amoeba Sorry I just spaced when entering the second formula--the corrected version is up there now. Will think more regarding your second point! – Richard Border Apr 28 '15 at 18:12
  • Well, okay, I should take that back: your original formula is correct in the special case when the prior probability $p=0.5$. You actually had it written all along, but I did not notice; sorry. Also, you are right that for any given $p$ (apart from $p=0$, or your $\text{prior}=1$), FDR will grow with decreasing power, reaching $1$ at zero power. So your question makes sense, +1. – amoeba Apr 28 '15 at 19:34
  • A proper FDR correction across studies should be this: if the first author has already published 150 papers, the second author, 70 papers, the third author, 80 papers, and the fourth author, 120 papers, then the significance level should be at most 0.05/(150+70+80+120)=0.05/420=$1.19\cdot10^{-4}$. Editors should reject papers unconditionally if this level of significance is not achieved. Explanation: at the stage of your career when you published 150 papers, you should have enough funding to run studies that have enough power. – StasK May 01 '15 at 21:02
  • There are no problems with underpowered studies other than a waste of resources. By test design, you will not have more than probability $\alpha$ of obtaining unjustified significance. But in choosing $\alpha$, you have decided that this is still tolerable. Conversely, if nothing significant shows up, the only true interpretation is: "we have learnt as much from the experiment as if we hadn't done it: nothing." Which isn't problematic either. – Horst Grünbusch May 08 '15 at 16:29
  • @Horst, the "problem" with underpowered studies (that the OP is describing) is that if all the studies in some field are grossly underpowered, then they will rarely detect a true effect, whereas with probability $\alpha$ they will report a false discovery, which can lead to most of the reported discoveries being false (i.e. to a very high FDR). This is not a nice situation for a scientific field to be in. – amoeba May 08 '15 at 16:33
  • @amoeba: Right: the local $\alpha$ is only designed to bound the error rate of a single test, so it's not suitable for a joint conclusion about multiple studies. However, is the FDR in this field a useful tool? Do we conclude: "65% of the studies found an effect, so there is one"? One could do this, i.e. one could consider the test results of different studies as Bernoulli r.v.s. But one loses all the information in the effect estimators. So rather look at the estimators. The researcher needs them anyway, and they help you get rid of the "problem" with underpowered studies. – Horst Grünbusch May 08 '15 at 16:58
  • Could someone point me to literature that provides theoretical foundation for the relation between FDR and statistical power? – Aleksandr Blekh May 08 '15 at 18:15

5 Answers

7

In order to aggregate the results of multiple studies, you should rather think of making your results accessible for meta-analyses. A meta-analysis considers the data of each study, or at least its effect estimates, models study effects, and comes to a systematic conclusion by forming a kind of large virtual study out of many small single studies. Individual $p$-values, fictitious priors, and planned power are not important inputs for meta-analyses.

Instead, it is important to have all studies accessible, regardless of power or significance. In fact, the bad habit of publishing only significant results and concealing non-significant ones leads to publication bias and corrupts the overall record of scientific results.

So the individual researcher should conduct a study in a reproducible way, keep all the records, and log all experimental procedures, even if such details are not asked for by the publishing journals. He should not worry too much about low power. Even a noninformative result (i.e., the null hypothesis is not rejected) adds estimates for further studies, as long as the data themselves are of sufficient quality.

If you try to aggregate findings only by $p$-values and some FDR considerations, you are taking the wrong path: a study with a larger sample size, smaller variance, and better-controlled confounders is of course more reliable than others, yet all studies produce $p$-values, and the best FDR procedure applied to those $p$-values can never make up for such quality disparities.
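As a rough illustration of the "large virtual study" idea, fixed-effect (inverse-variance) pooling can be sketched in a few lines of base R; the effect estimates and standard errors below are made-up placeholders standing in for several small studies:

```r
# Hypothetical effect estimates and standard errors from five small studies
est <- c(0.30, 0.12, 0.45, -0.05, 0.22)
se  <- c(0.20, 0.25, 0.30, 0.18, 0.22)

# Fixed-effect (inverse-variance) pooling: each study contributes according
# to its precision, not according to whether it was individually "significant"
w         <- 1 / se^2
pooled    <- sum(w * est) / sum(w)
pooled_se <- sqrt(1 / sum(w))
z         <- pooled / pooled_se
p_value   <- 2 * pnorm(-abs(z))

round(c(estimate = pooled, se = pooled_se, z = z, p = p_value), 3)
```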

  • Horst, you seem to be answering a different question than was asked. – Alexis May 07 '15 at 17:21
  • Note that the question is about FDR between studies, not within. This involves some kind of Bayesian approach in order to have an acceptable overall rate of correct decisions. My answer stresses that an overall judgement is better made by aggregating study data and estimates, not decisions, so the issue resolves itself by creating a huge "virtual study", as long as the data (not the decisions) of the single studies are reliable. – Horst Grünbusch May 08 '15 at 01:23
7

> If I [individual researcher] have a guess of what the size of the effect I'm studying should be [...], should I adjust my $\alpha$ level until the FDR = .05? Should I publish results at the $\alpha=.05$ level even if my studies are underpowered and leave consideration of the FDR to consumers of the literature?

I would definitely not try to adjust the $\alpha$ level to reach a certain FDR, because it is very difficult: not only do you need to have a good estimate of power, but also a good estimate of the prevalence of nulls in some vaguely defined (!) population of studies that you imagine your own study to be part of. This is hardly possible.
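Just to spell out why both quantities are needed, here is what that back-calculation would look like in R if one attempted it anyway; the power and prior values below are pure guesses for illustration:

```r
# Invert FDR = alpha*(1 - prior) / (alpha*(1 - prior) + power*prior) for alpha
alpha_for_fdr <- function(target_fdr, power, prior) {
  target_fdr * power * prior / ((1 - prior) * (1 - target_fdr))
}

alpha_for_fdr(target_fdr = 0.05, power = 0.5, prior = 0.5)  # ~0.026
alpha_for_fdr(target_fdr = 0.05, power = 0.5, prior = 0.1)  # ~0.003
```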

On the other hand, even though I engaged in a long discussion with @DavidColquhoun about some specific claims in his paper, I do on some level agree with his practical recommendations, in that $p<0.05$ does not strike me as particularly strong evidence. Personally, I have learned to consider it relatively weak, and I am not convinced at all by many published results that hinge on a single $p\approx 0.05$. Truly convincing scientific results usually either have a tiny $p$-value, $p\ll 0.05$, or are based on several experiments with supporting conclusions (such that a "combined" $p$-value would again be tiny).
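For what it's worth, one standard way to form such a "combined" $p$-value from independent experiments is Fisher's method; a minimal sketch in R:

```r
# Fisher's method: under the null, -2 * sum(log(p_i)) ~ chi-squared with 2k df
fisher_combine <- function(p) {
  pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
}

# Three independent experiments, each only marginally "significant" on its own:
fisher_combine(c(0.04, 0.03, 0.05))  # combined p is roughly 0.003
```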

So instead of adjusting your $\alpha$ in some specific automatic way, I would rather suggest remaining generally very skeptical about your own findings, all the more so if you know that your study is underpowered. Get more data. Think of a supporting analysis. Run another experiment. Etc.

amoeba
  • 104,745
5

This is actually a deep philosophical question. I'm a researcher myself and I've thought about this for a while. But before answering, let's review exactly what the false discovery rate is.

**FDR versus P**

P is simply a measure of the probability of saying that there is a difference when there is no difference at all; it does not take power into account. The FDR, on the other hand, does take power into account. However, in order to calculate the FDR, we have to make an assumption: what is the probability that we receive a true positive result? That's something we will never have access to, except under highly contrived circumstances. I actually spoke about this recently during a seminar I gave. You can find the slides here.

Here is a figure from David Colquhoun's paper on the topic:

[Figure from Colquhoun (2014)]

The false discovery rate is computed by dividing the number of false positives by the sum of the true positives and the false positives; in the example, $495/(80+495) \times 100\% = 86\%$!
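In R, that computation is simply (using the counts quoted above):

```r
false_pos <- 495  # false positives quoted in the example
true_pos  <- 80   # true positives quoted in the example

100 * false_pos / (false_pos + true_pos)  # about 86 (percent)
```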

**A little bit more on P**

Take a close look at the slides from my lecture. I discussed the fact that P values are drawn from a distribution, which means that there will always be a chance that you will find a false positive. So statistical significance shouldn't be thought of as absolute truth. I argue that something that is statistically significant should be interpreted as, "Hey, there might be something interesting here; I'm not sure, someone go double check!" Hence the fundamental notion of reproducibility in research!
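A quick simulation makes this concrete (a sketch, not from the slides): under the null, p-values are uniform, so about 5% fall below 0.05 no matter the sample size; under a modest true effect with a small sample, most p-values miss the cutoff:

```r
set.seed(1)
n_sims <- 10000
n <- 20  # observations per group

# p-values from two-sample t-tests when the null is true...
p_null <- replicate(n_sims, t.test(rnorm(n), rnorm(n))$p.value)
# ...and when there is a modest true effect (an underpowered design)
p_alt  <- replicate(n_sims, t.test(rnorm(n), rnorm(n, mean = 0.3))$p.value)

mean(p_null < 0.05)  # close to 0.05: false positives at the nominal rate
mean(p_alt  < 0.05)  # well below 1: low power, so true effects are often missed
```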

So... what do we do? Well, an interesting point about the figure above and my analysis of P and FDRs is that the only way we can ever achieve a clear understanding is through 1) reproducibility and 2) publishing all results. That includes negative results (even though negative results are difficult to interpret). However, the conclusions that we draw from our results must be appropriate. Unfortunately, many readers and researchers do not fully understand the notions of P and FDR. I believe it is the responsibility of the readers to appropriately analyze results... which means that the burden is ultimately on the shoulders of educators. After all, a P value of 0.000000001 is meaningless if the "prevalence" (see figure above) is 0 (in that case, the false discovery rate would be 100%).

As a publishing researcher, just be careful to fully understand your results and make claims only as strong as you are willing to stand behind. If it turns out that the FDR for your particular study is 86% (like the example above), then you should be very careful about your interpretations. On the other hand, if the FDR is small enough for your comfort... still be careful about your interpretations.

I hope everything here was clear. It's a very important concept and I'm glad that you brought up the discussion. Let me know if you have any questions/concerns/etc.

  • +1 I like the thoughtful answer. Two reactions: (1) negative results are just as easy to interpret as positive results; negative result: because there is no effect, or because you are underpowered?; positive result: because there is a relevant effect, or because you are overpowered? Combining tests for difference with tests for equivalence addresses both kinds of issues. (2) Are you sure (statistical) "significance" doesn't mean "given my study design assumptions and a preferred type I error rate, evidence prefers $H_{A}$ to $H_{0}$?" – Alexis May 07 '15 at 17:28
  • @Alexis There is no such thing as an overpowered study! As long as the effect size is noticed, there can be no harm in being able to define the size of the effect more closely by having a study with a larger sample size. The notion of 'overpowered' seems to me to be tied to the empty notion that one can make useful inferences from looking at a P-value without looking at the observed data. – Michael Lew May 08 '15 at 00:42
  • @MichaelLew I am afraid I respectfully but strongly disagree (assuming I understand you :). Study certainly can be overpowered. See for example, Why does frequentist hypothesis testing become biased towards rejecting the null hypothesis with sufficiently large samples?, and my answer to it. – Alexis May 08 '15 at 01:44
  • @Alexis No, I'm sorry to say that you are mistaken. Only 5% of samples of one million will yield P-values of less than 0.05 if the null is true and the assumptions of the test are valid, just as with smaller samples. The larger sample will generally yield a more accurate estimate of the true parameter value, and will more often yield a 'significant' result when the null is nearly, but not quite, true. Of course, when the null is nearly true it is actually false, and so it is appropriate that a large enough sample will cast doubt on it. – Michael Lew May 08 '15 at 13:03
  • @Alexis Continuing... Your answer to the question that you link to is misleading. You say "This occurs because the p-value becomes arbitrarily small as the sample size increases in frequentist tests for difference", but that is only the case when the null is false. There can be no 'bias' in rejecting the null when it is false. – Michael Lew May 08 '15 at 13:05
  • @MichaelLew: You are right that the issue of overpowering could be (partly) resolved if you always consider the estimated effect size together with the p-value. However, this somewhat defeats the purpose of p-values: mapping the effect estimator to the binary test result "effect present/not present" such that the type I error rate is correct. Also, your judgement of what a relevant effect size may be can change as you see the p-value. So it is in fact best to address the issue by prespecifying a relevant effect range in advance and subsequently comparing it with the study CI, as Alexis suggested. – Horst Grünbusch May 08 '15 at 16:14
  • @MichaelLew You are assuming that a null hypothesis value can actually exist. That is, that $\theta$ can actually equal exactly zero, as opposed to a range of values very close to zero. Frequentist tests are indeed biased towards treating nearly-zero measures as significant, as the OP of the linked question described. The only resolution to that quandary (within the realm of frequentist tests) is to explicitly address effect sizes that are relevantly large. And if I am mistaken, I am in very good company. :D – Alexis May 08 '15 at 16:53
  • Nice discussion, but @Alexis, the reason I said that negative results are difficult to interpret is because you don't necessarily know how or why the results were negative, even in a well controlled study. Were they negative because of your experimental protocol? What part of your protocol exactly? With a positive result (assuming the study has been conducted well) you should see a clear difference between isolated variables. That narrows the scope of interpretation. Does that make sense? – justanotherbrain May 08 '15 at 17:02
  • To a degree... I was strictly speaking in terms of statistical inference, while you are speaking more about the logic of study design and an ontology of producing scientific knowledge. That said, I feel that positive findings that are not interpreted with as much care wrt protocol, etc. are just as likely to be spurious as negative findings. Not all phenomena of the universe are amenable to study in isolation (e.g. both individual and population health are simultaneously chemical, social, behavioral, etc.), and so ontological uncertainties must accompany studies of such complex systems. – Alexis May 08 '15 at 17:10
  • @HorstGrünbusch I think that you are mistaken about the "purpose of p-values" in that they are a product of significance tests but are unnecessary to hypothesis tests. It is hypothesis tests that provide the binary "effect present/not present" outcome that you mention. See these answers for more information: http://stats.stackexchange.com/questions/16218/what-is-the-difference-between-testing-of-hypothesis-and-test-of-significance/16227#16227 and http://stats.stackexchange.com/questions/46856/interpretation-of-p-value-in-hypothesis-testing/46858#46858 – Michael Lew May 08 '15 at 21:17
  • @MichaelLew: OK, I meant "purpose of tests" (as only they map to the binary space, p-values map to the unit interval). I agree that overpowering has different relevance in the Fisher paradigm. – Horst Grünbusch May 09 '15 at 00:47
  • @MichaelLew: Following your hint on the testing paradigm, doesn't this question involve the hybrid approach? – Horst Grünbusch May 09 '15 at 01:04
  • @HorstGrünbusch I don't see the original question as being set in a hybrid context as it deals with alpha and beta, not P-values. However, justanotherbrain's answer would certainly need careful re-working to place it solely in either the Neyman & Pearson framework or the significance testing framework. False discovery rates really only belong in the former. – Michael Lew May 09 '15 at 02:34
4

To help understand the relationships, I created this graph of FDR as a function of the prior probability, for various powers (with alpha = 0.05). Note that this graph, like the equation of @Buckminster, computes the FDR for all results with P less than alpha. The graph would look different if you only considered P values very close to the P value you happened to observe in one study.
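The figure itself is not reproduced here, but curves of this kind can be generated along the following lines (a sketch using ggplot2; the styling and exact power values of the original graph are assumptions):

```r
library(ggplot2)

alpha <- 0.05
d <- expand.grid(prior = seq(0.01, 0.99, by = 0.01),
                 power = c(0.2, 0.5, 0.8))
d$fdr <- with(d, alpha * (1 - prior) / (alpha * (1 - prior) + power * prior))

ggplot(d, aes(x = prior, y = fdr, colour = factor(power))) +
  geom_line() +
  labs(x = "Prior probability that the effect is real",
       y = "False discovery rate (for all P < alpha)",
       colour = "Power")
```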

1

Whether to publish is a decision. I think it is worthwhile to study the benefits and costs associated with this decision.

1) The academic environment universally pushes researchers to publish more, though various rankings of publications also affect this record. We can presume that more prestigious journals might have more robust quality checking (I hope so).

2) There might be social costs associated with producing too many publications. Those resources might be better used elsewhere, for example in applied research without publication of the results. There was recently a paper arguing that many publications are not important as sources, since the sheer amount of new publications is so large... :)

http://arxiv.org/pdf/1503.01881v1.pdf

For the individual researcher, point 1) creates pressure to publish more, and I think there should be institutionalized quality checks that do not depend on individual people in order to keep quality at an accepted level.

In any case, your parameter values are not facts; they must be assigned by weighing the various costs and benefits associated with the number of results published when results are truly and/or falsely significant.

Analyst
  • 2,655