6

I apologize if this question has been done to death, but as a non-statistician I really don't know what the bottom-line takeaway is. I am looking at a sample of 30,000 individuals who were the subjects of an economic intervention. This intervention produces a mean increase in the subjects' annual income of $2000. However, the result is not statistically significant at the usual levels. From a statistical viewpoint, has the intervention failed, and if not, what further can be done?

mpiktas
  • 35,099
  • 8
    How did you test the increase for significance? Was a control group used? – Nick Stauner Jun 01 '14 at 19:48
  • I love the strict adherence to the ideals of the scientific method implicit in the question. Lies, damn lies and statistics, right? :D – Nathan Jun 02 '14 at 16:02
  • 2
    @NathanCooper: More like strict adherence to the conventions of statistically naive scientific practice. If statistics is to blame, its fault is in providing simple methods with too much appeal in even inappropriate circumstances, or failing to emphasize their limitations sufficiently. (The probability that statistics is to blame is pretty low IMO.) – Nick Stauner Jun 02 '14 at 16:47

5 Answers

12

Statistical insignificance does not mean that the effect being tested for does not exist, but rather that the observed data do not furnish strong evidence for the existence of that effect.

For example, if you have an unloaded six-sided die, but the numbers on its faces are {1,2,3,4,5,5} instead of {1,2,3,4,5,6}, and you roll it only 3 times, it may not be evident from such a small sample that the die gives you more fives than ones. That doesn't mean the die is no different from a normal die (after all, we have the benefit of inspecting it and can clearly see it is different); it may simply be that we need to collect more data about the die's observed behavior in order to make a statistically significant inference about its intrinsic properties.
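To make that concrete, here is a small simulation sketch of my own (not part of the original example); the use of `scipy.stats.binomtest` (SciPy ≥ 1.7), the random seed, and the roll counts of 3 versus 300 are arbitrary choices for illustration:

```python
# Rough sketch: a die with faces {1,2,3,4,5,5}, testing H0: P(roll == 5) = 1/6.
# With only 3 rolls the test has almost no power; with 300 rolls it usually does.
import random
from scipy.stats import binomtest

random.seed(0)
faces = [1, 2, 3, 4, 5, 5]  # the loaded die from the example

def p_value_for(n_rolls):
    """Roll the die n_rolls times and test whether '5' comes up more often than 1/6 of the time."""
    rolls = [random.choice(faces) for _ in range(n_rolls)]
    k = sum(r == 5 for r in rolls)
    return binomtest(k, n_rolls, p=1/6, alternative="greater").pvalue

print("p-value with   3 rolls:", round(p_value_for(3), 3))
print("p-value with 300 rolls:", round(p_value_for(300), 6))
```

With 3 rolls the test essentially never has enough evidence to reject the null, while with 300 rolls the excess of fives is usually unmistakable, even though the die itself is identical in both cases.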

Analogously, even a sample size of 30,000 may not be sufficient to detect a difference in the behavior of your population under the two treatments, because your statistical test has low power. Or maybe the mean increase you're observing really is due to random chance and no effect truly exists. Since you have not specified your tolerance for Type I error, I can't really speak to that.
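For instance, here is a hypothetical sketch of how that can happen: assuming (purely for illustration, these numbers are not from the question) an income standard deviation of $100,000 and a 15,000/15,000 treatment/control split, a genuine $2,000 effect gives a two-sample t-test only about 40% power, so a non-significant result is common even though the effect is real.

```python
# Hypothetical sketch: a real $2,000 treatment effect buried in noisy income data.
# The $100,000 SD and the 15,000-per-group split are assumptions, not facts from the question.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_per_group = 15_000
sd = 100_000          # assumed person-to-person variability in annual income

control   = rng.normal(40_000, sd, n_per_group)
treatment = rng.normal(42_000, sd, n_per_group)   # true mean increase of $2,000

t, p = ttest_ind(treatment, control)
print(f"observed difference: ${treatment.mean() - control.mean():,.0f}")
print(f"p-value: {p:.3f}")   # with these assumed numbers the test has only ~40% power,
                             # so non-significant results occur often despite a real effect
```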

The takeaway here is that failure to detect significance doesn't mean no effect exists--it simply means that, by random chance or by lack of power, the data furnishes insufficient evidence to claim that the hypothesized effect exists with a high degree of confidence.

heropup
  • 5,406
9

Well, this is certainly not good news. Sorry.

Your results do not provide any evidence for the existence of an effect. The effect, of course, might still exist: it could be smaller or more variable than you expected, or your experiment may have been flawed in some way that kept it from being detected.

So, what can you do now?

0) Check your data. Make sure nothing silly has happened. Missing values sometimes get coded as 0s/-1s/99s, and these numbers obviously shouldn't be entered into your analysis as actual values. Similarly, if you're randomizing people to treatments/controls, make sure these groups are actually similar. People get bitten by these sorts of bugs all the time.
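For illustration only, here is a sketch of what such checks might look like in Python/pandas; the toy DataFrame, the column names (`group`, `age`, `income_change`), and the particular sentinel codes are all hypothetical:

```python
# Illustrative sanity checks; the data and column names are made up for the sketch.
import pandas as pd

df = pd.DataFrame({
    "group":         ["treatment", "control", "treatment", "control", "treatment"],
    "age":           [34, 36, 29, 31, 44],
    "income_change": [1500, -1, 2400, 99, 1800],   # -1 and 99 look like sentinel codes
})

# 1) Look for sentinel codes masquerading as real values.
sentinels = [0, -1, 99, 999, 9999]
print(df["income_change"].isin(sentinels).sum(), "rows with suspicious sentinel values")
print(df["income_change"].describe())      # eyeball min/max for impossible numbers

# 2) Check that treatment and control groups look comparable at baseline.
print(df.groupby("group")["age"].describe())
print(df.groupby("group").size())           # badly unbalanced groups are a red flag
```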

1) Perform a power analysis. Ideally, you would have performed one before beginning the project, but doing one now can still help you determine whether your experiment, as performed, had a reasonable chance of detecting your expected effect (a rough sketch follows below). If not (perhaps your drop-out/noncompliance rate was very high), you might want to perform a larger experiment.

You should not add subjects, run the analysis, and repeat until your result becomes significant, but there are lots of strategies for mitigating the problems associated with taking multiple "looks" at your data.
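As a rough sketch of the power analysis in (1), you could do something like the following with statsmodels; the expected $2,000 effect, the assumed $100,000 SD of income changes, and the 15,000-per-group split are placeholder values you would replace with your own planning numbers:

```python
# Rough power-analysis sketch; the effect size, SD, and group sizes are assumptions.
from statsmodels.stats.power import TTestIndPower

expected_effect = 2_000      # the effect you hoped to detect (in dollars)
assumed_sd      = 100_000    # plausible SD of individual income changes
effect_size     = expected_effect / assumed_sd   # Cohen's d = 0.02

analysis = TTestIndPower()
power = analysis.power(effect_size=effect_size, nobs1=15_000, alpha=0.05, ratio=1.0)
print(f"power with 15,000 per group: {power:.2f}")   # about 0.4: underpowered

# How many subjects per group would have been needed for 80% power?
n_needed = analysis.solve_power(effect_size=effect_size, power=0.80, alpha=0.05, ratio=1.0)
print(f"needed per group for 80% power: {n_needed:,.0f}")   # roughly 39,000 per group
```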

2) Look at sub-groups and covariates. Perhaps your proposed intervention works best in a specific geographical region, or for younger families, or whatever. In general, it would be best to specify all of these comparisons ahead of time, since exploiting "experimenter degrees of freedom" can dramatically increase the false positive rate.

That said, there's nothing wrong with looking per se. You just need to be upfront about the fact that these are post-hoc/exploratory analyses, and provide weaker evidence than an explicitly confirmatory study. Obviously, it helps a lot if you can identify plausible reasons for why the subgroups differ. If you find a hugely significant effect in the North, but nothing in the drought-stricken, war-ravaged South, then you are in pretty good shape. On the other hand, I'd be a lot more skeptical about a claim that it works on subgroups of people born during full moons but only at high tide :-)

If you do find something, you may be tempted to publish right away. Many people do, but your argument would be much stronger if you could confirm it in a second sample. As a compromise, consider holding out some of your data as a validation set; use some of the data to look for covariates and the validation set to confirm your final model.
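One way to set up that exploration/validation compromise is sketched below; the toy data, the `region` covariate, the 50/50 split, and the per-subgroup t-tests are all illustrative choices, not something prescribed in the answer:

```python
# Sketch of an exploration/validation split for post-hoc subgroup analysis.
# Column names, the split fraction, and the toy data are illustrative assumptions;
# the toy data contains no real subgroup effect, so nothing should come out significant.
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n = 30_000
df = pd.DataFrame({
    "group":  rng.choice(["treatment", "control"], n),
    "region": rng.choice(["North", "South"], n),
    "income_change": rng.normal(1_000, 50_000, n),
})

# Split once, up front: explore on one half, confirm on the other.
explore = df.sample(frac=0.5, random_state=0)
validate = df.drop(explore.index)

def subgroup_p(data, region):
    """Treatment-vs-control t-test on income_change within one region."""
    sub = data[data["region"] == region]
    t, p = ttest_ind(sub.loc[sub["group"] == "treatment", "income_change"],
                     sub.loc[sub["group"] == "control",   "income_change"])
    return p

for region in ["North", "South"]:
    print(region,
          "explore p =", round(subgroup_p(explore, region), 3),
          "| validate p =", round(subgroup_p(validate, region), 3))
# Only subgroups that look promising in `explore` AND hold up in `validate`
# deserve to be carried forward as (still tentative) findings.
```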

3) Could a null result be informative? If previous work has found similar effects, it may be useful to see whether you can identify factors that explain why they weren't repeated in your population. Publishing null results/failures-to-replicate is often tricky because you need to convince reviewers that your experiment was sufficiently well-designed and well-powered to detect the sought-after effect. With $n=30,000$, however, you are probably in pretty good shape on that front.

Good luck!

Matt Krause
  • 21,095
  • 1
  • I feel someone could grab the wrong end of the stick here. If you go on a fishing expedition you lose pretty much all the evidential power. It is possible to check for subgroups in the upfront design, but there are usually enough sensibly distinct groups to provide bogus Type I's in this kind of exploratory analysis. – Nathan Jun 02 '14 at 16:00
  • 1
    @NathanCooper, you're absolutely correct. The paper I linked to describes how bad fishing expeditions can get (and it's pretty grim).

    However, having spent a lot of time and money gathering data, it would be nice to get something out of it, even if that something is a tentative hypothesis that needs to be rigorously confirmed. If including very plausible covariates produces a big effect size, then that can be relatively convincing (but yes, it still needs to be explicitly confirmed). If one needs a crazy-quilt of inclusions and exclusions to push something just into significance, then...no.

    – Matt Krause Jun 02 '14 at 16:36
  • Re #1: you seem to be recommending post-hoc power analysis. I don't think that's a useful thing to do. I like this paper by @rvl on the issue: http://www.stat.uiowa.edu/files/stat/techrep/tr378.pdf – Jake Westfall Dec 08 '15 at 18:24
  • I was aiming for something slightly different. One problem with post-hoc power analyses is that they use the observed effect size (which we already know is small, or we wouldn't be here). However, it's not totally nuts to plug your expected effect size into a power analysis and see if your experiment--as run--could have detected it. Perhaps you ended up with considerably less data than in your initial plan (subjects often drop out of experiments, the subject pool might be different from what you expected, etc.). I'd agree that running a power analysis beforehand would be best, though. – Matt Krause Dec 08 '15 at 20:06