I have a dataset of 2 conditions. Each condition has 15 measurements. I tested them using a paired t-test to find out whether the difference is statistically significant. Should I use a p-value correction method for such a single test with a small sample size? If I should, which methods are suitable for this case (except Bonferroni)?
Bonferroni "correction" with m=1 returns the same p-value that you input. With just one hypothesis test, you don't have multiple comparisons to correct for, so it doesn't do anything at all. The formula still "works" for m=1, so you can apply a Bonferroni correction, but it's pointless. – Nuclear Hoagie Apr 27 '23 at 19:37
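A quick numeric illustration of the comment above (a sketch, not part of the thread; the p-value 0.03 is made up):

```python
# Bonferroni with m tests multiplies each p-value by m (capped at 1).
# With m = 1 the "correction" returns the input unchanged.
from statsmodels.stats.multitest import multipletests

p = 0.03                                            # hypothetical single p-value
print(min(1.0, 1 * p))                              # 0.03 -- by hand, m = 1
print(multipletests([p], method="bonferroni")[1])   # [0.03] -- same via statsmodels
```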
3 Answers
No, for a single test you have a single p-value, thus no correction method is needed. Indeed, the multiple comparisons problem arises when you perform many statistical tests or build many confidence intervals on the same data. Also, the small sample size issue is irrelevant to the multiplicity issue.
If you are worried about the validity of the p-value in light of the small sample size, then you may try a test via the bootstrap or a permutation test.
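For concreteness, here is a minimal sketch (not from the answer itself) of both suggestions applied to the paired differences; all data below are made up. With paired data, the permutation test flips the sign of each difference, and a bootstrap test resamples the (centred) differences with replacement.

```python
# Paired t-test plus resampling-based alternatives on the paired differences.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
cond1 = rng.normal(10.0, 2.0, size=15)           # hypothetical condition 1
cond2 = cond1 + rng.normal(0.5, 2.0, size=15)    # hypothetical condition 2
d = cond2 - cond1                                # paired differences

# Classical paired t-test
print("paired t-test p =", stats.ttest_rel(cond2, cond1).pvalue)

# Permutation (sign-flipping) test: under H0 each difference is symmetric
# about 0, so its sign is exchangeable.
n_resamples = 10_000
signs = rng.choice([-1.0, 1.0], size=(n_resamples, d.size))
perm_means = (signs * d).mean(axis=1)
print("permutation p   =", np.mean(np.abs(perm_means) >= abs(d.mean())))

# Bootstrap test: centre the differences at 0 to impose H0, resample with
# replacement, and compare the observed mean to the bootstrap null distribution.
boot = rng.choice(d - d.mean(), size=(n_resamples, d.size), replace=True)
boot_means = boot.mean(axis=1)
print("bootstrap p     =", np.mean(np.abs(boot_means) >= abs(d.mean())))
```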
@dipetkov Why wouldn't it be? A small sample is an issue irrespective of the technique. https://stats.stackexchange.com/questions/112147 has a good discussion on this (+1 to the answer). – usεr11852 Apr 27 '23 at 17:12
@usεr11852 Do you mean to state that all techniques are equally robust to small sample size? Personally I don't think this is great advice. – dipetkov Apr 27 '23 at 17:25
@dipetkov Reading Chapter 2 of Davison and Hinkley's book on the bootstrap, I couldn't find any contraindications. Indeed, scholars (e.g., https://www3.stat.sinica.edu.tw/statistica/j33n2/j33n222/j33n222.html) use the bootstrap to improve inference in problems with low n/p. – utobi Apr 27 '23 at 19:44
@dipetkov: Start with $n=2$. :) (Of course, the bootstrap is not a panacea, but on the other hand it is a reasonable suggestion in the context of this answer, and an unreasonable point on which to critique it without context. The smaller the sample size, the smaller our power as a general principle; bootstrap or our favourite technique doesn't change that.) – usεr11852 Apr 27 '23 at 20:36
@usεr11852 One thing I've learned from reading about bootstrapping is that testing hypotheses is not exactly where it shines. Why mention all kinds of statistical procedures in an answer and not attempt to suggest to the OP what might work best for their situation? – dipetkov Apr 27 '23 at 22:14
@usεr11852 The paper you link to, isn't it about parametric bootstrapping? I wonder if that has something to do with small sample sizes? – dipetkov Apr 27 '23 at 22:16
1. I didn't link any paper. 2. You haven't made any suggestions (which is actually the main reason I commented in the first place). – usεr11852 Apr 27 '23 at 22:24
@usεr11852 My suggestion, which should be quite obvious by now, is that bootstrapping a p-value when there are not one but two small samples is bad advice. (Though I acknowledge you disagree.) I suggest that a permutation test is a better choice, if the OP wants to consider a non-parametric approach. – dipetkov Apr 27 '23 at 22:33
Thank you for being constructive. (+1) 1. I also think that permutation tests are preferable to bootstrapping when it comes to testing. (That doesn't mean though that bootstrapping is an unreasonable suggestion for this ask.) 2. If you had mentioned permutation tests in your initial comment we wouldn't need this discussion. – usεr11852 Apr 27 '23 at 22:54
@dipetkov More generally: I personally believe that if one has a critique, they should give an alternative too. I have come across quite a few occasions where people will undermine a half-reasonable suggestion without any alternative suggestion, simply to "show off" while being too lazy to put their money where their mouth is. Academic debates are often rife with that. – usεr11852 Apr 27 '23 at 22:59
@usεr11852 I have no idea why you felt compelled to jump in. I thought utobi's answer was appropriate, except for the bootstrap suggestion. So I asked about it. I didn't do it in a disrespectful manner. That's probably a good enough reason to not be having this discussion. – dipetkov Apr 27 '23 at 23:15
I thought your comment was appropriate but asked you to elaborate because I couldn't understand where you were coming from, as it gave no alternative suggestions. In my view, small sample sizes primarily affect power, not significance, so I saw bootstrapping as a reasonable suggestion in the answer. – usεr11852 Apr 27 '23 at 23:27
@dipetkov How are permutations going to work with a paired t-test, which is effectively a one-sample t-test? What are you going to permute in this single sample? I guess that resampling by means of bootstrapping is the only resampling that can be done here. – Sextus Empiricus Apr 28 '23 at 06:29
@SextusEmpiricus I find this a somewhat bizarre hill to die on (i.e., writing comments that are not as well thought out as they might have been). For the OP, here is how to do a permutation test on paired data: Randomisation/permutation test for paired vectors in R. – dipetkov Apr 28 '23 at 09:54
@dipetkov Are you sure that permutations are a good idea with paired data? Personally I don't think this is great advice; it's even worse than bootstrapping (which is more or less fine for n=15). – Sextus Empiricus Apr 28 '23 at 10:30
@SextusEmpiricus This entire thread of comments started with me wondering whether utobi is giving out good advice... So I won't disagree. (Also I'm worried about the pile on continuing.) I find Dave wrote a better answer so that's the answer I upvoted. – dipetkov Apr 28 '23 at 10:33
@usεr11852 While this issue is not black-and-white, I have been going over some of my reading: Tim Hesterberg, "What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum", https://doi.org/10.48550/arXiv.1411.5279: "Bootstrap hypothesis testing is relatively undeveloped, and is generally not as accurate as permutation testing. For example, we noted earlier that it is better to do a permutation test to compare two samples, than to pool the two samples and draw bootstrap samples." [NB: the statement is not about paired data.] – dipetkov Apr 28 '23 at 10:57
Also related: https://stats.stackexchange.com/questions/482654/is-bootstrap-problematic-in-small-samples – Richard Hardy Apr 28 '23 at 14:15
With a small sample size, there are legitimate concerns.
What kind of power do you have to reject a false null hypothesis?
If your data lack normality, do you have enough data for the t-test to be robust to the deviation from the assumed normality?
The latter feeds into the former, as deviations from normality tend to affect t-test power rather than t-test size. That is, such deviations from the assumed normality make it more difficult to reject false null hypotheses, rather than making it easier to reject true null hypotheses.
However, having a small sample size does not, by itself, mean that your test statistic lacks the claimed distribution. The small sample size is accounted for by using a low number of degrees of freedom. What could be concerning is that, when the sample size is low, the true distribution of the t-statistic might differ to a meaningful extent from the claimed distribution, meaning that your p-values are not really telling you what they are supposed to tell you, since they are calculated from an incorrect distribution.
Since you only have one test, no p-value adjustment is needed to account for multiple tests.
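A rough way to check both concerns is by simulation. The sketch below is not part of the answer; the skewed distribution and the effect size of 0.5 are arbitrary choices made purely for illustration. It estimates the size and power of the paired t-test (run as a one-sample t-test on the differences) with n = 15 skewed differences.

```python
# Estimate the rejection rate of a one-sample t-test on n = 15 skewed differences.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_sim, alpha = 15, 10_000, 0.05

def rejection_rate(true_shift):
    # Skewed differences: a centred exponential plus the true mean difference.
    rejections = 0
    for _ in range(n_sim):
        d = rng.exponential(1.0, size=n) - 1.0 + true_shift
        rejections += stats.ttest_1samp(d, 0.0).pvalue < alpha
    return rejections / n_sim

print("estimated size  (true shift 0):  ", rejection_rate(0.0))
print("estimated power (true shift 0.5):", rejection_rate(0.5))
```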
(This answer ignores the issue with low sample size.)
I'd like to add a bit of nuance to the answers here, as it's tempting to read them and come away with this rule:
If a single p-value is observed, then correction is unnecessary; the type I error of the testing procedure is not inflated.
Type I error is inflated if any part of the testing procedure, i.e., any part of the pipeline from dataset -> p-value, depends on the data. Read the Garden of Forking Paths paper [1] for more info.
For example, if—
- (in another universe) the observations turn out to be so skewed that after plotting them you opt to test for the equality of medians rather than means, and
- you do not want to inflate the type I error
—then a p-value adjustment needs to be made. For example, if you were to use a Bonferroni correction, then the number of null hypotheses here is 2: one for the mean test, and one for the median test. Note that you did not have to observe 2 p-values; you just had to use a testing procedure which allowed for 2 different hypotheses (whether you knew it or not!).
Adjustment is unnecessary if universes like (1) don't exist, or you don't care about inflating type I error. To destroy universes like (1), the entire testing procedure needs to be pre-specified, e.g., force yourself to test for the equality of means regardless of the observed distribution (such a decision should usually be based on the science—not the data—anyway). In practical terms, this means implementing the entire data analysis code without looking at any data from your experiment.
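As a small sketch of the adjustment described above (the numbers are hypothetical; m counts the implicit hypotheses in the family, not the p-values you happened to observe):

```python
# Bonferroni over the implicit family: the mean test you ran plus the median
# test you would have run in the skewed-data universe.
m = 2                      # mean test + median test (the forking paths)
p_observed = 0.03          # the single p-value actually computed (hypothetical)
p_adjusted = min(1.0, m * p_observed)
print(p_adjusted)          # 0.06 -- no longer significant at the 0.05 level
```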
References
- Gelman, Andrew, and Eric Loken. "The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time." Department of Statistics, Columbia University 348 (2013): 1-17.