Suppose there is a data set $D$ and 20 groups of researchers working on it independently. Each group will publish a few results on $D$. Suppose each group finds about 3 tests significant (at the $5\%$ significance level) and concludes that its results are significant. Assume the tests do not overlap and that there may be causal relationships between the results. Assume also that each group has similar research quality. (This assumption surely breaks down in reality.)
If we aggregate the results, there are 60 hypothesis tests in total, whereas each group individually performs only 3. At the $5\%$ significance level, we would expect $60 \times 0.05 = 3$ type I errors after aggregation if every null hypothesis were true. The type II error would be inflated as well. An individual group sees only a few perspectives, but the aggregate covers many.
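To make the arithmetic concrete, here is a minimal sketch of the expected number of type I errors and the chance of at least one, for 3 tests versus 60. It assumes that every null hypothesis is true and that the tests are independent, which is a simplification of the scenario; the function names are my own.

```python
# Assumption: all m null hypotheses are true and the tests are independent.

def expected_false_positives(m, alpha=0.05):
    """Expected number of type I errors among m tests at level alpha."""
    return m * alpha

def familywise_error_rate(m, alpha=0.05):
    """Probability of at least one type I error among m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

for m in (3, 60):  # one group versus all 20 groups pooled
    print(m, expected_false_positives(m), round(familywise_error_rate(m), 3))
```

Under these assumptions a single group has roughly a 14% chance of at least one false positive, while the pooled 60 tests make at least one false positive very likely.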
How much trust should one place in the results of multiple groups studying the same data set? Of course, we can control the FDR or adjust the p-values accordingly.
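As a rough illustration of what such an adjustment could look like, the sketch below computes Bonferroni and Benjamini-Hochberg adjusted p-values in plain Python. The p-values are made up for illustration and are not from any of the 20 groups. The sketch also assumes that the p-values of all tests performed are available, which may not hold if only significant results are published.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values (controls the familywise error rate)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values, for illustration only.
pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

A result would then be called significant if its adjusted p-value is below the chosen level, for example $0.05$.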
Should one keep the first 6 published papers and discard the other 14, or should one randomly keep 6 papers and discard the other 14?