
I'm performing a number of single-factor parametric statistical tests (in my particular case, random-effects meta-analytic linear models, but I guess the problem would be just as relevant for, say, one-way ANOVAs) with different numbers of levels.

For some of these tests, I have highly unbalanced designs, where one of the levels has only a couple of data points (or maybe just one) while the others have plenty (as a concrete example, I might have 1 data point for one level and 80 and 100 for the other two).

Now, how should I handle cases like these? If all of them had come out non-significant, I wouldn't really have to worry about them that much, but some of them come out significant. I have zero faith in these results (that is, in their being generalizable), but I don't know how I can dismiss them. Just setting an arbitrary limit on how many data points a level must have for me to include it in the analysis feels, well, arbitrary, and I'm a bit confused about how to think here.

How can this problem be solved?

Speldosa
  • Why not simply use a multilevel model/mixed model? As I understand it, these are more robust to lack of design balance than ANOVA-type models, but you can answer the same kinds of questions. – Alexis Jun 08 '17 at 20:27

3 Answers


Below is an example with data to aid the conversation, rather than talking about 'random-effects meta-analytic linear models', in which case I do not really know what you want to do. I hope the example helps, and perhaps you can add a more specific example of your own to place your problem in better perspective.

You have two aspects to deal with.

  • Can we make predictions based on few data points?
  • How do we accept/reject hypothesis tests?

The example below illustrates the first aspect.

FEW DATA POINTS

By using the assumption that the variance is the same in every level, you can estimate confidence intervals even if you have only a single measurement for some level. This is because you can use the information from the many measurements at the other levels to estimate the variance of the measurements. Say the standard deviation is estimated to be about s; if a point lies at a distance of more than roughly 2s from some boundary, then you can consider the point significant.
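
To make that concrete, here is the usual pooled-variance interval from one-way ANOVA-type reasoning (my own sketch for illustration, not anything specific to the meta-analytic setting). With $k$ levels, $N$ observations in total, and a pooled residual standard deviation $s_p$ estimated from all levels, the mean of level $i$ with $n_i$ observations gets the interval

$$\bar{y}_i \;\pm\; t_{1-\alpha/2,\,N-k}\,\frac{s_p}{\sqrt{n_i}}$$

which is finite even when $n_i = 1$; the price is the equal-variance assumption, which cannot be checked for a level with a single point.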

Take note of the two very close measurements at 3 on the x-axis in the image below. Some people would estimate the standard error of the mean separately for each level and then decide that, because those two points happen to be very close together, the mean for that level is estimated accurately with a low standard error. (This can happen purely by chance, and this example, which uses randomly generated data, demonstrates the difficulty with such reasoning; see the short check after the code below.)

HYPOTHESIS REJECTION/ACCEPTANCE

On this aspect you can go in many directions with the technical details, and there is no clear-cut answer.

I do not entirely get your comment about

"If all of them just would have come out non-significant, I wouldn't really have to worry about them that much, but some of them come out significant."

1) What does it mean to 'come out non-significant'? Does your zero faith mean that you suspect false positives?

One thing that you have to look into is how to correct for, or otherwise deal with, multiple comparisons; there are many options, and the choice depends a bit on your situation and preferences.
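
For example, in R the base function p.adjust implements several of the standard corrections (a minimal sketch with made-up p-values, just to show the mechanics; which method is appropriate depends on your situation):

# Hypothetical p-values from several single-factor tests (made-up numbers)
p <- c(0.003, 0.02, 0.04, 0.2, 0.6)

p.adjust(p, method = "holm")  # Holm: controls the family-wise error rate
p.adjust(p, method = "BH")    # Benjamini-Hochberg: controls the false discovery rate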

2) I also wonder what you are doing with your results. What are the practical consequences of significant results?

Instead of worrying about false positives, I would worry about false negatives. If you have only a small number of measurement points, then the error in your estimate, or the spread of the posterior probability distribution (whichever flavour of statistical interpretation you prefer), will be large. This means that you could easily end up accepting the null hypothesis simply because your test is too weak to detect the effects.

You should also consider whether some relevant limit for the effect size lies inside the range of the estimated effect size, or inside the range of the posterior probability distribution. You might say that the outcome is non-significant because the null hypothesis is inside the 95% confidence interval... but what if some relevant effect size is also within that confidence interval (because the interval is so wide)?

This idea, and variations on the theme, can be used to determine how many measurements are needed, or how to interpret results that are neither significant in the sense of a test of the null hypothesis nor precise enough to detect the lowest boundary for a relevant effect size.
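
As a rough sketch of that kind of calculation (my own illustration, using a plain two-sample t-test as a stand-in for your actual model, with hypothetical numbers), base R's power.t.test relates the smallest relevant effect, the assumed spread, and the number of observations per group:

# Solve for the sample size needed to detect the smallest relevant difference
power.t.test(delta = 2, sd = 4, sig.level = 0.05, power = 0.80)

# Or the other way around: with only n = 2 per group, how much power is left?
power.t.test(n = 2, delta = 2, sd = 4, sig.level = 0.05)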

(Figure: 95% confidence intervals in an unbalanced design; generated by the R code below)

# Unbalanced one-factor example: most levels have many observations,
# but the level at x = 6 has only a single one.
x <- c(rep(1, 20), rep(2, 3), rep(3, 2), rep(5, 11), rep(6, 1), rep(7, 4))
y <- 4 * x + rnorm(length(x), mean = 0, sd = 2)
d <- data.frame(x = x, y = y)

# One-way model with x as a factor; the residual variance is pooled across levels.
m  <- lm(y ~ factor(x), data = d)
ci <- predict(m, interval = "confidence")   # fitted means with 95% intervals

# Group means with 95% confidence intervals, plus the raw observations
plot(x, predict(m), ylim = range(ci), pch = 21, col = 1, bg = 1)
arrows(x, ci[, 2], x, ci[, 3], length = 0.05, angle = 90, code = 3)
points(x, y, pch = 1, cex = 0.8)
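
As a small follow-up to the remark about the two nearby measurements at 3 (a sketch using the objects defined above; since the data are randomly generated, those two points will only sometimes end up close together):

# Naive per-level spread for the level with only two points,
# versus the pooled residual standard deviation that borrows from all levels
sd(y[x == 3])
summary(m)$sigma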

A key question is why you lack confidence in those results.

Is it because the algorithm you use to fit your model (or estimate confidence intervals) is expected to perform poorly at extremely low sample sizes? If so, I would establish an a priori threshold for inclusion (since you've already seen your results, I would ask a trusted colleague to come up with that threshold a priori).

Is it because you are worried that the population from which your data are drawn might violate model assumptions, and that with such low sample sizes you cannot confirm that the assumptions are approximately satisfied? If so, I would establish an a priori quantitative threshold for what makes you willing to assume that the assumptions are satisfied. This could be based, for example, on a power analysis for a test of non-normality or non-homogeneity of variance, or whatever the relevant assumptions are.
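
As an illustration of why that matters (my own sketch, not part of the answer): with only a handful of points, a normality test typically has limited power even against a clearly skewed alternative.

# Rough simulation: how often does a Shapiro-Wilk test flag log-normal data
# (a clear violation of normality) when there are only 5 observations?
set.seed(1)
mean(replicate(5000, shapiro.test(rlnorm(5))$p.value < 0.05))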


Here is how I approached this very same problem in a meta-analysis that I conducted. I had multiple levels, where one level had 30 effect sizes and another had 3. Now, depending on the sample sizes associated with those effect sizes, this could be a non-issue (e.g., 30 effect sizes with sample sizes ranging from 50-100 vs. 3 effect sizes with sample sizes of 500-1000). In our meta-analysis, this was not the case. Not only were the numbers of effect sizes unequal, but the sample sizes nested under those studies were also unequal.

What we did was redefine our levels in order to balance our groups.
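
The answer does not say exactly how the levels were redefined, but one common way to do this kind of regrouping in R is to collapse factor levels (a hypothetical sketch with made-up level names):

# Hypothetical factor with one sparsely populated level ("C")
f <- factor(c(rep("A", 30), rep("B", 25), rep("C", 3)))

# Merge the sparse level into a broader, substantively meaningful category
levels(f) <- c("A", "B_or_C", "B_or_C")
table(f)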

JWH2006