I want to estimate the mean difference of a variable between two populations with some level of confidence while collecting as few samples as possible. The populations share the same variance and distribution, which may look normal, bimodal, or something else entirely. Right now, I use a t-test and collect 40 samples from each population to compensate for the fact that the distribution can be non-normal. I then use the difference-of-means formula for the t-test (written out below) to get a confidence interval for the mean difference. Typically this confidence interval is acceptably small. However, is there a way to do this using fewer samples? Can I leverage the fact that I know the distribution precisely?
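For reference, the interval I compute is (as far as I understand it) the standard pooled two-sample $t$ interval with $n$ samples per group:

$$(\bar{x}_1 - \bar{x}_2) \;\pm\; t_{1-\alpha/2,\;2n-2}\, s_p \sqrt{\frac{2}{n}}, \qquad s_p^2 = \frac{(n-1)s_1^2 + (n-1)s_2^2}{2n-2},$$

so with $n = 40$ the critical value uses $78$ degrees of freedom.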
The specific use case is that I'm building a startup-time benchmarking tool that takes two versions of an application, a control version and an experimental version, benchmarks each by recording its startup time $N$ times, and reports whether the experimental version starts up significantly slower or faster than the control. It has to run quickly, meaning with as few samples as possible, to give users results in an acceptable amount of time. The application's variance and distribution change very little over any 24-hour span, so every 24 hours I can collect a large dataset for the latest version of the application, e.g. 1,000 startup times, and use it for the next 24 hours for all versions of the application. However, over the past few years the distribution has slowly drifted between normal, bimodal, and other shapes, so it's not very stable beyond 24 hours.
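To make the current workflow concrete, here is a minimal sketch of what the tool does today (the function name, sample data, and numbers are illustrative, not my actual implementation):

```python
import numpy as np
from scipy import stats

def compare_startup_times(control, experimental, alpha=0.05):
    """Pooled two-sample t confidence interval for the mean startup-time
    difference (experimental - control). Illustrative sketch only."""
    n1, n2 = len(control), len(experimental)
    mean_diff = experimental.mean() - control.mean()
    # Pooled variance: both versions are assumed to share the same variance.
    sp2 = ((n1 - 1) * control.var(ddof=1) + (n2 - 1) * experimental.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    t_crit = stats.t.ppf(1 - alpha / 2, df=n1 + n2 - 2)
    return mean_diff, (mean_diff - t_crit * se, mean_diff + t_crit * se)

# Hypothetical usage: 40 startup-time measurements (seconds) per version.
rng = np.random.default_rng(0)
control = rng.normal(1.20, 0.15, size=40)
experimental = rng.normal(1.30, 0.15, size=40)
diff, (lo, hi) = compare_startup_times(control, experimental)
print(f"mean difference: {diff:.3f} s, 95% CI: ({lo:.3f}, {hi:.3f})")
# The tool flags a significant slowdown/speedup when the CI excludes zero.
```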