To estimate sample sizes you need not only the expected difference in means but also the variance of the individual measurements. Say that you have transformed the time axis so that the average time per "control device" test is 1 unit. Then you want to document that the average time per "test device" test is no greater than 0.8 units on that same scale.
As this isn't a paired design, the sample sizes needed will depend on the variability of test times within each device type. If the variances of test times are similar between the two devices, what matters is the ratio of that 0.2-unit difference to the standard deviation $\sigma$ of test times (in the same transformed-time units) within a device type. That ratio, $0.2/\sigma$, is the hypothesized standardized "effect size" that you want to be able to detect.
That variability thus might be determined by the characteristics of whatever it is you're testing even more than the variability introduced by the test systems themselves. That's the variability due to the "other factors" that "blur" the results. You presumably have some information on that variability of test times among the test objects, based on your understanding of the subject matter.
This PubMed page has a link to a free copy of "Practical guide to sample size calculations: superiority trials" by L. Flight and S. A. Julious, Pharmaceutical Statistics 15(1): 75-79.* You choose the risk trade-offs you are willing to take, in terms of the balance between false-positive (Type I) and false-negative (Type II) errors from the study. That combination of variance, effect size, and risk trade-offs determines the sample size.
If the underlying transformed-time values are normally distributed with the same variance $\sigma^2$ in both groups, Equation 3 of that paper provides a simple formula for the size of each group under 1:1 allocation, based on a Type I error of 0.05 and a Type II error of 0.1 (i.e., 90% power). Multiplying by 2 to get the total sample size gives:
$$ N = \frac{42 \sigma^2}{d^2}, $$
where $d$ is the difference in mean test times that you wish to detect and $\sigma^2$ is as above. The paper provides much more detail for other scenarios, but this is a good guide for an initial design. For your desired difference of $d = 0.2$ transformed-time units, that's equivalent to:
$$ N = 1050 \sigma^2, $$
with $\sigma$ in transformed-time units that bring the mean "control device" test time to a value of 1.
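As a quick numeric sanity check of that formula, here is a sketch using the normal-approximation z-values directly (scipy assumed available; $\sigma = 0.5$ is purely an illustrative guess, not an estimate from your data):

```python
from scipy.stats import norm

def total_n(delta, sigma, alpha=0.05, power=0.90):
    """Total sample size (both groups combined, 1:1 allocation) for a
    two-sided two-sample comparison of means, normal approximation."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * 2 * z**2 * sigma**2 / delta**2  # 2 groups x n per group

# With delta = 0.2 transformed-time units and an assumed sigma of 0.5:
print(total_n(0.2, 0.5))  # ~262.7; the rounded formula 42*sigma^2/d^2 gives 262.5
```

Round up to the next even integer in practice so the two groups are equal.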
The problem, as pointed out by Michael Chernick, is:
In practice you never know the true variance. What you do is guess at it. This can be done by looking at how the answer changes as the population variance varies over a range of plausible values. You might find the plausible values through literature on similar studies or through your own pilot study, or you can do a two-stage adaptive design where the initial stage is used primarily to determine if you need additional data and, if so, how much.
If you have the resources to do a pilot study just on the "test device," or if you already have enough data on it to estimate the variance of its measurements, you could use that as the basis for the guess. As a start you could assume the same variance for the "control device"; perhaps the literature on the established "control device" reports its variability.
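One way to act on that advice is a simple sensitivity sweep: tabulate the implied total sample size over a range of plausible $\sigma$ values. The grid of $\sigma$ values below is illustrative only, not derived from your data:

```python
import math

# Total N = 42*sigma^2/d^2 for a fixed difference d = 0.2 transformed-time
# units, swept over a (made-up) range of plausible standard deviations.
def n_total(sigma, d=0.2):
    return math.ceil(42 * sigma**2 / d**2)

for sigma in (0.25, 0.5, 0.75, 1.0):
    print(f"sigma = {sigma:.2f} -> total N = {n_total(sigma)}")
```

The quadratic dependence on $\sigma$ is the point: doubling the guessed standard deviation quadruples the required sample size, which tells you how costly an optimistic variance guess can be.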
As Chernick said, this type of problem is sometimes handled by an adaptive design. This answer provides links to related resources. In particular, if you use the first stage of the adaptive design only to estimate variances better, without evaluating treatment differences, there is little risk of inflating the false-positive rate by adjusting the total sample size for the second stage.
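As a toy sketch of that internal-pilot idea (the stage-1 data below are made up for illustration): estimate the variance from stage-1 data alone, without any treatment comparison, then recompute the total sample size with the same formula:

```python
import math

# Hypothetical stage-1 test times in transformed-time units (made-up numbers)
pilot = [0.9, 1.3, 0.7, 1.1, 1.4, 0.8, 1.0, 1.2, 0.65, 1.0]

mean = sum(pilot) / len(pilot)
var_hat = sum((x - mean) ** 2 for x in pilot) / (len(pilot) - 1)  # sample variance

# Recompute total N with the re-estimated variance (d = 0.2 as before);
# no treatment-difference test is performed at this stage.
n_total = math.ceil(42 * var_hat / 0.2**2)
print(f"estimated sigma^2 = {var_hat:.3f}, recomputed total N = {n_total}")
```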
Finally, consider whether you really want to do a superiority test. If your "test device" is simpler or cheaper than the "control device," a non-inferiority test could support its preferred use with a much smaller sample size. This web page illustrates how equivalence, superiority, and non-inferiority tests require different sample sizes.
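To see the kind of difference that can make, here is a rough normal-approximation comparison. The values $\sigma = 0.5$ and a non-inferiority margin of 0.3 are purely hypothetical; a real margin needs subject-matter justification:

```python
from math import ceil
from scipy.stats import norm

sigma, power = 0.5, 0.90  # assumed SD of test times (transformed units); 90% power

# Superiority: two-sided alpha = 0.05, detect a true difference of 0.2
z_sup = norm.ppf(0.975) + norm.ppf(power)
n_sup = ceil(4 * z_sup**2 * sigma**2 / 0.2**2)   # total N, 1:1 allocation

# Non-inferiority: one-sided alpha = 0.05, margin 0.3, true difference assumed 0
z_ni = norm.ppf(0.95) + norm.ppf(power)
n_ni = ceil(4 * z_ni**2 * sigma**2 / 0.3**2)

print(f"superiority total N ~ {n_sup}, non-inferiority total N ~ {n_ni}")
```

The saving comes both from the one-sided test and, mainly, from the margin being wider than the superiority difference; with a margin equal to the difference the two sample sizes would be much closer.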
*The page also contains links to similar articles on equivalence and non-inferiority tests.
However, I have no idea what the variance would be - any estimate of it would be a complete guess. Therefore, I worry that a sample size calculation would be based on so much guessing that it might be just as plausible to simply say "we run this study for X months".
Good point - but we KNOW that the test will be quicker (simple logic, since it does fewer steps) - the question is only how much quicker.
– user12541161 Jun 05 '23 at 08:54

Regarding your second comment: you are completely right. We don't know if the test device is as good as the control device. And this is of course also a part of the study. However, our main objective is the time difference, since this is the biggest concern at the moment.
– user12541161 Jun 05 '23 at 10:32