I have two independent samples of population data, each with 500 K rows. The goal is to perform a hypothesis test of the claim that the new machine takes less time than the old machine. These 500 K rows are only a subset of a dataset with many millions of rows. What should the ideal sample size be for performing this test? Initially, I used power analysis with the desired power level, effect size, and significance level. However, I was told that a power analysis for sample size is only needed at the data-collection stage, and that since I already have a large enough dataset, I can simply use all the data I have in the test.
- Is this true? I am being challenged on this by my team. How should I determine the ideal sample size for performing the test (Mann-Whitney U)? (A power-analysis sketch follows this list.)
- Will a sample size that is larger than required produce incorrect results?
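For reference, a design-stage power calculation of the kind mentioned above can be sketched in R with power.t.test(). The effect size (delta), standard deviation, power, and significance level below are placeholder values rather than numbers from the actual machine data, and the 0.864 adjustment for the Mann-Whitney U test is a common conservative efficiency factor, not something taken from my analysis:

```r
# Placeholder design-stage calculation: sample size per group for a
# two-sample, one-sided t-test detecting a difference of 0.2 time units
# when the within-group standard deviation is 1 time unit.
pt <- power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.8,
                   type = "two.sample", alternative = "one.sided")
pt$n   # about 310 per group for these placeholder inputs

# A common conservative adjustment for the Mann-Whitney U / Wilcoxon
# rank-sum test is to divide by its worst-case asymptotic relative
# efficiency of 0.864 relative to the t-test (about 0.955 under
# normality), i.e. roughly 15% more observations per group.
ceiling(pt$n / 0.864)
```

With 500 K rows per group already collected, the practical concern is less about having enough power and more about interpretation: at that size even a tiny, practically irrelevant difference will come out statistically significant, which is why the comments below point towards estimating the size of the difference with a confidence interval.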
However, the idea of measuring the difference is really interesting. Since I'm new to this, I performed the test with the limited knowledge I could gather within a couple of weeks. Could you provide some references or links where I can find more information about this? Any further feedback or advice on what I'm missing would also be really helpful. – AKK Nov 07 '23 at 10:57
t.test() even reports a confidence interval for the difference between the two means. The only tricky point to take care of is whether you feed both machines identical input (then you have paired=TRUE data), or whether you feed them input independently at random (then you have paired=FALSE data). – cdalitz Nov 07 '23 at 11:58
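For illustration, the two calls discussed in the comments might look like this in R, with simulated timing data standing in for the real rows (the exponential distributions and their rates are made up, and the variable names are hypothetical):

```r
set.seed(1)
old <- rexp(1000, rate = 1 / 5.0)  # hypothetical processing times, old machine
new <- rexp(1000, rate = 1 / 4.8)  # hypothetical processing times, new machine

# Independent inputs fed to each machine -> unpaired comparison.
# alternative = "greater" tests the claim that old times exceed new times,
# i.e. that the new machine is faster; t.test() also reports a confidence
# interval for the difference between the two means.
t.test(old, new, alternative = "greater", paired = FALSE)

# Mann-Whitney U (Wilcoxon rank-sum) test; conf.int = TRUE additionally
# reports the Hodges-Lehmann estimate of the location shift together with
# a confidence interval.
wilcox.test(old, new, alternative = "greater", paired = FALSE, conf.int = TRUE)

# If both machines processed the *same* set of inputs, set paired = TRUE
# so each input is compared with itself across the two machines.
```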