I have two independent samples of population data, each with 500 K rows. The goal is to perform a hypothesis test of the claim that the new machine takes less time than the old machine. These 500 K rows are only a subset of a dataset with many millions of rows. What should the ideal sample size be for performing this test? Initially, I used power analysis with the desired power level, effect size, and significance level. However, I was told that a power analysis for sample size is only needed at the data-collection stage, and that since I already have a large enough dataset, I can simply use all the data I have in the test.
- Is this true? I am being challenged on this by my team. How should I determine the ideal sample size for performing the test (Mann-Whitney U)? (A power-analysis sketch follows this list.)
- Will a sample size that is larger than required produce incorrect results?
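For reference, a design-stage power calculation of the kind mentioned above can be sketched in R with power.t.test(). The effect size (delta), standard deviation, power, and significance level below are placeholder values rather than numbers from the actual machine data, and the 0.864 adjustment for the Mann-Whitney U test is a common conservative efficiency factor, not something taken from my analysis:

```r
# Placeholder design-stage calculation: sample size per group for a
# two-sample, one-sided t-test detecting a difference of 0.2 time units
# when the within-group standard deviation is 1 time unit.
pt <- power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.8,
                   type = "two.sample", alternative = "one.sided")
pt$n   # about 310 per group for these placeholder inputs

# A common conservative adjustment for the Mann-Whitney U / Wilcoxon
# rank-sum test is to divide by its worst-case asymptotic relative
# efficiency of 0.864 relative to the t-test (about 0.955 under
# normality), i.e. roughly 15% more observations per group.
ceiling(pt$n / 0.864)
```

With 500 K rows per group already collected, the practical concern is less about having enough power and more about interpretation: at that size even a tiny, practically irrelevant difference will come out statistically significant, which is why the comments below point towards estimating the size of the difference with a confidence interval.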
However, the idea of measuring the difference is really interesting. Since I'm new to this, I performed the test with the limited knowledge I could gather within a couple of weeks. Could you provide some references or links where I can find more information about this? Any further feedback or advice on what I'm missing would also be really helpful. – AKK Nov 07 '23 at 10:57
t.test() even reports a confidence interval for the difference between the two means. The only tricky point to take care of is whether you feed both machines identical input (then you have paired=TRUE data), or whether you feed them input independently at random (then you have paired=FALSE data). – cdalitz Nov 07 '23 at 11:58
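For illustration, the two calls discussed in the comments might look like this in R, with simulated timing data standing in for the real rows (the exponential distributions and their rates are made up, and the variable names are hypothetical):

```r
set.seed(1)
old <- rexp(1000, rate = 1 / 5.0)  # hypothetical processing times, old machine
new <- rexp(1000, rate = 1 / 4.8)  # hypothetical processing times, new machine

# Independent inputs fed to each machine -> unpaired comparison.
# alternative = "greater" tests the claim that old times exceed new times,
# i.e. that the new machine is faster; t.test() also reports a confidence
# interval for the difference between the two means.
t.test(old, new, alternative = "greater", paired = FALSE)

# Mann-Whitney U (Wilcoxon rank-sum) test; conf.int = TRUE additionally
# reports the Hodges-Lehmann estimate of the location shift together with
# a confidence interval.
wilcox.test(old, new, alternative = "greater", paired = FALSE, conf.int = TRUE)

# If both machines processed the *same* set of inputs, set paired = TRUE
# so each input is compared with itself across the two machines.
```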