If I have two groups, one with a sample size of, say, 700,000 observations and another with 10,000 observations and I want to test the difference between the means of the two groups, what would be the best way to go about it?

  1. Using Welch's t-test because it is not affected by unequal variances (which usually show up because of the difference in sample sizes).
  2. Taking a random sample of 10k observations from the '700,000' group? I took 1000 samples of 10k from the bigger group and the p-value was always <0.05. Another interesting thing I read somewhere is that p-values are always low if the sample size is really big.
  3. Any better way of doing it?

Also, will the Welch's t-test results be untrustworthy because of the underlying skewed distributions?
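
For reference, option 1 can be sketched in Python with only the standard library. This is a minimal sketch under stated assumptions: the data are simulated (the means, spreads, and seed are made up for illustration and are not the data from the question), and the helper name `welch_t` is hypothetical. It computes Welch's statistic and the Welch-Satterthwaite degrees of freedom by hand, and it uses a normal approximation for the p-value, which is only reasonable because both groups are large.

```python
import math
import random
from statistics import NormalDist, fmean


def sample_variance(values):
    """Unbiased sample variance (n - 1 denominator)."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def welch_t(x, y):
    """Return (t, df, p) for Welch's unequal-variance t-test.

    The two-sided p-value uses a normal approximation to the t
    distribution, so it should only be trusted when df is large.
    """
    nx, ny = len(x), len(y)
    vx, vy = sample_variance(x), sample_variance(y)
    se2 = vx / nx + vy / ny  # squared standard error of the mean difference
    t = (fmean(x) - fmean(y)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p


# Assumed, simulated data: same true mean, different spreads and sizes.
rng = random.Random(12)
big = [rng.gauss(100, 10) for _ in range(700_000)]
small = [rng.gauss(100, 12) for _ in range(10_000)]

t, df, p = welch_t(big, small)
print(f"t = {t:.3f}, df = {df:.1f}, p = {p:.4f}")
```

No subsampling of the larger group is needed here: the standard error already weights each group's variance by its own sample size.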

Nick Cox
  • Not true that the P-value is always low for large sample sizes. If there really is a difference, a large sample will increase the probability of detecting it. But if there is no difference, a large sample won't 'invent' one for you. Example in R: set.seed(12); x = rnorm(1000,100,10); y = rnorm(1000,100,12); t.test(x,y)$p.val returns P-value 0.1664101. – BruceET Aug 08 '20 at 00:18
  • 1) Why do you say that the unequal variance shows up due to unequal sample sizes? 2) Do you mean Welch's t-test as opposed to the equal-variance t-test? – Dave Aug 08 '20 at 01:28
  • @BruceET Thanks for the example, I shouldn't have stated it like a fact. What I meant was that, because of the larger sample size, the test would be sensitive to even the smallest difference. Maybe using something like Cohen's d would help gauge the size of the effect? – Vardayini Aug 09 '20 at 08:25
  • @Dave 1) I'm a newbie, and I read in a lot of answers that the assumption of equal sample sizes is there to say that the variances are approximately equal in the two groups. 2) Yes, I mean Welch's t-test. My bad, I'll update the question. – Vardayini Aug 09 '20 at 08:28
  • Your last paragraph seems to be the only place you refer to "underlying skewed distributions". With pronounced skewness, even whether t tests of any flavour are a good idea could be at issue. Other way round, if skewness here is a slip for unequal spread, please fix the question. – Nick Cox Aug 09 '20 at 10:17
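
The effect-size idea raised in the comments (Cohen's d) can be sketched as follows. This is a minimal sketch, assuming simulated data with a deliberately small true difference in means (0.5 on a standard deviation of 10, so a true d of 0.05). The numbers, the seed, and the helper name `cohens_d` are assumptions for illustration. It uses the pooled-standard-deviation form of d, which with group sizes this unequal is dominated by the larger group's spread.

```python
import math
import random
from statistics import fmean


def sample_variance(values):
    """Unbiased sample variance (n - 1 denominator)."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def cohens_d(x, y):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = (
        (nx - 1) * sample_variance(x) + (ny - 1) * sample_variance(y)
    ) / (nx + ny - 2)
    return (fmean(x) - fmean(y)) / math.sqrt(pooled_var)


# Assumed, simulated data: a small true shift of 0.5 with SD 10 in both groups.
rng = random.Random(12)
big = [rng.gauss(100.5, 10) for _ in range(700_000)]
small = [rng.gauss(100.0, 10) for _ in range(10_000)]

d = cohens_d(big, small)
print(f"Cohen's d = {d:.3f}")
```

Reporting d (or simply the raw difference in means with a confidence interval) next to the p-value separates "is there a detectable difference?" from "is the difference big enough to matter?", which is the distinction the comment is after.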