Someone I know is using linear regression to estimate a difference in means for parental stress, given a binary explanatory variable $X_i \in \{0,1\}$. They have found a small but significant effect.
They have a sample of 3000 parents, but it is very uneven: 2700 parents have $X_i=0$ and 300 have $X_i=1$, i.e., a 90/10 split. They are concerned that this could be a problem and are considering randomly drawing 300 out of the 2700 to get balanced groups. I have argued against this: the sample mean of 2700 parents will be closer to the population mean than the mean of just 300, and the standard error of the group mean shrinks as the group gets larger.
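To make that argument concrete, here is a small sketch of my own (assuming, for simplicity, equal standard deviations of 1 in both groups): the standard error of the difference in means is smaller if we keep all 3000 observations than if we throw 2400 of them away.

se_diff = function(n1, n2, sd1=1, sd2=1) sqrt(sd1^2/n1 + sd2^2/n2) # SE of the difference in means
se_diff(2700, 300) # keep the full, unbalanced sample: ~0.061
se_diff(300, 300)  # subsample down to 300/300:        ~0.082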
However, I have now read a very interesting old thread discussing this topic here on StackExchange, which raises the point that the power of a t-test is higher for balanced samples, i.e., the risk of committing a type II error is smaller with a 50/50 split:
How should one interpret the comparison of means from different sample sizes?
The accepted answer in the thread above shows how the power of a t-test is higher with a 50/50 sample than with 75/25 or 90/10 splits, all three with $N=100$. Increasing the power of the t-test is, as far as I can tell, the only reason for insisting on equal group sizes. I now want to revisit this topic and ask whether that result is still relevant for larger samples, such as $N=3000$.
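As I understand it, the reason balance helps is that, for a fixed total $N = n_1 + n_2$ and equal variances, the standard error of the difference in means, $\sigma\sqrt{1/n_1 + 1/n_2}$, is minimized when $n_1 = n_2$, so a balanced design extracts the most power from a given number of observations. But that argument holds the total $N$ fixed, whereas subsampling would shrink $N$ from 3000 to 600.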
The following R code is lifted from the thread linked above, with some alterations to include larger samples. The original post is here.
set.seed(9) # To get some of the same numbers
# as in the previous thread
power1090 = vector(length=10000) # Storing the p-values from each
power5050 = vector(length=10000) # simulated test to keep track of how many
power100900 = vector(length=10000) # are 'significant'
power3002700 = vector(length=10000)
for(i in 1:10000){ # Running the following procedure 10k times
n1a = rnorm(10, mean=0, sd=1) # Drawing 2 samples of sizes 90/10 from 2 normal
n2a = rnorm(90, mean=.5, sd=1) # distributions w/ dif means, but equal SDs
n1b = rnorm(50, mean=0, sd=1) # Same, but the samples are 50/50
n2b = rnorm(50, mean=.5, sd=1)
n1c = rnorm(100, mean=0, sd=1) # A 90/10 sample, with more observations
n2c = rnorm(900, mean=.5, sd=1)
n1d = rnorm(300, mean=0, sd=1) # A 90/10 sample with 3000 total observations
n2d = rnorm(2700, mean=.5, sd=1)
power1090[i] = t.test(n1a, n2a, var.equal=T)$p.value # here t-tests are run &
power5050[i] = t.test(n1b, n2b, var.equal=T)$p.value # the p-values are stored
power100900[i] = t.test(n1c, n2c, var.equal=T)$p.value # for each version
power3002700[i] = t.test(n1d, n2d, var.equal=T)$p.value
}
mean(power1090<.05) # The power for a 90/10 sample is about 32%.
[1] 0.3203
mean(power5050<.05) # For the 50/50 sample, the power increases to 70%.
[1] 0.7001 # This is clearly an improvement.
mean(power100900<.05) # But with much larger samples, the power is close
[1] 0.9967 # to 100%, even with uneven samples.
mean(power3002700<.05)
[1] 1
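For what it's worth, the same numbers can also be checked analytically rather than by simulation. This is my own addition and assumes the pwr package, whose pwr.t2n.test() computes the power of a two-sample t-test with unequal group sizes for a standardized difference d:

library(pwr) # assuming the 'pwr' package is installed
pwr.t2n.test(n1=10,  n2=90,   d=.5)$power # roughly matches the simulated 32%
pwr.t2n.test(n1=50,  n2=50,   d=.5)$power # ~70%
pwr.t2n.test(n1=100, n2=900,  d=.5)$power # ~99.7%
pwr.t2n.test(n1=300, n2=2700, d=.5)$power # effectively 100%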
The results show that a 50/50 sample is better than a 90/10 sample at $N=100$. But as the number of observations grows, the power of the t-test approaches 100%, even with a 90/10 split.
This leads me back to my initial opinion: there is no reason to reduce a sample to produce even groups, as long as the samples are sufficiently large. Does the community agree?