
I am interested in generating Gaussian mixture distributions as the null distributions for a series of two-sample test simulations. It is a well-established fact that p-values are uniformly distributed when the null hypothesis is true, and several excellent explanations of this have been offered on Stack Exchange. I observe this when testing between two Gaussian samples, but as soon as I begin to test between two multimodal Gaussian mixture samples, the distribution of p-values departs from uniformity and becomes negatively skewed. Why am I observing this behavior?

Examples

Below I demonstrate the expected uniform distribution of p-values from a two-sample Anderson-Darling test using the kSamples R package. (Please forgive the inefficiently written parts of the code.)

library(kSamples)

numsims <- 1000
repNums <- 1:numsims
pvalDF <- data.frame(ii = rep(0, length(repNums)),
                     pvals = rep(0, length(repNums)),
                     seed = rep(0, length(repNums)))

for (ii in 1:numsims) {
  set.seed(123 + ii)

  sample1 <- rnorm(50, mean = 0.5, sd = 0.05)
  sample2 <- rnorm(50, mean = 0.5, sd = 0.05)

  n1 <- length(sample1)
  n2 <- length(sample2)
  samples <- scale(c(sample1, sample2))
  sample1 <- samples[1:n1]
  sample2 <- samples[n1+1:n2]

  ad_out <- ad.test(sample1, sample2)

  pvalDF$ii[ii] <- ii
  pvalDF$pvals[ii] <- ad_out$ad["version 1:", 3]
  pvalDF$seed[ii] <- 321 + ii
}

hist(pvalDF$pvals, xlim = c(0, 1))

Histogram of approximately uniformly distributed p-values

However, as soon as the null distribution is constructed as a mixture of two Gaussian distributions the distribution of p-values departs from uniformity:

numsims <- 1000
repNums <- 1:numsims
pvalDF <- data.frame(ii = rep(0, length(repNums)),
                     pvals = rep(0, length(repNums)),
                     seed = rep(0, length(repNums)))

for (ii in 1:numsims) {
  set.seed(123 + ii)

  sample1.1 <- rnorm(25, mean = 0.5, sd = 0.05)  # Here are the new two samples each
  sample1.2 <- rnorm(25, mean = 0.75, sd = 0.05) # generated as a mixture from two
  sample1 <- c(sample1.1, sample1.2)             # identical Gaussian distributions.
  sample2.1 <- rnorm(25, mean = 0.5, sd = 0.05)
  sample2.2 <- rnorm(25, mean = 0.75, sd = 0.05)
  sample2 <- c(sample2.1, sample2.2)

  n1 <- length(sample1)
  n2 <- length(sample2)
  samples <- scale(c(sample1, sample2))
  sample1 <- samples[1:n1]
  sample2 <- samples[n1+1:n2]

  ad_out <- ad.test(sample1, sample2)

  pvalDF$ii[ii] <- ii
  pvalDF$pvals[ii] <- ad_out$ad["version 1:", 3]
  pvalDF$seed[ii] <- 321 + ii
}

hist(pvalDF$pvals, xlim = c(0, 1))

Histogram of non-uniformly distributed p-values

This seems counter-intuitive to me. In both cases the null hypothesis (the two samples come from the same underlying distribution) is true, yet the p-values are no longer uniformly distributed for the latter bimodal samples.

Although somewhat redundant, the effect is exaggerated further by adding more Gaussian components to both samples:

numsims <- 1000
repNums <- 1:numsims
pvalDF <- data.frame(ii = rep(0, length(repNums)),
                     pvals = rep(0, length(repNums)),
                     seed = rep(0, length(repNums)))

for (ii in 1:numsims) {
  set.seed(123 + ii)

  sample1.1 <- rnorm(10, mean = 0.5, sd = 0.05)
  sample1.2 <- rnorm(10, mean = 0.75, sd = 0.05)
  sample1.3 <- rnorm(10, mean = 1, sd = 0.05)
  sample1.4 <- rnorm(10, mean = 1.25, sd = 0.05)
  sample1.5 <- rnorm(10, mean = 1.5, sd = 0.05)
  sample1 <- c(sample1.1, sample1.2, sample1.3, sample1.4, sample1.5)
  sample2.1 <- rnorm(10, mean = 0.5, sd = 0.05)
  sample2.2 <- rnorm(10, mean = 0.75, sd = 0.05)
  sample2.3 <- rnorm(10, mean = 1, sd = 0.05)
  sample2.4 <- rnorm(10, mean = 1.25, sd = 0.05)
  sample2.5 <- rnorm(10, mean = 1.5, sd = 0.05)
  sample2 <- c(sample2.1, sample2.2, sample2.3, sample2.4, sample2.5)

  n1 <- length(sample1)
  n2 <- length(sample2)
  samples <- scale(c(sample1, sample2))
  sample1 <- samples[1:n1]
  sample2 <- samples[n1+1:n2]

  ad_out <- ad.test(sample1, sample2)

  pvalDF$ii[ii] <- ii
  pvalDF$pvals[ii] <- ad_out$ad["version 1:", 3]
  pvalDF$seed[ii] <- 321 + ii
}

hist(pvalDF$pvals, xlim = c(0, 1))

Negatively skewed histogram of p-values

What is causing this effect? I don't believe this is related to this post as these p-values can take on a much larger number of possible values. I have also reproduced this effect using a different test designed for bivariate samples.

1 Answer


You aren't generating two samples of independent observations from a Gaussian mixture, because you are fixing the number taken from each component rather than making it random.

If $X$ is a 50/50 mixture of $N(0.5, 0.05^2)$ and $N(0.75, 0.05^2)$ and you sample $n$ observations, the number of observations from the first component is Binomial$(n, 0.5)$, not exactly $n/2$. When you fix the number of observations from each component at $n/2$, the observations within each sample are not independent, so the two samples are more similar than you would expect from independent observations, and the p-values are stochastically larger than uniform on $[0,1]$.
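To make the component-count point concrete, here is a minimal sketch (not part of the original answer; names are illustrative) that draws each observation's component independently, which is equivalent to first drawing the component count from a Binomial:

set.seed(1)
n <- 50
# Each observation independently falls in component 1 with probability 0.5,
# so the number of component-1 observations is Binomial(n, 0.5), not fixed at n/2.
component <- rbinom(n, size = 1, prob = 0.5)
x <- rnorm(n, mean = ifelse(component == 1, 0.5, 0.75), sd = 0.05)
table(component)  # the split varies from draw to draw instead of always being 25/25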

I modified your code to sample the two components randomly:

numsims <- 1000
repNums <- 1:numsims
pvalDF <- data.frame(ii = rep(0, length(repNums)),
                     pvals = rep(0, length(repNums)),
                     seed = rep(0, length(repNums)))

for (ii in 1:numsims) {
  set.seed(123 + ii)

  m1 <- rbinom(1, 50, .5)
  sample1.1 <- rnorm(m1, mean = 0.5, sd = 0.05)       # Here are the new two samples each
  sample1.2 <- rnorm(50 - m1, mean = 0.75, sd = 0.05) # generated as a mixture from two
  sample1 <- c(sample1.1, sample1.2)                  # identical Gaussian distributions.

  m2 <- rbinom(1, 50, .5)
  sample2.1 <- rnorm(m2, mean = 0.5, sd = 0.05)
  sample2.2 <- rnorm(50 - m2, mean = 0.75, sd = 0.05)
  sample2 <- c(sample2.1, sample2.2)

  n1 <- length(sample1)
  n2 <- length(sample2)
  samples <- scale(c(sample1, sample2))
  sample1 <- samples[1:n1]
  sample2 <- samples[n1+1:n2]

  ad_out <- ad.test(sample1, sample2)

  pvalDF$ii[ii] <- ii
  pvalDF$pvals[ii] <- ad_out$ad["version 1:", 3]
  pvalDF$seed[ii] <- 321 + ii
}

hist(pvalDF$pvals, xlim = c(0, 1))

and the p-value distribution is boring (uniform) again:

Histogram of approximately uniformly distributed p-values

Thomas Lumley
  • Is the "n1+1:n2" actually fine? (See my comment for the original question.) – Christian Hennig Jun 03 '21 at 22:46
  • Yes, it's correct. n2 is the length of the second sample, so n1+1:n2 starts at the first index of the second sample, n1+1, and goes to the last index, n1+n2 – Thomas Lumley Jun 03 '21 at 22:55
  • Yeah. Stupid me. – Christian Hennig Jun 03 '21 at 23:03
  • Yes. This answer provided the piece I had overlooked. Thank you! Also, for those who may be interested in how this answer can be applied to my third example, using m1s <- rmultinom(1, 50, rep(0.2, 5))[, 1] and m2s <- rmultinom(1, 50, rep(0.2, 5))[, 1] will provide vectors of subsample sizes for the rnorm()s to achieve uniformly distributed p-values there as well. – computationalstatistician Jun 04 '21 at 06:32
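For reference, a minimal sketch of the multinomial allocation described in the last comment, applied to the five-component example (the means vector and helper names here are illustrative, not from the original post):

means <- c(0.5, 0.75, 1, 1.25, 1.5)
m1s <- rmultinom(1, 50, rep(0.2, 5))[, 1]  # random subsample sizes for sample 1
m2s <- rmultinom(1, 50, rep(0.2, 5))[, 1]  # random subsample sizes for sample 2
# Draw each component's subsample with its own mean, then pool into one sample of 50.
sample1 <- unlist(Map(function(n, mu) rnorm(n, mean = mu, sd = 0.05), m1s, means))
sample2 <- unlist(Map(function(n, mu) rnorm(n, mean = mu, sd = 0.05), m2s, means))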