In a previous question, I asked how to test if a coin is fair. Now I want to empirically test if this test works.

One answer is that programs like R and Python have built-in binomial tests that return a p-value and can be called to do this.

Here is an example of some Python code to run such a test for a single case of flipping 1000 fair coins:

import numpy as np
from scipy import stats

def flipAndDoPTest(numberOfFlips, weight):
    # simulate numberOfFlips coins, each landing heads with probability weight
    flippedCoinSimulation = np.random.binomial(1, weight, numberOfFlips)
    numberOfHeads = np.sum(flippedCoinSimulation == 1)
    numberOfTails = np.sum(flippedCoinSimulation == 0)
    # exact two-sided binomial test of H0: P(heads) = weight
    # (newer SciPy versions rename this to stats.binomtest)
    pvalue = stats.binom_test(numberOfHeads, numberOfFlips, weight)
    return pvalue

numberOfFlips = 1000
weight = .50
ptestvalue = flipAndDoPTest(numberOfFlips, weight)
if ptestvalue > .05:
    print("the ptest has a value of:", ptestvalue)
    print("The null hypothesis cannot be rejected at the 5% level of significance because the returned p-value is greater than the critical value of 5%.")
if ptestvalue < .05:
    print("the ptest has a value of:", ptestvalue)
    print("The null hypothesis can be rejected at the 5% level of significance because the returned p-value is less than the critical value of 5%.")

Now I want to empirically test what this "5% level of significance" means. It seems like there is a lot of disagreement over the interpretation of p-values, so I want to just simulate what occurs in my case.

First I would like to test whether a p-value less than .05 occurs 5% of the time for a fair coin. To do this, I repeat the p-test 1000 times (each p-test is for the event of flipping a fair coin 10,000 times) and count all of the times the p-value is less than .05. The code for this is here:

numberOfFlips = 10000
weight = .50
numberOfTests = 1000
StatisticalSignificanceProbability = .05
pTestList = np.zeros(numberOfTests) #initialization
for i in range(numberOfTests):
    #for each i in the loop, do a p-test of 10,000 fair coin flips and add it to a list
    ithPTest = flipAndDoPTest(numberOfFlips, weight)
    pTestList[i] = ithPTest
#take this list and count all of the times there are cases below .05
numberOfSheerCoincidences = sum(pTestList<StatisticalSignificanceProbability)
expectedNumberOfSheerCoincidences = numberOfTests*StatisticalSignificanceProbability

print("numberOfSheerCoincidences: ", numberOfSheerCoincidences) print("expectedNumberOfSheerCoincidences: ", expectedNumberOfSheerCoincidences)

Now I expect that 5% of my 1000 p-tests will be less than .05 (so .05*1000 = 50). But every time I run it I get a number that is noticeably smaller than 50. Since this count is itself random, I then wrote code to repeat the whole procedure (this time with 100 flips per test) and build a histogram of the resulting counts:

numberOfFlips = 100
weight = .50   
numberOfDataPoints = 1000
pTestResultsDataPoints = np.zeros(numberOfDataPoints) #initialization
for j in range(numberOfDataPoints):
    #repeat this collection of p-tests to get a range of different counts
    numberOfTests = 1000
    StatisticalSignificanceProbability = .05
    pTestList = np.zeros(numberOfTests) #initialization
    for i in range(numberOfTests):
        ithPTest = flipAndDoPTest(numberOfFlips, weight)
        pTestList[i] = ithPTest
    numberOfSheerCoincidences = sum(pTestList<StatisticalSignificanceProbability)
    expectedNumberOfSheerCoincidences = numberOfTests*StatisticalSignificanceProbability
    pTestResultsDataPoints[j] = numberOfSheerCoincidences

import matplotlib.pyplot as plt

n, bins, patches = plt.hist(pTestResultsDataPoints, 50)
plt.show()

And with that result I get a distribution that is centered around 35 and not 50.

Is this result expected? I was expecting a normal distribution around 50.

2 Answers


It is (nearly) impossible for a test statistic that has a discrete distribution to attain exactly a 5% (or any other) significance level. So instead, many so-called "exact" tests use the largest attainable significance level that is less than $\alpha$ (or 5% in your case). That is what you are seeing.
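As a quick check (a minimal Python sketch of my own, using scipy.stats.binom; the loop bounds and names are just illustrative), you can list the attainable two-sided levels near 5% for $n = 100$ and see that the largest one not exceeding 0.05 is about 0.0352, so about 0.0352 * 1000 = 35.2 of your 1000 tests should reject, matching your histogram centered near 35:

from scipy import stats

n = 100      # flips per test, as in the histogram run in the question
p0 = 0.5     # null hypothesis: the coin is fair

# By symmetry of Binom(n, 1/2), the two-sided rejection region
# {X <= n-k or X >= k} has exact size 2*P(X >= k).
for k in range(59, 63):
    attainableLevel = 2 * stats.binom.sf(k - 1, n, p0)
    print(k, round(attainableLevel, 4))

# k = 61 gives 0.0352, the largest attainable level below 0.05,
# so one expects about 0.0352 * 1000 = 35.2 rejections in 1000 tests.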

Here is a reference that elaborates:

https://www.jstor.org/stable/2684683?origin=crossref

BigBendRegion

In general, there is no such binomial test with significance level exactly $\alpha = 0.05,$ on account of the discreteness of binomial distributions.

For an exact test at $\alpha = 0.05$ based on a continuous test statistic, the distribution of the P-value when $H_0$ is true would be standard uniform, and the probability that the P-value falls below $0.05$ would be exactly $0.05.$
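To see what that uniform benchmark looks like, here is a small Python sketch of my own (not part of the R code below): a one-sample z-test with known $\sigma$ has a continuous test statistic, and its simulated P-values fall below 0.05 almost exactly 5% of the time:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2021)
reps, n = 10**5, 100

# z-test of H0: mu = 0 with known sigma = 1. Under H0 the statistic
# is exactly N(0,1), so the two-sided P-value 2*P(Z > |z|) is
# exactly Uniform(0,1).
z = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1) * np.sqrt(n)
pvals = 2 * stats.norm.sf(np.abs(z))

print(np.mean(pvals < 0.05))   # close to 0.05, unlike the binomial test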

If $n = 100,$ testing $H_0: p = .5$ against $H_a: p \ne 0.5,$ the closest one can get to a test at the 5% level (without going over 5%) is $0.0352 = 3.52\%.$

2*(1 - pbinom(60, 100, .5))
[1] 0.0352002
2*(1 - pbinom(59, 100, .5))
[1] 0.05688793
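(The same two numbers can be reproduced in Python, the question's language; this is just my translation, with R's pbinom replaced by scipy.stats.binom.cdf.)

from scipy import stats

print(2 * (1 - stats.binom.cdf(60, 100, .5)))   # 0.0352002
print(2 * (1 - stats.binom.cdf(59, 100, .5)))   # 0.05688793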

[It does not help to use a normal approximation with nominal 5% level because z values close to $\pm 1.96$ cannot be achieved. For $\mathsf{Binom}(100,.5)$ the normal approximation is very accurate so it hardly matters whether one does an 'exact' binomial test or an approximate normal one.]
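Here is a rough illustration of why $\pm 1.96$ is unreachable (my own sketch, ignoring the continuity correction that prop.test applies by default): for $n = 100$ the achievable z values step by $1/5 = 0.2$ near the boundary, jumping from 1.8 to 2.0.

import numpy as np

n, p0 = 100, 0.5
se = np.sqrt(n * p0 * (1 - p0))   # sd of the head count under H0: 5

# z jumps in steps of 1/se = 0.2 near the boundary, skipping 1.96
for heads in range(58, 63):
    print(heads, (heads - n * p0) / se)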

Below I simulate 100,000 tests in R for $n=100$ observations, and summarize and plot histograms of P-values. Probabilities should be accurate to about two places. The binomial test 'binom.test' and approximate normal test 'prop.test' both have the anticipated P-values.

set.seed(2021);  n = 100
pv.b = replicate(10^5, 
        binom.test(rbinom(1,n,.5),n,.5)$p.val)
mean(pv.b < 0.05)
[1] 0.03605        # aprx 0.0352
2*sd(pv.b < 0.05)/sqrt(10^5)
[1] 0.001178995    # aprx 95% margin of sim error

set.seed(2021); n = 100
pv.n = replicate(10^5,
        prop.test(rbinom(1,n,.5),n,.5)$p.val)
mean(pv.n < 0.05)
[1] 0.03605

The figures below show the simulated distributions of the P-values under $H_0.$

[Figure: histograms of the 100,000 simulated P-values for binom.test (left) and prop.test (right), each with the standard uniform density overlaid and a red vertical line at 0.05.]

R code for figure:

par(mfrow=c(1,2))
hist(pv.b, prob=T, xlim=c(-.01,1.01), col="skyblue2")
 abline(v = .05, col="red")
 curve(dunif(x), add=T, n=10001, lwd=2)
hist(pv.n, prob=T, xlim=c(-.01,1.01), col="skyblue2")
 abline(v = .05, col="red")
 curve(dunif(x), add=T, n=10001, lwd=2)
par(mfrow=c(1,1))

Notes: (1) At the resolution of the plot, the two histograms look the same, but there are minuscule differences in the distributions: the binomial and approximate normal tests returned exactly the same P-value in only 8070 of the 100,000 tests. The discrepancies were tiny, however, and the two tests never disagreed about rejection at the 5% level.

sum(pv.b == pv.n)
[1] 8070
mean(abs(pv.b - pv.n))
[1] 0.0001792384
sum((pv.n <= .05) == (pv.n <=.05))
[1] 100000

(2) Without using a randomized test, the closest $\alpha$ to 5% without going over for $n=1000$ is $0.0463 = 4.63\%:$

2*(1-pbinom(531,1000,.5))
[1] 0.0462912
2*(1-pbinom(530,1000,.5))
[1] 0.05367785
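(A hedged Python counterpart of this search, my own addition: it finds the largest attainable level not exceeding 5% for any $n$, including the $n = 1000$ and $n = 10000$ used in the question.)

from scipy import stats

def largestAttainableAlpha(n, p0=0.5, target=0.05):
    # attainable two-sided levels 2*P(X >= k) decrease as k grows,
    # so return the first one that does not exceed the target
    for k in range(n // 2 + 1, n + 1):
        alpha = 2 * stats.binom.sf(k - 1, n, p0)
        if alpha <= target:
            return k, alpha

for n in (100, 1000, 10000):
    print(n, largestAttainableAlpha(n))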
BruceET