In a previous question, I asked how to test whether a coin is fair. Now I want to empirically test whether that test works.
One answer is that languages like R and Python have built-in binomial tests that can be called to do this.
Here is an example of some Python code that runs such a test for a single case of flipping a fair coin 1000 times:
import numpy as np
from scipy import stats

def flipAndDoPTest(numberOfFlips, weight):
    # simulate numberOfFlips flips of a coin with P(heads) = weight (1 = heads, 0 = tails)
    flippedCoinSimulation = np.random.binomial(1, weight, numberOfFlips)
    numberOfHeads = np.sum(flippedCoinSimulation == 1)
    numberOfTails = np.sum(flippedCoinSimulation == 0)
    # two-sided exact binomial test of the null hypothesis P(heads) = weight
    pvalue = stats.binom_test(numberOfHeads, numberOfFlips, weight)
    return pvalue
numberOfFlips = 1000
weight = .50
ptestvalue = flipAndDoPTest(numberOfFlips, weight)
if ptestvalue > .05:
    print("the p-test has a value of:", ptestvalue)
    print("The null hypothesis cannot be rejected at the 5% level of significance because the returned p-value is greater than the 5% significance level.")
if ptestvalue < .05:
    print("the p-test has a value of:", ptestvalue)
    print("The null hypothesis can be rejected at the 5% level of significance because the returned p-value is less than the 5% significance level.")
Now I want to empirically test what this "5% level of significance" means. There seems to be a lot of disagreement over the interpretation of p-values, so I just want to simulate what happens in my case.
First I would like to test whether a p-value less than .05 occurs 5% of the time for a fair coin. To do this, I repeat the p-test 1000 times (each p-test is for flipping a fair coin 10000 times) and count all of the times the p-value comes out below .05. The code for this is here:
numberOfFlips = 10000
weight = .50
numberOfTests = 1000
StatisticalSignificanceProbability = .05

pTestList = np.zeros(numberOfTests)  # initialization
for i in range(numberOfTests):
    # for each i in the loop, do a p-test of 10,000 fair coin flips and add it to the list
    ithPTest = flipAndDoPTest(numberOfFlips, weight)
    pTestList[i] = ithPTest

# take this list and count all of the times the p-value falls below .05
numberOfSheerCoincidences = np.sum(pTestList < StatisticalSignificanceProbability)
expectedNumberOfSheerCoincidences = numberOfTests * StatisticalSignificanceProbability

print("numberOfSheerCoincidences: ", numberOfSheerCoincidences)
print("expectedNumberOfSheerCoincidences: ", expectedNumberOfSheerCoincidences)
Now I expect that about 5% of my 1000 p-tests will be less than .05 (so .05 * 1000 = 50). But every time I run it I get a number that is noticeably smaller than 50. Since this count is itself random, I then wrote code to repeat the whole procedure many times and plot a histogram of the resulting counts:
import matplotlib.pyplot as plt

numberOfFlips = 100
weight = .50
numberOfDataPoints = 1000

pTestResultsDataPoints = np.zeros(numberOfDataPoints)  # initialization
for j in range(numberOfDataPoints):
    # repeat the whole collection of p-tests to get a range of different counts
    numberOfTests = 1000
    StatisticalSignificanceProbability = .05
    pTestList = np.zeros(numberOfTests)  # initialization
    for i in range(numberOfTests):
        ithPTest = flipAndDoPTest(numberOfFlips, weight)
        pTestList[i] = ithPTest
    numberOfSheerCoincidences = np.sum(pTestList < StatisticalSignificanceProbability)
    expectedNumberOfSheerCoincidences = numberOfTests * StatisticalSignificanceProbability
    pTestResultsDataPoints[j] = numberOfSheerCoincidences

n, bins, patches = plt.hist(pTestResultsDataPoints, 50)
plt.show()
With that I get a distribution that is centered around 35 rather than 50.

Is this result expected? I was expecting the counts to be roughly normally distributed around 50.
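To make that expectation concrete: under my (naive) assumption that each p-test independently rejects with probability exactly 0.05, the count of rejections out of 1000 tests would be Binomial(1000, 0.05), with mean 50 and standard deviation of roughly 6.9. A small sketch that puts numbers on this assumption (not necessarily on the true behavior of the exact binomial test) is:

import numpy as np
from scipy import stats

numberOfTests = 1000
assumedRejectionProbability = .05  # my naive per-test rejection probability

# if each test rejected with probability exactly 0.05, the number of rejections
# out of 1000 tests would be Binomial(1000, 0.05)
expectedMean = numberOfTests * assumedRejectionProbability
expectedStd = np.sqrt(numberOfTests * assumedRejectionProbability * (1 - assumedRejectionProbability))
low, high = stats.binom.ppf([.025, .975], numberOfTests, assumedRejectionProbability)

print("expected mean number of rejections:", expectedMean)   # 50.0
print("expected standard deviation:", expectedStd)           # about 6.9
print("central 95% range of counts:", low, "to", high)       # roughly 37 to 64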
