
I am trying to reproduce, using Python, the effect described in the answer here: Is normality testing 'essentially useless'?

It states that with a large sample size (n > 1000) the Shapiro-Wilk test becomes sensitive to even small deviations from normality (for example a few outliers), so the null hypothesis of normality is more likely to be rejected. In other words, on repeated simulations of almost normally distributed data, the test will more often return a p-value below 0.05 (rejecting the null hypothesis that the data are normally distributed).
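
To be concrete about the effect I am trying to reproduce, here is a minimal sketch of that claim; the use of a Student's t distribution with 10 degrees of freedom as the "almost normal" data is my own illustrative choice, not something taken from the linked answer.

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Slightly non-normal data: Student's t with 10 degrees of freedom.
# The rejection rate at the 5% level should grow with the sample size.
for n in (100, 1000, 5000):
    pvals = [shapiro(rng.standard_t(df=10, size=n))[1] for _ in range(200)]
    print(n, np.mean(np.array(pvals) < 0.05))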

However, when I try this with the Python implementation, I almost never obtain p < 0.05, regardless of the sample size used to generate the data. I checked the implementation of the Shapiro-Wilk test in both Python (scipy.stats) and R (the stats package), and they use the same algorithm, from ALGORITHM AS R94, Appl. Statist. (1995), Vol. 44, No. 4.

Why do I not obtain similar results with scipy? I attach my code below.

import pandas as pd
import numpy as np
from scipy.stats import shapiro

distributions = []

for _ in range(100):
    # Shapiro-Wilk p-value for each sample size, with the same five
    # extra points [1, 0, 2, 0, 1] appended to every sample
    tmp_dist = [shapiro(np.concatenate((np.random.normal(0, 1, n),
                                        [1, 0, 2, 0, 1])))[1]
                for n in (10, 100, 1000, 5000, 20000)]
    distributions.append(tmp_dist)

df = pd.DataFrame(distributions, columns=['n10', 'n100', 'n1000', 'n5000', 'n20000'])
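
To summarise the results, the fraction of replications with p < 0.05 can be computed per column, for example:

print((df < 0.05).mean())  # column-wise rejection rate at the 5% level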

  • There was some disagreement about the assertions you reference. See the comment thread beginning with my comment at https://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless#comment143402_2498. Please note that if by "almost Normally distributed data" you mean a data generation process that isn't Normal, then of course as the sample size grows any decent test will tend to yield small p-values. But your code seems to test only the null distribution, so it ought to generate $p\lt 0.05$ only 5% of the time. What, then, is wrong? – whuber May 17 '23 at 15:22
  • Thank you for the link and the comment. So the additional data I concatenate with the normal sample, [1, 0, 2, 0, 1], fall within the parameters of the defined normal distribution, and therefore the Shapiro-Wilk test for normality will not be affected; do I understand it right? – makkreker May 17 '23 at 15:41
  • 1
    The basic ideas are (1) when you generate independent data with a truly Normal distribution, a correct implementation of a Normality test ought to yield a uniform distribution of p-values between $0$ and $1$ (except for small sample sizes) and (2) when you generate data in any other way, as the sample size grows the p-value distribution should shift more and more towards zero. You appear to be checking (1) in Python whereas the thread you reference verifies (a special case of) (2). – whuber May 17 '23 at 15:53
  • Thank you for a detailed answer, it is very clear – makkreker May 17 '23 at 18:06
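
A small sketch of the two ideas in whuber's comment above; the sample size, the number of replications and the contaminated generator used for case (2) are illustrative choices, not anything prescribed in the thread:

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
n, reps = 2000, 500

# (1) Truly normal data: the p-values should be roughly uniform on [0, 1],
#     so about 5% of them fall below 0.05.
p_null = [shapiro(rng.normal(0, 1, n))[1] for _ in range(reps)]
print("normal data, rejection rate:", np.mean(np.array(p_null) < 0.05))

# (2) A non-normal generating process (here 5% of the points come from a
#     wider normal, an arbitrary illustrative choice): the p-value
#     distribution shifts towards zero, so the rejection rate is much higher.
p_mix = [shapiro(np.concatenate((rng.normal(0, 1, int(0.95 * n)),
                                 rng.normal(0, 3, int(0.05 * n)))))[1]
         for _ in range(reps)]
print("contaminated data, rejection rate:", np.mean(np.array(p_mix) < 0.05))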

0 Answers