28

I have read somewhere in the literature that the Shapiro–Wilk test is considered to be the best normality test because for a given significance level, $\alpha$, the probability of rejecting the null hypothesis if it's false is higher than in the case of the other normality tests.

Could you please explain to me, using mathematical arguments if possible, how exactly it works compared to some of the other normality tests (say the Anderson–Darling test)?

Silverfish
syntagma
  • 5
    Note that the power depends on the way in which the null hypothesis is false, which for a general-purpose goodness-of-fit test can be any of innumerable ways. Without having checked I'd still bet that each of the common normality tests is most powerful against certain alternatives. – Scortchi - Reinstate Monica Mar 20 '14 at 11:34
  • 7
    Not the answer you seek, perhaps, but I'd say that the best normality test is a normal probability plot, i.e. a quantile-quantile plot of observed values versus normal quantiles. The Shapiro-Wilk test is indeed often commended, but it can't tell you exactly how your data differ from a normal. Often unimportant differences are flagged by the test, because they do qualify as significant for large sample sizes, and the opposite problem can also bite you. – Nick Cox Mar 20 '14 at 11:51

4 Answers

21

First a general comment: Note that the Anderson-Darling test is for completely specified distributions, while the Shapiro-Wilk is for normals with any mean and variance. However, as noted in D'Agostino & Stephens$^{[1]}$, the Anderson-Darling adapts in a very convenient way to the estimation case, akin to (but converging faster, and modified in a way that's simpler to deal with than) the Lilliefors test for the Kolmogorov-Smirnov case. Specifically, at the normal, by $n=5$, tables of the asymptotic value of $A^*=A^2\left(1+\frac{4}{n}-\frac{25}{n^2}\right)$ may be used (don't be testing goodness of fit for $n<5$).

> I have read somewhere in the literature that the Shapiro–Wilk test is considered to be the best normality test because for a given significance level, α, the probability of rejecting the null hypothesis if it's false is higher than in the case of the other normality tests.

As a general statement this is false.

Which normality tests are "better" depends on which classes of alternatives you're interested in. One reason the Shapiro-Wilk is popular is that it tends to have very good power under a broad range of useful alternatives. It comes up in many studies of power, and usually performs very well, but it's not universally best.

It's quite easy to find alternatives under which it's less powerful.

For example, against light-tailed alternatives it often has less power than the studentized range $u=\frac{\max(x)-\min(x)}{\mathrm{sd}(x)}$ (compare them on a test of normality on uniform data, for example: at $n=30$, a test based on $u$ has power of about 63% compared to a bit over 38% for the Shapiro-Wilk).
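For a rough check of those numbers, here's a small pure-stdlib Python simulation; calibrating the critical values for $u$ by Monte Carlo (rather than using the published tables) is a convenience of this sketch, not part of the original test:

```python
import random
import statistics

rng = random.Random(12345)
n, alpha = 30, 0.05
n_null, n_alt = 4000, 2000

def u_stat(x):
    # studentized range: (max - min) / sample standard deviation
    return (max(x) - min(x)) / statistics.stdev(x)

# Calibrate two-sided critical values for u under the normal null by simulation
null_u = sorted(u_stat([rng.gauss(0, 1) for _ in range(n)]) for _ in range(n_null))
u_lo = null_u[int(n_null * alpha / 2)]
u_hi = null_u[int(n_null * (1 - alpha / 2)) - 1]

# Estimate power against the uniform, a light-tailed alternative
rejections = sum(
    not (u_lo <= u_stat([rng.uniform(0, 1) for _ in range(n)]) <= u_hi)
    for _ in range(n_alt)
)
power_u = rejections / n_alt
print(power_u)  # in the vicinity of the ~63% figure quoted above
```

The same harness, with a Shapiro-Wilk implementation substituted for `u_stat`, reproduces the lower power figure for that test.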

The Anderson-Darling (adjusted for parameter estimation) does better at the double exponential. Moment-skewness does better against some skew alternatives.

> Could you please explain to me, using mathematical arguments if possible, how exactly it works compared to some of the other normality tests (say the Anderson–Darling test)?

I will explain in general terms (if you want more specific details the original papers and some of the later papers that discuss them would be your best bet):

Consider a simpler but closely related test, the Shapiro-Francia; it's effectively a function of the correlation between the order statistics and the expected order statistics under normality (and as such, a pretty direct measure of "how straight the line is" in the normal Q-Q plot). As I recall, the Shapiro-Wilk is more powerful because it also takes into account the covariances between the order statistics, producing the best linear estimate of $\sigma$ from the Q-Q plot, which is then compared with $s$; when the distribution is far from normal, that ratio isn't close to 1.
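To make the "how straight is the line" idea concrete, here's a pure-stdlib Python sketch of the Shapiro-Francia statistic; the Blom plotting positions and the sample sizes are illustrative choices on my part, not the tests' published constants:

```python
import random
from statistics import NormalDist

rng = random.Random(42)
n = 200
# Blom approximation to the expected normal order statistics
# (the x-coordinates of a normal Q-Q plot)
m = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def w_prime(x):
    """Shapiro-Francia W': squared correlation between the sorted sample
    and the (approximate) expected normal order statistics."""
    xs = sorted(x)
    xbar = sum(xs) / len(xs)
    num = sum((a - xbar) * b for a, b in zip(xs, m))  # Blom scores have mean 0
    den = (sum((a - xbar) ** 2 for a in xs) * sum(b * b for b in m)) ** 0.5
    return (num / den) ** 2

normal_sample = [rng.gauss(0, 1) for _ in range(n)]
expo_sample = [rng.expovariate(1.0) for _ in range(n)]
print(w_prime(normal_sample))  # close to 1: the Q-Q plot is nearly straight
print(w_prime(expo_sample))    # well below 1: a skewed alternative bends the plot
```

A low $W'$ is exactly a bent Q-Q plot; the Shapiro-Wilk refines this by weighting with the order-statistic covariances.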

By comparison the Anderson-Darling, like the Kolmogorov-Smirnov and the Cramér-von Mises, is based on the empirical CDF. Specifically, it's based on weighted deviations between the empirical CDF and the theoretical CDF (the weighting-for-variance makes it more sensitive to deviations in the tails).
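A bare-bones illustration of that ECDF-based construction (pure-stdlib Python; the finite-sample adjustment is the $A^*$ formula quoted at the top of this answer, while the averaging over simulated samples under the null and under a double-exponential alternative is my own quick sketch):

```python
import math
import random
from statistics import NormalDist

rng = random.Random(7)

def anderson_darling_normal(x):
    """A* for normality with mean and sd estimated from the sample:
    A^2 from the weighted-ECDF formula, times (1 + 4/n - 25/n^2)."""
    n = len(x)
    xbar = sum(x) / n
    s = (sum((xi - xbar) ** 2 for xi in x) / (n - 1)) ** 0.5
    dist = NormalDist(xbar, s)
    eps = 1e-12  # clamp to avoid log(0) for extreme observations
    z = sorted(min(max(dist.cdf(xi), eps), 1 - eps) for xi in x)
    a2 = -n - sum(
        (2 * i - 1) * (math.log(z[i - 1]) + math.log(1 - z[n - i]))
        for i in range(1, n + 1)
    ) / n
    return a2 * (1 + 4 / n - 25 / n ** 2)

def mean_astar(sampler, reps=200, n=100):
    # average A* over many simulated samples drawn from `sampler`
    return sum(anderson_darling_normal([sampler() for _ in range(n)])
               for _ in range(reps)) / reps

m_norm = mean_astar(lambda: rng.gauss(5, 2))                           # null: normal data
m_dexp = mean_astar(lambda: rng.expovariate(1) * rng.choice([-1, 1]))  # double exponential
print(m_norm, m_dexp)  # the heavier-tailed alternative inflates A* on average
```

The 5% point of $A^*$ with both parameters estimated is about 0.75 (from the tables in D'Agostino & Stephens), so the inflated average under the double exponential is what drives the power advantage mentioned above.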

The test by Chen and Shapiro$^{[2]}$ (1995) (based on normalized spacings between order statistics) often exhibits slightly more power than the Shapiro-Wilk (but not always); the two often perform very similarly.

--

Use the Shapiro-Wilk because it's often powerful, widely available, and many people are familiar with it (removing the need to explain in detail what it is if you use it in a paper) -- just don't use it under the illusion that it's "the best normality test". There isn't one best normality test.

[1]: D’Agostino, R. B. and Stephens, M. A. (1986)
Goodness of Fit Techniques,
Marcel Dekker, New York.

[2]: Chen, L. and Shapiro, S. (1995)
"An Alternative test for normality based on normalized spacings."
Journal of Statistical Computation and Simulation 53, 269-287.

Glen_b
  • My classmate told me: "If sample size > 50, you should use the Kolmogorov-Smirnov." Is that correct? – kittygirl Apr 04 '19 at 03:18
  • 1
    No. To my recollection the original 1965 paper by Shapiro and Wilk only gave the required constants ($a_i$) used in the linear estimate of $\sigma$ for $n$ up to $50$, but that was over half a century ago. Things have moved on a little since then. Even without that, the Shapiro-Francia or the Anderson-Darling (also adjusted for parameter estimation) are usually better choices; the Kolmogorov-Smirnov often has considerably lower power against typically interesting alternatives. (And if you're estimating the mean and sd from the sample, you're not strictly doing a Kolmogorov-Smirnov test, but rather a Lilliefors test.) – Glen_b Apr 04 '19 at 04:20
  • In short, there was a brief period of a few years post 1967 (the initial publication of Lilliefors' work) where it might have been a justifiable piece of advice, but not for a long time since – Glen_b Apr 04 '19 at 04:24
  • When sample size > 5000, running shapiro.test in R gives the error "sample size must be between 3 and 5000". What test should be used instead? – kittygirl Apr 04 '19 at 09:25
  • 1. At large n you'll almost always reject any simple distributional model (even when it's quite a suitable approximation); it may be more advisable to do something else (why are you testing normality?). 2. It's not really a matter of "should"; there's no single goodness-of-fit test that's always better than any other. It just happens that the Shapiro-Wilk is reasonably good. However, a suitable alternative at large n is the Shapiro-Francia test. If you can find an implementation of the Chen-Shapiro test at large n (assuming there's a good reason to test at all), consider that instead. – Glen_b Apr 04 '19 at 12:01
  • Testing normality here is about testing IQ score norms. If my IQ score data (n = 20k) are not normally distributed, even with skewness != 0, I would expose psychologists as using artificial normalization (not the real data) to measure people. – kittygirl Apr 04 '19 at 12:48
  • I don't follow what you're saying, but if IQ scores can't be negative they couldn't actually be normally distributed (though they may be quite close). By the same token, if they can't be negative but do have some probability of exceeding 200, then they cannot be symmetrically distributed. We don't need data for that. If we know before we look that the true distribution cannot be exactly normal (only approximately normal), what would the test tell us that we didn't already know? With such a large n a rejection would occur even with very small deviations from normality, but so what? – Glen_b Apr 04 '19 at 14:30