12

I was reading web pages about goodness-of-fit tests when I came across the Anderson–Darling test and the Cramér–von Mises criterion.

So far I get the gist: the Anderson–Darling test and the Cramér–von Mises criterion seem similar, just based on a different weighting function $w$. There is also a variant of the Cramér–von Mises criterion called the Watson test.
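(For concreteness, the standard way these EDF statistics are written makes the role of $w$ explicit:

$$Q = n \int_{-\infty}^{\infty} \bigl(F_n(x) - F(x)\bigr)^2 \, w\bigl(F(x)\bigr) \, dF(x)$$

with $w(u) = 1$ giving the Cramér–von Mises statistic and $w(u) = \dfrac{1}{u(1-u)}$ giving the Anderson–Darling statistic, which up-weights the tails. The Watson statistic is the Cramér–von Mises statistic with the mean deviation subtracted out, $U^2 = n \int \bigl(F_n - F - \int (F_n - F)\,dF\bigr)^2 dF$, which makes it invariant to the choice of origin on a circle.)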

Basically I have two questions here:

  1. There are not many Google results about these two methods; are they still state of the art, or have they already been replaced by better approaches?

    It's a bit of a surprise, because according to this paper on power comparisons of the Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling tests, AD performs quite well: always better than Lilliefors and KS, and very close to SW, which is specifically designed for the normal distribution.

  2. What is the confidence interval for such tests?

    For the AD, CvM and Watson tests, I saw the test statistic defined on the wiki pages, but didn't find a confidence interval.

    Things are more straightforward for the KS test: on its wiki page, the confidence interval is defined by $K_\alpha$, which comes from the cumulative distribution function of $K$.
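To make the KS case concrete, here is a minimal sketch (assuming scipy is available) of how $K_\alpha$ translates into a confidence band for the true CDF: with probability $1-\alpha$, the true CDF lies within $K_\alpha/\sqrt{n}$ of the empirical CDF everywhere. The sample here is arbitrary illustrative data.

```python
import numpy as np
from scipy import stats

# Arbitrary illustrative sample
rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=200))
n = len(x)

# Asymptotic critical value K_alpha from the Kolmogorov distribution
alpha = 0.05
K_alpha = stats.kstwobign.ppf(1 - alpha)  # about 1.358 for alpha = 0.05

# Empirical CDF and a (1 - alpha) simultaneous confidence band for the
# true CDF: F_n(x) +/- K_alpha / sqrt(n), clipped to [0, 1]
ecdf = np.arange(1, n + 1) / n
lower = np.clip(ecdf - K_alpha / np.sqrt(n), 0.0, 1.0)
upper = np.clip(ecdf + K_alpha / np.sqrt(n), 0.0, 1.0)
```

The band is simultaneous (it covers the whole CDF at once), which is exactly why the KS statistic, being a supremum of deviations, inverts so cleanly into an interval.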

athos
  • 421

2 Answers

8

There can be no single state-of-the-art for goodness of fit (for example no UMP test across general alternatives will exist, and really nothing even comes close -- even highly regarded omnibus tests have terrible power in some situations).

In general when selecting a test statistic you choose the kinds of deviation that it's most important to detect and use a test statistic that is good at that job. Some tests do very well at a wide variety of interesting alternatives, making them decent default choices, but that doesn't make them "state of the art".

The Anderson–Darling test is still very popular, and with good reason. The Cramér–von Mises test is much less used these days (to my surprise, because it's usually better than the Kolmogorov–Smirnov test yet simpler than the Anderson–Darling -- and often has better power than it against differences "in the middle" of the distribution).

All of these tests suffer from bias against some kinds of alternatives, and it's easy to find cases where the Anderson-Darling does much worse (terribly, really) than the other tests. (As I suggest, it's more 'horses for courses' than one test to rule them all). There's often little consideration given to this issue (what's best at picking up the deviations that matter the most to me?), unfortunately.
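The "horses for courses" point can be seen directly in a rough Monte Carlo power sketch. Everything below is my own arbitrary setup (sample size, replication count, and the two alternatives); also, standardizing the sample before KS and CvM makes those p-values only approximate (the Lilliefors issue), which is acceptable for a rough illustration but not for serious power work.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power(sampler, test, n=50, reps=200, alpha=0.05):
    """Fraction of Monte Carlo samples in which `test` rejects normality."""
    return sum(test(sampler(n)) < alpha for _ in range(reps)) / reps

# Two different departures from normality: heavy tails vs. skewness
heavy = lambda n: rng.standard_t(df=3, size=n)
skewed = lambda n: rng.exponential(size=n) - 1.0

# p-value extractors; the sample is standardized first since the null
# parameters are unknown in practice (this makes KS/CvM p-values
# approximate -- see the caveat above)
def ks_p(x):
    z = (x - x.mean()) / x.std(ddof=1)
    return stats.kstest(z, "norm").pvalue

def cvm_p(x):
    z = (x - x.mean()) / x.std(ddof=1)
    return stats.cramervonmises(z, "norm").pvalue

def sw_p(x):
    return stats.shapiro(x).pvalue

results = {}
for name, sampler in [("t(3)", heavy), ("shifted exp", skewed)]:
    results[name] = {t.__name__: power(sampler, t) for t in (ks_p, cvm_p, sw_p)}
    print(name, results[name])
```

Running variations of this (different alternatives, different $n$) is a quick way to see that the ranking of the tests genuinely changes with the alternative, which is the whole point above.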

You may find some value in some of these posts:

Is Shapiro–Wilk the best normality test? Why might it be better than other tests like Anderson-Darling?

2 Sample Kolmogorov-Smirnov vs. Anderson-Darling vs Cramer-von-Mises (about two-sample tests, but many of the statements carry over)

Motivation for Kolmogorov distance between distributions (more theoretical discussion but there are several important points about practical implications)


I don't think you'll be able to form a confidence interval for the cdf from the Cramér–von Mises and Anderson–Darling statistics, because those criteria are based on all of the deviations rather than just the largest.

Glen_b
  • 282,281
  • I took "state of the art" to mean something that finds use that is not obsolescent. The existence of multiple goodness-of-fit definitions should signal to us that goodness-of-fit is not a single concept. Consider that "good" depends on "why" we are performing regression. Suppose we are fitting Model A to data B to obtain a best predictor of effect C. Then "good" is the best predictor of C not B. However, most often the question of how B and C differ is ignored. – Carl Aug 22 '16 at 17:12
  • 1
    @Carl you may want to check a dictionary (or wikipedia) on what state of the art is usually taken to mean -- your interpretation of the phrase is not how most people read the phrase. Dictionaries say things like this: "the most recent stage in development, incorporating the newest ideas" and "the highest level of development at a given time" and "cutting edge, using the latest technology". In this context -- testing goodness of fit -- the phrase implies "the best we can possibly do right now". I insist that's not something you can really say about any single test. ... ctd – Glen_b Aug 22 '16 at 21:44
  • 2
    ... e.g. We can say that popular tests like the Shapiro-Wilk (while very popular in testing normality) have competitors with broadly better power (e.g. see Shapiro&Chen 1995)-- but not in every situation. There's no single best choice of test (and hence, no actual 'state of the art'). Certainly I agree that what is best (state of the art) depends on circumstances --- that's the point of my answer; the possible answers are myriad - something good in one situation may be very poor in another. It pays to know when tests perform well rather than ask for "what's best" as if it were a single thing. – Glen_b Aug 22 '16 at 21:47
  • True, your definition is more correct. However, there are many more methods than tests of methods, and the "state of the art" is largely fiction, i.e., the "art" has no "state" all it has are protagonists. Any response to such a nebulous posit is equivocal. I said 'yes' and you said 'no' and we both said the same thing. – Carl Aug 23 '16 at 01:47
  • BTW, the question was "state of the art" or "replaced" which I took to mean "obsolete, or not obsolete". So there was a context for my answer which context was "Please assume that 'state of the art' and 'replace' are antonyms, and please choose one of those." You are correct that those are not antonyms, I was answering in context and you chose to beg the question. So, mine was the polite answer. And, I am going to vote for your answer, because I think it informative, if not overly polite. – Carl Aug 24 '16 at 02:04
  • A little late to respond, but in any case, to clarify -- to "beg the question" is to assume what is to be proved. What I did was deny the premise. I'll wear the accusation of being impolite. – Glen_b Jun 02 '23 at 00:34
  • Good grief. A state can have numerous elements as in thermodynamic state and so too can the state of the art. Each element of the state of the art could have a different date of first application and the criterion is whether it is in current use, or not how splendorous or novel it is. True, people reach for the hyperbolic when they use the phrase, but that is connotative not intrinsic. – Carl Jun 02 '23 at 01:06
3

The Anderson-Darling test is not available for all distributions, but its power is good and close to that of the Shapiro-Wilk test except for small sample sizes, with the two becoming equivalent at $n=400$ (Razali NM, Wah YB. Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. Journal of Statistical Modeling and Analytics. 2011;2:21-33). However, the Shapiro-Wilk test applies only to testing normality. The Cramér–von Mises test and the Pearson chi-squared test are general for fits of all distributions to histograms, and I think the Cramér–von Mises test has more power than Pearson chi-squared. The Cramér–von Mises test is often a more powerful cumulative-distribution-function goodness-of-fit test than the Kolmogorov-Smirnov test for a broad class of alternative hypotheses (see 1.), and it can have power greater or less than $t$-testing. Chi-squared has difficulty with low cell counts, so range restrictions are used for fitting the tails.
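The low-cell-count point for chi-squared can be sketched concretely. One common remedy (an example of the range restriction mentioned above) is to choose equiprobable cells under the null, so the unbounded tails become the outermost cells and every expected count stays comfortably large. The cell count and sample here are arbitrary choices of mine:

```python
import numpy as np
from scipy import stats

# Illustrative sample; null hypothesis is a fully specified N(0, 1)
rng = np.random.default_rng(7)
x = rng.normal(size=300)

# 10 equiprobable cells under N(0, 1): interior edges at the 10%, ...,
# 90% quantiles, with the two unbounded tails as the outermost cells.
# Every expected count is n/10 = 30, well away from low-count trouble.
interior_edges = stats.norm.ppf(np.linspace(0.1, 0.9, 9))
observed = np.bincount(np.searchsorted(interior_edges, x), minlength=10)
expected = np.full(10, len(x) / 10)

chi2_stat, p_value = stats.chisquare(observed, expected)
print(chi2_stat, p_value)
```

With fixed-width bins instead, the tail cells would have expected counts near zero and the chi-squared approximation to the statistic's null distribution would break down; equiprobable binning (or merging sparse tail cells) is the standard workaround.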

**Question 1: ... are ... these two methods ... still state-of-the-art? or replaced by some better approaches already? Question 2: What is the confidence interval for such tests?**

Answer: They are state of the art. However, sometimes we want confidence intervals rather than probabilities. When comparing these methods to each other we speak of power rather than confidence intervals. Sometimes goodness of fit is analyzed using AIC, BIC and other criteria, as contrasted with probabilities of good fitting, and sometimes the goodness-of-fit criterion is irrelevant, for example when goodness of fit is not the criterion for fitting. In the latter case, our regression target may be a physical quantity not related to fitting; e.g., see Tk-GV.

(1) See, e.g., Section 5 in Stephens (1974) or page 110 in D’Agostino and Stephens (1986); observe this is only empirical evidence for certain scenarios.

Stephens, M. A. 1974. “EDF Statistics for Goodness of Fit and Some Comparisons.” Journal of the American Statistical Association 69 (347): 730–37. https://doi.org/10.2307/2286009.

D’Agostino, R. B., and M. A. Stephens, eds. 1986. Goodness-of-Fit Techniques. Vol. 68. Statistics: Textbooks and Monographs. New York: Marcel Dekker. https://www.routledge.com/Goodness-of-Fit-Techniques/DAgostino/p/book/9780367580346.

Carl
  • 13,084
  • 2
    NB The Anderson-Darling test is a weighted version of the Cramer-von Mises test; &, like it, suitable for any continuous distribution. – Scortchi - Reinstate Monica Aug 28 '16 at 18:06
  • 1
    "The Cramér–von Mises test is a more powerful cumulative density function goodness-of-fit test than the Kolmogorov-Smirnov test" is incorrect. One is more powerful against some alternatives, the other against other alternatives. – Richard Hardy Jun 22 '23 at 17:59
  • @RichardHardy Thanks for your input, Richard. Can you provide some guidance as to when which is better? – Carl Jun 22 '23 at 19:04
  • @Carl, this is a very broad question and I do not have a brief answer to it. I am aware of some related threads on this site and some textbook treatment; I provide links here. But there is more out there in the wild. Normally, if I want to find examples of which test does best when, I look at the test statistic and see what triggers it, then construct an example. Different test statistics get triggered by different things, so these examples should illustrate which test is more powerful when. – Richard Hardy Jun 23 '23 at 09:02
  • 1
    @RichardHardy Agreed. However, the information concerning tests that only apply to specific distributions and those that are more general is not in question. Certain general rules also apply, for example, a nonparametric test may apply to more distribution types at the cost of some sensitivity for each of them. So, it is not as if looking at the algorithms themselves was a waste of time, in fact, the devil in the details includes not only the data, but the algorithms themselves. – Carl Jun 23 '23 at 15:04
  • cont'd: The argument that we need to test specific circumstances for test applicability may be too vague to provide good guidance. For example, non-normality for similar datasets may be variable, with some testing as normal and others, obtained identically, as non-normal. That leaves us with only vague indications of when to use what, and I would rather say something vague, as above, than say nothing because of fear of being incorrect. There are no right answers in statistics, just educated guesses. – Carl Jun 23 '23 at 15:26
  • "I would rather say something vague, as above, than say nothing because of fear of being incorrect." In my view, you said something very concrete (the bit I cited above) rather than vague. Otherwise, I would not have been able to identify the mistake so easily. And I would take vague and correct instead of concrete and incorrect any day. – Richard Hardy Jun 23 '23 at 16:13
  • @RichardHardy OK, made some changes to text with more references. I have changed "more" to "often more" and put in three references in a chain to that effect. This now agrees with what you said, even though I have never seen it myself. – Carl Jun 23 '23 at 17:45
  • 1
    I think this has improved your answer. Thank you for taking the time! – Richard Hardy Jun 23 '23 at 18:14
  • @RichardHardy Actually, thank-you for your patience and attention to detail. I vastly prefer your having contributed the time, effort and courtesy of formulating a cautionary note to a downvote, as the latter does nothing to improve the answer and the former can. Would that more contributors were like you. – Carl Jun 23 '23 at 18:38
  • @Carl, I appreciate your kind words very much! – Richard Hardy Jun 23 '23 at 18:57