Why do we need a normality test if the sample size is large enough that, by the central limit theorem, the distribution of the sample mean is approximately normal?
-
Sometimes the sample size is so small that the asymptotic results of the CLT are not very helpful. (In which case, one should probably be careful about any parametric analysis and result, even if the residuals appear normal.) – Stephan Kolassa Mar 08 '23 at 09:27
-
Indeed. This comment probably is not to imply that nonparametric analyses would be preferable in such tiny samples? – Christoph Hanck Mar 08 '23 at 10:43
-
At what point -- at what sample size -- do you suppose the CLT gives you any reliable information in general? You might find https://stats.stackexchange.com/questions/69898 to be a useful case to consider. – whuber Mar 08 '23 at 18:16
1 Answer
1. The CLT certainly doesn't solve all problems. For example:
(a) There are distributions for which the CLT doesn't hold.
Here's an example:
This density is a mixture of a symmetric 4-parameter beta and a $t_2$. There's a normal distribution that looks visually pretty close to it (e.g. in the way that a Kolmogorov-Smirnov statistic measures distance, largest absolute difference in cdf), but this distribution does not have a finite variance, and the ordinary central limit theorem fails to hold for this seemingly unremarkable-looking case.
(While it is close in the stated sense to a normal distribution, I did not spend time getting it as close as possible; there are examples that look even closer to a normal, indeed it can be as close as you like.)
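Point (a) can be checked numerically. The following is a minimal sketch, not the mixture density described above: it assumes the simplest textbook stand-in, a $t$ distribution with 1 degree of freedom (the standard Cauchy), for which the mean of $n$ observations has exactly the same distribution as a single observation. The function name `cauchy_mean_iqr`, the seed, and the replication count are illustrative choices. If the CLT applied, the interquartile range of the sample means would shrink like $1/\sqrt{n}$; here it stays near 2 (the quartiles of the standard Cauchy are $\pm 1$) however large $n$ gets.

```python
import math
import random
import statistics


def cauchy_mean_iqr(n, reps, rng):
    """Interquartile range of `reps` sample means, each from n standard Cauchy draws."""
    means = []
    for _ in range(reps):
        # Inverse-CDF sampling: tan(pi * (U - 0.5)) is standard Cauchy for U ~ Uniform(0, 1).
        total = sum(math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n))
        means.append(total / n)
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


rng = random.Random(0)  # fixed seed so the run is reproducible
iqr_by_n = {n: cauchy_mean_iqr(n, reps=2000, rng=rng) for n in (1, 10, 100)}
for n, iqr in iqr_by_n.items():
    # The spread of the sample mean does not decrease as n grows.
    print(f"n = {n:>3}: IQR of sample means = {iqr:.2f}")
```

Averaging does nothing to tame the tails here, so no amount of data makes the sample mean approximately normal.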
(b) There are distributions for which the CLT does hold, but for which even averages of a million observations are not really close to a normal.
There's an example discussed here, a lognormal distribution with sufficiently large shape parameter ($\sigma$).
That number ($n= 10^6$) can be pushed up, beyond any fixed value. That is, there's really no "large enough" that's sufficient to make the distribution of standardized sums or means 'close to normal' for every distribution among the set of distributions for which the CLT does hold.
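A back-of-the-envelope check of point (b), under stated assumptions: the skewness of a mean of $n$ i.i.d. observations equals the population skewness divided by $\sqrt{n}$, and a lognormal with shape $\sigma$ has skewness $(e^{\sigma^2}+2)\sqrt{e^{\sigma^2}-1}$. The value $\sigma = 3$ below is an illustrative choice, not necessarily the one used in the linked example, and the function names are hypothetical.

```python
import math


def lognormal_skewness(sigma):
    """Population skewness of a lognormal distribution with shape parameter sigma."""
    w = math.exp(sigma ** 2)
    return (w + 2.0) * math.sqrt(w - 1.0)


def skewness_of_mean(sigma, n):
    """Skewness of the mean of n i.i.d. lognormal draws: population skewness / sqrt(n)."""
    return lognormal_skewness(sigma) / math.sqrt(n)


# sigma = 3 is an assumed, illustrative value.
skew_million = skewness_of_mean(sigma=3.0, n=10**6)
print(f"skewness of the mean of 10^6 observations: {skew_million:.0f}")
```

With $\sigma = 3$ the skewness of the mean of a million observations is still in the hundreds, whereas a normal distribution has skewness 0. Increasing $\sigma$ pushes the required $n$ higher without bound.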
(c) There are tests that assume normality but which do not involve means. In those cases the CLT isn't necessarily of any direct relevance.
2. None of this is an argument for using formal tests of assumptions. Such a test doesn't really answer the right question.
There's a nice discussion of that point (in relation to testing normality) in Harvey Motulsky's answer here. Much more could be said about testing but that's perhaps not the direct issue here, so I won't labor the point further.
-
Thanks. But the CLT says that, regardless of the shape of the data, if the sample size is large enough, the distribution of the sample mean is approximately normal. Hence, I understood that it works for all distributions! – Alice Mar 08 '23 at 11:07
-
1. You have simply been misled about what the central limit theorem says; if they got that wrong, you should worry about what else they got wrong. Was that from a book? 2. An example of a distribution for which it does not apply would be a t-distribution with very small degrees of freedom (any d.f. not exceeding 2). There are many other distributions (infinitely many) for which it doesn't apply. – Glen_b Mar 08 '23 at 11:09
-
@Alice Even if the "all distributions" claim were true (it really isn't), (a) what do you do when the sample size isn't large? (b) how do you know when you have a sample that's "large enough"? – Glen_b Mar 08 '23 at 11:16
-
@Glen_b Yes, the statement I mentioned was from a book. The title of the book is "Business Statistics: A Decision-Making Approach", 10th edition. – Alice Mar 08 '23 at 11:40
-
Thanks, I managed to get Google Books to show me a snippet ... it looks like you correctly report what the book says. It's simply wrong. I haven't seen this specific book but there are many books with either exactly that error or one very similar to it. I believe that the books in print (in terms of total copies) that discuss the central limit theorem incorrectly greatly outnumber the ones that get it right. – Glen_b Mar 08 '23 at 11:51
-
Yes. You are correct. Unfortunately, these books (which misstate the CLT) are required textbooks at some universities. – Alice Mar 08 '23 at 13:23
-
Within some kinds of academic disciplines, it happens across almost all universities, sadly. The standard in actual stats departments is usually a bit better, fortunately, but sometimes even there you can strike this issue, at least if they teach nonmathematical versions of intro courses. It's rare for a book to get this thing wrong on its own; there's a panoply of errors that tend to arrive together as received wisdom, though there's a degree of variation across academic disciplines as to what things they tend to be. – Glen_b Mar 09 '23 at 00:35
-
As a general rule, books on statistics that include specialties in their titles -- "Business statistics," "Statistics for engineers," etc. -- should be avoided when searching for definitive or clear answers to any statistics question. They tend to be written by people unattuned either to the subtleties of basic definitions and procedures or to the nuances of how to teach them well. Even when such a book turns out to be expertly written (there are many), its narrow focus makes it a poor authority concerning basic definitions and concepts. – whuber Mar 09 '23 at 14:10
-
What gets my goat is that there are applied data science master's programs around the US that feel that the CLT is worth emphasizing in their few data analysis courses! I can't understand this for the life of me. For more about the lack of applicability of the CLT in everyday life see here – Frank Harrell Mar 09 '23 at 23:26
