
According to CLT, randomly selecting values from a distribution will result in a convergence towards a normal distribution. Does this mean that we can never figure out whether the underlying data is normal without seeing the full dataset?

As an example, consider a dataset like:

[image: histogram of an example bimodal dataset]

Could sampling ever allow us to figure out the original bimodal distribution?

  • The central limit theorem says the mean converges to normal, not the individual values. – Alex J Oct 13 '23 at 02:37
  • A parent distribution can be arbitrarily close to normal while not being normal.

    At any given sample size there will be non-normal distributions that you cannot distinguish from normality.

    The distribution of the mean of i.i.d. random variables can in turn be arbitrarily close to normal while not being normal but ... where are you getting a distribution of means from unless you have many samples?

    In any case, even if you did have many samples, the same problem arises.

    The big question is ... what are you trying to do, exactly? What's this for?

    – Glen_b Oct 13 '23 at 02:43
  • @Glen_b updated with a graph – JonathanReez Oct 13 '23 at 02:47
  • Does this answer your question? "Debunking wrong CLT statement" — your statement about the CLT seems to be precisely what is debunked in the answers to that question. – Dave Oct 13 '23 at 03:13
  • @Dave I don't think so? – JonathanReez Oct 13 '23 at 03:19
  • What does the CLT say? – Glen_b Oct 13 '23 at 05:10
  • Practical sampling does not sound compatible with checking asymptotics... – Xi'an Oct 13 '23 at 12:43

2 Answers


Technical conditions aside, the Glivenko-Cantelli theorem says that empirical distributions (data) converge to their population distributions. Therefore, sampling a huge number of points is precisely how we can start to have confidence that a distribution like the one you’ve given is bimodal instead of normal.
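A quick numerical sketch of this idea, using a hypothetical equal mixture of N(−3, 1) and N(3, 1) as the bimodal population (the actual data were not posted, so this mixture is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bimodal(n, rng):
    # Hypothetical bimodal population: equal mixture of N(-3, 1) and N(3, 1).
    modes = rng.choice([-3.0, 3.0], size=n)
    return rng.normal(loc=modes, scale=1.0)

# With enough raw draws, the empirical distribution exposes both modes:
# the histogram bin between the modes stays nearly empty, which no single
# normal distribution fitted to the same data could produce.
x = sample_bimodal(100_000, rng)
counts, _ = np.histogram(x, bins=50, range=(-7.0, 7.0))
trough = counts[len(counts) // 2]  # bin straddling 0, between the two modes
peak = counts.max()                # bin at one of the modes
print(trough, peak)                # the trough is a small fraction of the peak
```

The point is exactly Glivenko-Cantelli: the empirical distribution of the raw values, not of any means, converges to the population distribution, so a large enough sample reveals the two modes.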

You seem to have the same extremely common misconception about the central limit theorem that I once had; I found Ben’s answer to be the most helpful (especially the end of the first dialogue), but the others are worth reading, too. However, the central limit theorem concerns the distribution of the sample means (transformed in a particular way), not the distribution of the original values, which is the subject of the Glivenko-Cantelli theorem.

In other words, the first sentence of the question (quoted below) is not correct, as it contradicts the Glivenko-Cantelli theorem (unless the original distribution is normal).

According to CLT, randomly selecting values from a distribution will result in a convergence towards a normal distribution.

– Dave

I've figured this out using some modeling in Python. Let's say we have a bimodal distribution:

[image: histogram of the simulated bimodal distribution]

We can then draw 100 samples each of size 2, 3, 5, 30, and 50:

[image: histograms of the pooled sample values for each sample size]

So we're still seeing the bimodal pattern, and my understanding of the CLT was incorrect: it's the distribution of the sample means that converges to normal, not the distribution of the sampled values:

[image: histograms of the sample means for each sample size]
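The simulation described above can be sketched roughly as follows (a minimal reconstruction — the original code was not posted, and an equal mixture of N(−3, 1) and N(3, 1) stands in for the bimodal population):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bimodal(shape, rng):
    # Stand-in bimodal population: equal mixture of N(-3, 1) and N(3, 1).
    modes = rng.choice([-3.0, 3.0], size=shape)
    return rng.normal(loc=modes, scale=1.0)

results = {}
for size in (2, 3, 5, 30, 50):
    draws = sample_bimodal((100, size), rng)  # 100 samples of each size
    pooled = draws.ravel()                    # pooling the raw values...
    means = draws.mean(axis=1)                # ...vs taking each sample's mean
    # The pooled values keep the empty gap between the modes; the means
    # concentrate around the population mean (0) as the sample size grows.
    results[size] = (np.mean(np.abs(pooled) < 1), np.mean(np.abs(means) < 1))

pooled_frac, means_frac = results[50]
print(pooled_frac, means_frac)  # few pooled values near 0; most means near 0
```

Plotting `pooled` gives the first set of histograms (still bimodal at every size) and plotting `means` gives the second set (increasingly normal-looking as the size grows).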

  • "We can then do 100 samples of size 2, 3, 5, 30 and 50" — what do you mean by this? Perhaps you could share the code. – Dave Oct 13 '23 at 13:38
  • @Dave it means you sample the data 2, 3, 5, 10 samples at a time, repeating the process 1000 times and plotting the means. The code is quite trivial. – JonathanReez Oct 13 '23 at 14:25
  • Then what's different in the second set of plots? – Dave Oct 13 '23 at 14:26
  • First set of plots is the aggregation of samples. Second set is the aggregation of means. – JonathanReez Oct 13 '23 at 14:49
  • Is it fair to say that the first plot is a sample of size 200, the second of a sample of size 300, the third a sample of size 500, the fourth a sample of size 3000, and the fifth a sample of size 5000? – Dave Oct 13 '23 at 14:51