
This question is related to an earlier question.

I am interested in the process described below. Out of curiosity, I wonder whether it is a good approach, and the purpose of this question is to discuss it; none of the previously linked threads answers this question.

We have a dataset of $60'000$ observations and need to check that they come from a normal distribution. As is well known, it is inadvisable to run the test on all $60'000$ observations: such a large sample greatly increases the power of the test and leads to rejecting, with near certainty, the null hypothesis 'the data come from a normal distribution'.

What is usually done is to test a smaller set of data drawn randomly from the larger dataset. However, a single subset may not be representative enough, and since I have $60'000$ observations at hand, I used another approach to get a more accurate result.

I randomly drew, with replacement, $50$ subsets of data from the larger dataset, and for each of these subsets I performed a normality test at a significance level of $5\%$. In the end, no more than $5\%$ of the tests must reject in order to accept the null hypothesis.
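A minimal sketch of this procedure in Python (the subset size of $500$, the choice of the Shapiro–Wilk test, and the simulated stand-in data are illustrative assumptions; none of them is fixed by the description above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for the real dataset of 60,000 observations.
data = rng.normal(loc=0.0, scale=1.0, size=60_000)

n_subsets = 50
subset_size = 500   # assumed; the question does not specify the subset size
alpha = 0.05

failures = 0
for _ in range(n_subsets):
    subset = rng.choice(data, size=subset_size, replace=True)  # draw with replacement
    _, p_value = stats.shapiro(subset)  # Shapiro-Wilk normality test
    if p_value < alpha:
        failures += 1

# Decision rule from the question: accept normality if no more than 5% of tests reject.
print(f"{failures}/{n_subsets} tests rejected normality; "
      f"rule {'accepts' if failures / n_subsets <= alpha else 'rejects'} H0")
```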

Criticism of this way of testing for normality on a big dataset is welcome. I am asking myself whether it is useless or whether it actually gives a more accurate result.

I do not want to discuss why I am doing a normality test or whether it is useful in my situation. The goal here is just to think about the process I have described.

lulufofo
  • Why do you need the data to be normal? – Stephan Kolassa Mar 29 '23 at 08:23
  • See here: https://stats.stackexchange.com/q/2492/56940 – utobi Mar 29 '23 at 08:24
  • I have already read the post you mentioned, @utobi, but it does not answer my question. – lulufofo Mar 29 '23 at 08:27
  • As I said, I do not want to discuss whether testing for normality is useful. I am more concerned about the process described. – lulufofo Mar 29 '23 at 08:28
  • Close-voters: this is not a duplicate. OP is not asking about the best way of doing normality testing, but for comments on their specific approach to it. – Stephan Kolassa Mar 29 '23 at 10:04
  • I'm sorry, this is hard to answer well without knowing more about your case and what you want to learn from a normality test. You seem to think that using the full dataset will make the test too powerful, which suggests you are interested in departures from normality beyond some (unspecified) threshold. – mkt Mar 29 '23 at 10:13
  • I believe you can understand your procedure with a binomial test: if the failure probability is 5% on a single test, what is the probability of getting 5 or more failures out of 100? [Changed 50 to 100 so that 5% gives an integer; see the sketch after these comments.] – seanv507 Mar 29 '23 at 10:26
  • Using 100 tests instead of 50, I get that you would have a 56.4% chance of 5 or more tests failing out of 100. – seanv507 Mar 29 '23 at 10:34
  • @mkt, unless your data come from an exactly normal distribution (which is impossible for real-world data), the normality test will reject for a large enough sample size. This is because no real-world process is truly normal. – lulufofo Mar 29 '23 at 11:27
  • I agree that no real-world process is truly normal, which brings us back to the basic question: what are you trying to learn from this? Are you just testing if you can generate normally distributed random numbers? – mkt Mar 29 '23 at 12:45
  • However, your claim that the normality test will fail for data simulated from an exact normal distribution is incorrect, as one of the threads you linked to shows: https://stats.stackexchange.com/a/414403/121522 – mkt Mar 29 '23 at 12:46
  • Sorry, I did not mean that. I need the data to be normal in order to do a capability analysis with capability indices that are sensitive to departures from normality. – lulufofo Mar 29 '23 at 12:48
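
The binomial calculation from seanv507's comments can be checked directly (a sketch assuming the 100 tests are independent and each rejects with probability 5% under the null):

```python
from scipy import stats

# Under H0, the number of rejections among 100 independent tests at
# alpha = 0.05 is Binomial(n=100, p=0.05).
p_five_or_more = stats.binom.sf(4, n=100, p=0.05)  # P(X >= 5) = 1 - P(X <= 4)
print(f"P(5 or more rejections out of 100) = {p_five_or_more:.1%}")  # ~56.4%
```

So even for perfectly normal data, seeing 5 or more rejections out of 100 is more likely than not, which is worth bearing in mind when evaluating the decision rule.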

0 Answers