
This question is related to an earlier question.

I am interested in the process described below. Out of curiosity, I wonder whether it is a good approach, and the purpose of this question is to discuss it; none of the previously linked threads answers this question.

We have a dataset of $60'000$ observations and need to check that they come from a normal distribution. As is well known, it is inadvisable to run the test on all $60'000$ observations: such a large sample greatly increases the power of the test and leads to rejecting, with near certainty, the null hypothesis 'the data come from a normal distribution'.

What is usually done is to test a smaller set of data drawn randomly from the larger dataset. However, a single subset may not be representative enough, and since I have $60'000$ observations at hand, I used another approach to get a more accurate result.

I randomly drew, with replacement, $50$ subsets of data from the larger dataset, and for each of these subsets I performed a normality test at a significance level of $5\%$. In the end, no more than $5\%$ of the tests must reject in order to accept the null hypothesis.
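A minimal sketch of this procedure in Python (the subset size of $500$, the choice of the Shapiro–Wilk test, and the simulated stand-in data are illustrative assumptions; none of them is fixed by the description above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for the real dataset of 60,000 observations.
data = rng.normal(loc=0.0, scale=1.0, size=60_000)

n_subsets = 50
subset_size = 500   # assumed; the question does not specify the subset size
alpha = 0.05

failures = 0
for _ in range(n_subsets):
    subset = rng.choice(data, size=subset_size, replace=True)  # draw with replacement
    _, p_value = stats.shapiro(subset)  # Shapiro-Wilk normality test
    if p_value < alpha:
        failures += 1

# Decision rule from the question: accept normality if no more than 5% of tests reject.
print(f"{failures}/{n_subsets} tests rejected normality; "
      f"rule {'accepts' if failures / n_subsets <= alpha else 'rejects'} H0")
```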

Criticism of this way of testing for normality on a big dataset is welcome. I am asking myself whether it is useless or whether it actually gives a more accurate result.

I do not want to discuss why I am doing a normality test or whether it is useful in my situation. The goal here is just to think about the process I have described.

lulufofo
  • Why do you need the data to be normal? – Stephan Kolassa Mar 29 '23 at 08:23
  • See here: https://stats.stackexchange.com/q/2492/56940 – utobi Mar 29 '23 at 08:24
  • I have already read the post you mentioned, @utobi, but it does not answer my question. – lulufofo Mar 29 '23 at 08:27
  • As I said, I do not want to discuss whether testing for normality is useful. I am more concerned about the process described. – lulufofo Mar 29 '23 at 08:28
  • Close-voters: this is not a duplicate. OP is not asking about the best way of doing normality testing, but for comments on their specific approach to it. – Stephan Kolassa Mar 29 '23 at 10:04
  • I'm sorry, this is hard to answer well without knowing more about your case and what you want to learn from a normality test. You seem to think that using the full dataset will make the test too powerful, which suggests you are interested in departures from normality beyond some (unspecified) threshold. – mkt Mar 29 '23 at 10:13
  • I believe you can understand your procedure with a binomial test: if the failure probability is 5% on a single test, what is the probability of getting 5 or more failures out of 100? [Changed 50 to 100 so that 5% gives an integer; see the sketch after these comments.] – seanv507 Mar 29 '23 at 10:26
  • Using 100 tests instead of 50, I get that you would have a 56.4% chance of 5 or more tests failing out of 100. – seanv507 Mar 29 '23 at 10:34
  • @mkt, unless your data come from an exactly normal distribution (which is impossible for real-world data), the normality test will reject for a large enough sample size. This is because no real-world process is truly normal. – lulufofo Mar 29 '23 at 11:27
  • I agree that no real-world process is truly normal, which brings us back to the basic question: what are you trying to learn from this? Are you just testing if you can generate normally distributed random numbers? – mkt Mar 29 '23 at 12:45
  • However, your claim that the normality test will fail for data simulated from an exact normal distribution is incorrect, as one of the threads you linked to shows: https://stats.stackexchange.com/a/414403/121522 – mkt Mar 29 '23 at 12:46
  • Sorry, I did not mean that. I need the data to be normal in order to do a capability analysis with capability indices that are sensitive to departures from normality. – lulufofo Mar 29 '23 at 12:48
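
The binomial calculation from seanv507's comments can be checked directly (a sketch assuming the 100 tests are independent and each rejects with probability 5% under the null):

```python
from scipy import stats

# Under H0, the number of rejections among 100 independent tests at
# alpha = 0.05 is Binomial(n=100, p=0.05).
p_five_or_more = stats.binom.sf(4, n=100, p=0.05)  # P(X >= 5) = 1 - P(X <= 4)
print(f"P(5 or more rejections out of 100) = {p_five_or_more:.1%}")  # ~56.4%
```

So even for perfectly normal data, seeing 5 or more rejections out of 100 is more likely than not, which is worth bearing in mind when evaluating the decision rule.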

0 Answers