
Suppose we have a random variable $Y$ that generates data $(Y_1,...,Y_N)$ for large $N \in \mathbb{N}$, e.g. $N \geq 10'000$. The mean of the random variable $Y$ is $\mu = 0.52$ and the standard deviation is $\sigma = 0.002$.

We do not have access to the data $(Y_1,...,Y_N)$ directly; instead, a device measures the outcomes of $Y$ and gives us the data $(\tilde{Y_1},...,\tilde{Y_N})$, where each $\tilde{Y_i}$ is rounded to the third decimal place for $i = 1,...,N$. For example, if $Y_i = 0.5236$, then $\tilde{Y_i} = 0.524$.

The goal is to know whether the random variable $Y$ is normally distributed, for example by applying a normality test. The problem is that we can only run the normality test on the data $(\tilde{Y_1},...,\tilde{Y_N})$, which are no longer normally distributed: the rounding step ($0.001$) is 'too strong' relative to the standard deviation ($\sigma = 0.002$), so the transformation induces a strong deviation from normality.

I have found a Minitab article explaining that the Ryan-Joiner test can be an alternative and works well in the presence of rounding: https://blog.minitab.com/en/the-statistical-mentor/normality-tests-and-rounding.

However, it does not always work in my case. I have also run some simulations for $N = 30'000$ (which corresponds to my practical case) to illustrate what happens.

[Figure: overlaid histograms of the simulated data]

In red is $(Y_1,...,Y_N)$ when $Y$ follows a normal distribution, and in black is $\tilde{Y}$. As we can see, the distribution of $\tilde{Y}$ is roughly bell-shaped, but with a large peak.
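For reference, here is a minimal R sketch of the kind of simulation I ran (a reconstruction, not the exact code behind the figure; the seed and plotting choices are arbitrary):

set.seed(123)
N <- 30000
y  <- rnorm(N, mean = 0.52, sd = 0.002)  # the unobserved data
yt <- round(y, 3)                        # device output, rounded to 3 decimals

# Overlay the two histograms using common breaks
r <- range(c(y, yt))
breaks <- seq(r[1] - 0.001, r[2] + 0.001, by = 0.0001)
hist(y,  breaks = breaks, col = rgb(1, 0, 0, 0.5), main = "", xlab = "")
hist(yt, breaks = breaks, col = rgb(0, 0, 0, 0.5), add = TRUE)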

So I tried taking a subsample (of size 50 and 100) of $\tilde{Y}$ and running a normality test on it; this works in some cases. I was wondering if this is a good approach, because in any case a normality test on a sample of size $30'000$ can rarely be passed.
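The subsample check looks like this (a sketch, assuming yt from the simulation above; the rejection rate over many subsamples seems more informative than a single draw):

# Shapiro-Wilk p values over repeated random subsamples of size 50
pvals <- replicate(100, shapiro.test(sample(yt, 50))$p.value)
mean(pvals < 0.05)  # rejection rate at the 5% level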

Does anyone have an idea how to overcome this problem?

lulufofo
  • Unfortunately, there is no test that will tell you that your variable is normally distributed. Why is it important that the variable is normally distributed in the first place? See also: https://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless – COOLSerdash Mar 23 '23 at 08:14
  • Note that any formal significance test (if there is any, see COOLSerdash's comment) will pick up on even tiny violations of the null hypothesis if your sample is large enough, even if the violation is not "clinically" significant. – Stephan Kolassa Mar 23 '23 at 08:27
  • Thank you for the answers. Yes, of course, my sample size is huge and that is a problem. This is why I was wondering whether it is reasonable to check normality on a smaller sample. Also, I need normality to do a capability analysis via capability indices. – lulufofo Mar 23 '23 at 08:28

2 Answers

6

Here is an informal test you could run: collect your rounded observations $(\tilde{Y_1},...,\tilde{Y_N})$ and plot the "peaky" histogram as you did. Derive the mean and standard deviation from the rounded data.

Then: simulate normally distributed data using the estimated mean and SD. Round the simulated data. Plot the corresponding histogram.

Do the simulation 19 times, ending up with 19 simulated histograms. Lay these out in a $5\times 4$ grid, and insert the "true" histogram at a random spot.

Then go and ask people whether they can identify the "off" histogram. If they can't, this is at least some evidence that the deviation between the simulated data and the actual observations can't be "too strong".

This has been called an "inter-ocular trauma test for significance".

Below is an example. Can you spot which histogram comes from gamma distributed original data (which I intentionally generated to have a mean and standard deviation close to what you report for your data)? If so, is the deviation strong enough to pose any problems in the subsequent analysis you need this information for?

[Figure: 20 histograms laid out in a $5\times 4$ grid]

R code (with the solution):

nn <- 30000
set.seed(1)

# "True" data: gamma distributed, chosen to have mean and SD close to
# the values reported in the question, rounded to three decimals
true_data <- round(rgamma(nn, 40000, 75000), 3)

# 19 simulated datasets: normal with matching mean and SD, rounded the same way
simulated_data <- replicate(19, round(rnorm(nn, mean(true_data), sd(true_data)), 3))

# Insert the true data among the simulated ones (here as column 14)
all_data <- cbind(simulated_data[, 1:13], true_data, simulated_data[, 14:19])

# Plot all 20 histograms in a 5x4 grid with common breaks
opar <- par(mfrow = c(5, 4), mai = c(.5, .1, .1, .1))
for (ii in 1:20) {
  hist(all_data[, ii],
    breaks = seq(min(all_data) - 0.0005, max(all_data) + 0.0005, by = 0.0001),
    yaxt = "n", main = "", xlab = "", ylab = "")
}
par(opar)

You could of course turn something like the "test" above into a formal parametric bootstrap: use a "reasonable" test (e.g., the Shapiro-Wilk test for normality) and calculate its test statistic (discard the p value, it's useless here). Then simulate many normally distributed datasets with the same mean and standard deviation, round them, and again calculate the Shapiro-Wilk test statistic for each. Finally, derive a p value by taking the empirical cumulative distribution function (ECDF) of the bootstrapped test statistics and evaluating it at the test statistic derived from the true data.

In the example above, this gives a result of $p=0.11$. Any significant result here would, again, be a consequence of your large sample size, and would only give you statistical significance, not clinical significance, which you can only assess by placing the entire exercise in the larger context.

Here is a histogram of the Shapiro-Wilk test statistics from 1000 bootstraps, together with the test statistic from the "true" data (from an underlying gamma distribution) inserted as a red line.

[Figure: histogram of bootstrapped Shapiro-Wilk test statistics]

More R code (where I took a sample of size 5000 from the original data of size 30000, because R's shapiro.test cannot handle larger samples):

# Shapiro-Wilk statistic for a subsample of the "true" (rounded gamma) data
shapiro_true <- shapiro.test(sample(true_data, 5000))$statistic

# Bootstrap distribution of the statistic under rounded normal data
shapiro_sim <- replicate(1000,
  shapiro.test(round(rnorm(5000, mean(true_data), sd(true_data)), 3))$statistic)

hist(shapiro_sim, xlim = range(c(shapiro_sim, shapiro_true)))
abline(v = shapiro_true, col = "red", lwd = 2)

# Proportion of bootstrapped statistics <= the "true" one: a one-sided p value
ecdf(shapiro_sim)(shapiro_true)

Stephan Kolassa
  • Thanks for the answer. I was wondering whether this could be sufficient proof to submit to official authorities. But probably yes. I will also run some simulations on the indices I need to compute from the data and see whether there is a strong difference between those calculated with normal data and with rounded data. – lulufofo Mar 23 '23 at 08:58
  • By the way, I do not see where the gamma distribution histogram is; this shows that the visual test is not really helpful in detecting departure from normality, no? – lulufofo Mar 23 '23 at 09:00
  • It will likely not be enough to satisfy regulatory authorities, no. That you can't detect the "outlier" histogram (neither can I!) IMO shows that gamma distributed original data is "very close" to normally distributed data - likely close enough that it will not be a problem in your subsequent analysis. – Stephan Kolassa Mar 23 '23 at 09:03
  • I really need a strong scientific explanation. – lulufofo Mar 23 '23 at 09:05
  • I edited my answer. You can do a parametric bootstrap of a standard goodness-of-fit test, with the null hypothesis based on a normal distribution. This test gives a significant deviation from normality. But again, this is a case of statistical significance likely not being the same as clinical significance. – Stephan Kolassa Mar 23 '23 at 09:15
  • Thanks! I do not really understand: what does ecdf(normality-statistic) give you?

    Just for precision, I am not dealing with clinical data but rather with data from a manufacturing process (filling of syringes).

    – lulufofo Mar 23 '23 at 12:12
  • ecdf is the empirical cumulative distribution function. ecdf(shapiro_sim)(shapiro_true) creates the ECDF of shapiro_sim and evaluates it at shapiro_true. That is, it tells you what proportion of the shapiro_sim vector is smaller than or equal to shapiro_true, which gives you a one-sided p value. It might be more appropriate to do a two-sided test here. Also, when I write of "clinical" significance, that's shorthand for "what really matters" - depending on context, it could also be "business benefit" or whatever. – Stephan Kolassa Mar 23 '23 at 12:19
  • Yes, but the p-value that we obtain with the Shapiro-Wilk test is based on the assumption that the data are normally distributed. I am not sure what this p-value will give us. I do not understand why I can use this p-value to assess that my data are normally distributed or come from a normal distribution. – lulufofo Mar 23 '23 at 12:26
  • That is exactly why I recommended disregarding the p value from the original S-W test. Instead, I am recommending you do a parametric bootstrap. We don't compare the S-W test statistic to its theoretical distribution under the null hypothesis of normally distributed data. Instead, we simulate the distribution of the S-W test statistic under the null distribution of the data coming from rounded normal realizations. – Stephan Kolassa Mar 23 '23 at 12:46
  • Ok, so what you are saying is that the test we are doing will allow us to tell whether the data come from a rounded normal distribution, right? – lulufofo Mar 23 '23 at 13:07
  • Exactly. (With the caveat about statistical vs. "clinical" significance.) – Stephan Kolassa Mar 23 '23 at 15:26
1

You could run a chi-squared test, which partitions the data into categories. As categories, use a fixed number of rounded values that you expect with large enough probability, say $k$ of them, plus "anything larger" and "anything smaller", so you have $k+2$ categories. After estimating the mean and variance, you can compute the expected probabilities for all these categories, and then run a standard chi-squared test with $k-1$ degrees of freedom (the number of categories $k+2$ minus one, as is standard, minus a further two degrees of freedom for having estimated two parameters). The good thing about this is that it is based on the rounding by construction, so the very fact that you have rounded data will not affect it.
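For illustration, here is a minimal R sketch of this construction (the simulated input data, the $\pm 3$ SD grid of rounded values, and all variable names are my own choices):

set.seed(1)
y <- round(rnorm(30000, 0.52, 0.002), 3)  # stand-in for the rounded data
m <- mean(y); s <- sd(y)                  # estimated parameters

# Work in integer thousandths to avoid floating-point equality problems
yi <- round(y * 1000)
vi <- seq(round((m - 3 * s) * 1000), round((m + 3 * s) * 1000))  # the k rounded values

# Expected probability of each rounded value v under N(m, s) is
# P((v - 0.5)/1000 < Y <= (v + 0.5)/1000); add the two tail categories
p_mid <- pnorm((vi + 0.5) / 1000, m, s) - pnorm((vi - 0.5) / 1000, m, s)
probs <- c(pnorm((min(vi) - 0.5) / 1000, m, s), p_mid,
           1 - pnorm((max(vi) + 0.5) / 1000, m, s))

# Observed counts for the k + 2 categories
obs <- c(sum(yi < min(vi)),
         sapply(vi, function(v) sum(yi == v)),
         sum(yi > max(vi)))

# Pearson statistic against chi^2 with k - 1 df:
# (k + 2 categories) - 1, minus 2 for the two estimated parameters
stat <- sum((obs - sum(obs) * probs)^2 / (sum(obs) * probs))
pchisq(stat, df = length(vi) - 1, lower.tail = FALSE)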

Note however that nothing in real life is ever normally distributed, so there is no way to "know" that anything is normally distributed. Neither do you ever "need" normality, as most things that are based on normality work approximately on non-normal data (and those that don't shouldn't be touched with a bargepole). It is only ever necessary (and possible) to detect problematic deviations from normality that will mislead your analysis, and this always depends on the analysis you want to do and how you want to interpret it.

The chi-squared (or any) test of normality will not by any means prove normality; rather, it only asks whether the data can be statistically distinguished from normality, according to what the test statistic measures (which in this case does not measure distortion by rounding). Note in particular that much inference that supposedly "assumes" normality will still work well with many non-normal distributions, so that in your situation normality may be technically rejected but inference may still be fine (although occasionally it may not; but this is a different issue from what a normality test addresses).

  • Thanks for the comment, Christian Hennig. I understand, and have already read multiple articles and documents about the normality assumption and tests. The problem is that the authorities often ask for a normality check with a normality test; unfortunately, I cannot get rid of this step. I was also asking myself whether it is a good idea to use a statistical test with this amount of data (sample size bigger than $10'000$). – lulufofo Jul 25 '23 at 08:25
  • @lulufofo It seems that the problem here is not only that the authorities require a formal test, but also that the formal test doesn't give you the result you want, namely non-rejection. Your suspicion may be right that no test will fail to reject normality in your setup with that many observations (which is what a test is supposed to do if the underlying distribution is in fact not normal, as is pretty much always the case in real life). Maybe you can justify what you do with your data by referring to the Central Limit Theorem, assuming it applies to whatever you want to do that "requires" normality? – Christian Hennig Jul 25 '23 at 10:51
  • This would imply that for large n, what you do behaves approximately in the same way as if your data were normal, even if in fact they aren't. – Christian Hennig Jul 25 '23 at 10:52
  • But why could I invoke the central limit theorem? Isn't it only for the mean, i.e. the distribution of the mean (or the standardized data) converges in distribution to a normal distribution? – lulufofo Jul 25 '23 at 11:09
  • @lulufofo It depends on what exactly you want to do, but asymptotic normality derived from the CLT applies to many estimators, for example general maximum likelihood. Also, some other results that hold precisely under normality hold approximately/asymptotically for non-normal large samples, such as the chi-squared distribution of likelihood ratio statistics. – Christian Hennig Jul 25 '23 at 14:13
  • Thanks for your answer. Are there any documents that show which statistics or estimators are 'affected' by the CLT, and which tests can still be valid with a large enough sample size?

    (I am mainly doing capability analysis, computing the indices Ppk and Cpk.)

    – lulufofo Jul 25 '23 at 14:18
  • @lulufofo I don't know about these unfortunately. You may want to ask another question here. – Christian Hennig Jul 25 '23 at 14:51