6

I prove a theorem under the assumption that some random variable $X$ is Gaussian. In practice, my experiments section uses real-world samples of $X$, and I want to claim that these samples come from an approximately Gaussian distribution, so that the conditions of the theorem approximately hold. Of course, in real life nothing is exactly Gaussian, only approximately so.

Now there are normality tests such as [link], but they all take as the null hypothesis that the samples were drawn from a Gaussian distribution, so at most I can fail to reject this hypothesis (when $p > 0.05$); that still doesn't mean my samples were drawn from an almost-Gaussian distribution.

Any ideas what is the standard way to quantitatively show that my samples were drawn from an approximately Gaussian distribution?

cruvadom

4 Answers

8

This is not a full answer to the question, because it denies the premise of the question.

Some things can be very close to Gaussian without basic properties that follow from normality holding. For example, consider a mixture that's $(1-\varepsilon)$ from a standard normal and $\varepsilon$ from a standard Cauchy, for arbitrarily small but fixed $\varepsilon>0$.

Then (for example) our distribution has no finite positive integer moments (i.e. none of the usual moments are finite). So while some property may hold approximately for a wide class of distributions more or less similar to the normal, a distribution can be arbitrarily close to normal (in a particular sense, such as in terms of the cdf) and still lack a good many simple properties that the normal has. Here's an example showing population distribution and density functions for a normal and a normal mixed with a little bit of Cauchy:

[Figure: normal cdf vs. the 0.99 normal + 0.01 Cauchy cdf, showing essentially no distinguishable difference]

NB: the $\epsilon$ on the plot (the bound on the supremum of the absolute difference in cdfs) is a smaller value than $\varepsilon$, the mixture weight on the contaminating distribution.

[Figure: normal pdf vs. the 0.99 normal + 0.01 Cauchy density, showing essentially no distinguishable difference]

They're pretty hard to distinguish visually at the population level.

Some properties will hold just fine in a situation like this, but others will not.

This suggests that you have to be very careful about how sensitive the properties are to the particular kinds of non-normality you might have.
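The closeness described above can be checked numerically. Here is a minimal sketch (not from the answer itself), assuming SciPy is available; the 0.01 contamination weight matches the plotted example:

```python
# Sketch: measure how close the contaminated mixture
# 0.99*N(0,1) + 0.01*Cauchy(0,1) is to N(0,1) in sup-cdf (Kolmogorov)
# distance, even though the mixture has no finite moments at all.
import numpy as np
from scipy import stats

x = np.linspace(-10, 10, 100001)  # grid covering the bulk of both cdfs
normal_cdf = stats.norm.cdf(x)
mixture_cdf = 0.99 * stats.norm.cdf(x) + 0.01 * stats.cauchy.cdf(x)

# Supremum of the absolute cdf difference over the grid
d = np.max(np.abs(normal_cdf - mixture_cdf))
print(f"sup |F_mix - F_norm| over grid ~ {d:.5f}")  # well below the 0.01 weight
```

The sup-cdf distance comes out roughly an order of magnitude smaller than the mixture weight itself, which is why the two curves are visually indistinguishable despite the mixture having no finite moments.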

Glen_b
  • Does the $\varepsilon$ on your plots have the same value as your $\epsilon$ in the text, or do you just mean two small numbers? – Dave Jan 17 '24 at 00:23
  • Thank you for pointing that out. They're not meant to be the same though the statement is plainly true if they were the same. I should have used distinct symbols. The epsilon ($\epsilon$) in the text is an unspecified small constant. The epsilon I had intended to use ($\varepsilon$) in the plot is a distinct value from that. The $\epsilon$ in the plotted example is $0.01$ for the contaminating mixture weight for illustration and the resulting bound on the supremum of the absolute difference in cdfs ($\varepsilon$) is a different, much smaller number than $\epsilon=0.01$. I will need to edit. – Glen_b Jan 17 '24 at 01:53
  • 1
    My eye kept seeing $\varepsilon$ on the plot for some reason but I probably should have chosen $\delta$ for that. I've swapped the initial $\epsilon$ to $\varepsilon$ which is not ideal but it saves me redoing the plot and at least they're not the exact same symbol now. I'll ponder a bigger edit. I appreciate the helpful question there. – Glen_b Jan 17 '24 at 02:01
6

Given the problematic nature of normality tests, a normal QQ plot is always a good idea. The QQ plot compares sample quantiles with theoretical quantiles; if your data lie on, or approximately on, a straight line, you can claim approximate normality. See also How to interpret a QQ plot.
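As a concrete sketch (assuming SciPy is available), `scipy.stats.probplot` computes the QQ-plot points and also reports the correlation $r$ of those points, a simple numeric summary of how close the sample quantiles are to the theoretical straight line:

```python
# Sketch: normal QQ-plot points and their correlation for toy data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=500)  # toy stand-in for real data

# osm: theoretical normal quantiles; osr: ordered sample values.
# slope/intercept estimate the scale/location; r is the point correlation.
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")

print(f"QQ correlation r = {r:.4f}")  # r near 1: points near a straight line
# To draw the plot itself, pass plot=plt.gca() (with matplotlib imported).
```

An $r$ very close to 1 corresponds to the "approximately a straight line" visual check, though as the comments below note this is a descriptive summary, not a formal test.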

JohnK
  • But this is not a test, and can't actually quantitatively show the Gaussian assumption. It's more like a visual check. – SmallChess Mar 10 '17 at 06:15
  • @StudentT The OP is well aware of the normality tests but is looking for something different, at least according to my reading of his second paragraph. – JohnK Mar 10 '17 at 07:05
  • Is there any systematic research/reference that shows that people do better with a QQ plot than with any formal normality test? I doubt it. (For sure it will strongly depend on who the people are.) – Christian Hennig Jan 17 '24 at 00:02
1

The question is a very good one, as you realise that not rejecting normality in a normality test does not guarantee that the "true" distribution is "approximately Gaussian".

Unfortunately my answer to your question will be disappointing and to some extent worrying.

If you wanted to show convincingly, backed up by theory, that anything is truly "approximately Gaussian", you'd need to define formally what that means. Such a definition would normally state that the data come from some true distribution $P$ and involve a distance measure $d$ between distributions. Then it would state that there exists a normal distribution $Q$ with appropriate parameters so that $d(P,Q)$ is small enough.

Unfortunately, regardless of what $d$ is, no such thing can be shown for real data. For starters, the existence of a true underlying distribution $P$ is an idealisation that cannot be verified in reality. Under the usual frequentist interpretation of probability, the "real existence" of any probability would require infinite repetition of the data generation (in fact not only infinite but also "random" in a certain sense, which is hard to define, as the literature on the foundations of probability has long discussed), and in reality this does not happen. Any repetition is finite, and it cannot be guaranteed to be "random". So the very existence of a true probability distribution cannot be guaranteed, and that would be a prerequisite for making sure that this distribution is "approximately" anything.

What you can do is make sure that $d(P_n,Q)$ is small enough, where $P_n$ is the empirical distribution of your observed data points. This is in fact what some normality tests do. For example, the Kolmogorov-Smirnov test rejects normality if $d(P_n,Q)$ is too large, where $d$ is the Kolmogorov distance.
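To make this concrete, here is a sketch (assuming SciPy is available) of the Kolmogorov distance $d(P_n,Q)$ computed by hand as the supremum of $|P_n - Q|$, which agrees with the statistic of the one-sample Kolmogorov-Smirnov test:

```python
# Sketch: Kolmogorov distance between the empirical cdf and N(0,1),
# computed directly and via the one-sample KS test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = np.sort(rng.standard_normal(200))
n = len(sample)

# Sup-distance between the step ECDF and the N(0,1) cdf: check the
# ECDF just after and just before each jump.
cdf = stats.norm.cdf(sample)
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
d_minus = np.max(cdf - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

# The same quantity is the statistic of the one-sample KS test.
ks = stats.kstest(sample, "norm")
print(f"manual D = {d_manual:.5f}, scipy D = {ks.statistic:.5f}")
```

So a KS test at a fixed level is exactly a check that $d(P_n, Q)$ does not exceed a sample-size-dependent threshold.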

This unfortunately is weaker than what you want, because it does not imply that theory for a Gaussian random variable (i.i.d. draws from it) will approximately apply. See the answer of @Glen_b for an example. Also, you cannot secure independence (see https://doi.org/10.1007/s00362-023-01414-3), and any theory will fail if independence is critically violated; the very idea that $P_n$ represents $P$ well relies on i.i.d. sampling, via the Glivenko-Cantelli theorem. "Everything depends on everything else" in the real world, said Thich Nhat Hanh, and I think he was right about this.

Unfortunately, despite the limitations of normality tests, you can hardly come closer to what you want than they do.

Ultimately we need to resort to Popper's idea of falsification. We cannot positively make sure that our theory holds, we can only put it to certain tests and see whether it is rejected.

As a side remark, it is also good to keep in mind that for much theory based on normality, only certain deviations from normality (such as gross outliers, strong skewness, or issues with independence) are problematic, whereas other deviations don't destroy results, at least not with reasonably large samples, thanks to the Central Limit Theorem. Standard tests of normality are not necessarily sensitive to the right issues, namely those that really cause trouble.
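The CLT point can be illustrated with a small simulation (a sketch; the exponential distribution and sample size are arbitrary choices here): even for clearly non-normal data, the sampling distribution of the mean is already close to symmetric for moderate $n$.

```python
# Sketch: skewness of raw exponential data vs. skewness of means of
# n=100 draws, illustrating the Central Limit Theorem at work.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
raw = rng.exponential(scale=1.0, size=20000)             # strongly right-skewed
means = rng.exponential(scale=1.0, size=(20000, 100)).mean(axis=1)

skew_raw = stats.skew(raw)      # near 2, the exponential's skewness
skew_means = stats.skew(means)  # near 2/sqrt(100) = 0.2 by the CLT
print(f"skewness of raw data: {skew_raw:.2f}")
print(f"skewness of n=100 sample means: {skew_means:.2f}")
```

This is why procedures based on means tolerate mild non-normality, while the gross outliers and dependence problems mentioned above remain dangerous.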

Furthermore, there is what I call the "misspecification paradox": if you use your data to decide whether they are "normal enough", and only when they look normal enough apply a procedure whose theory assumes normality, you add a further problem. Standard theory does not take into account that the procedure was selected based on the data, and that selection in itself constitutes a violation of i.i.d. normality. See https://doi.org/10.52933/jdssv.v3i3.73

So the situation is a mess really, but of course still probability modelling helps as long as it helps...

0

"Approximately" is an essential part of any assumption because a Gaussian distribution extends from negative infinity to positive infinity. Most variables in most scientific fields have logical or physical limits, so can't possible be exactly Gaussian.

As you point out, normality tests don't ask whether a distribution is close enough to Gaussian. I think this is a real gap in the statistical toolchest, and an opportunity for someone to develop new tests or guidelines.