
I would like to test the following. Suppose I have a normal distribution with mean 1.5 and sigma 0.5 on the interval [0, 3]. Python code:

import numpy as np
import matplotlib.pyplot as plt

mu = 1.5       # mean
sigma = 0.5    # standard deviation
n = 1000       # number of samples

# generate normally distributed random numbers with mean mu and standard deviation sigma
samples = np.random.normal(mu, sigma, n)

# clip the samples to the range 0 to 3
samples = np.clip(samples, 0, 3)

Histogram looks like this:

[histogram of the clipped samples]

But now I add one peak around x = 1.96, and my histogram looks like this:

[histogram with an additional peak around x = 1.96]

Python code for generating this peak is:


n = 70  # number of samples

# generate uniformly distributed random numbers between 0 and 1
added_samples = np.random.rand(n)

# scale the samples to the range 1.96 to 1.99
added_samples = 1.96 + added_samples * (1.99 - 1.96)

all_together = np.append(samples, added_samples)

My question is: is there a statistical test or algorithm that can check for normally distributed data with "one peak"? Some modification of the Shapiro-Wilk test, or something similar. Thanks a lot.

EDIT: What we know beforehand:

  1. Peaks can occur only at three points, let's say x = 1.5, x = 1.96 or x = 2.5. We would like to test whether there is really a peak at any of these x.
  2. We know the mean and standard deviation of the new sample, i.e., after the peak is present.
  3. The additional data are uniformly distributed on the interval [x, x + 0.03] (that is, [1.5, 1.53], [1.96, 1.99] or [2.5, 2.53]).
vojtam
  • Strictly speaking, the normal distribution has unbounded support, so your clipping operation results in a truncated normal distribution. Are the points of truncation known beforehand, or do they need to be estimated from the data? – Stephan Kolassa Mar 27 '23 at 07:26
  • What is your goal with this? Your data is clearly not normal, so this seems like "but other than that piece which doesn't fit with my hypothesis, how does the rest look?". – user2974951 Mar 27 '23 at 07:26
  • My answer to Normality test after rounding could easily be adapted to your case, at least if the truncation is known: either do the "inter-ocular trauma test", or a parametric bootstrap of the Shapiro-Wilk test (or any other goodness of fit test) under the null distribution of a truncated normal. – Stephan Kolassa Mar 27 '23 at 07:30
  • @vojtam, maybe D'Agostino's $K^{2}$ Test will help you. It's described here and can be found as scipy.stats.normaltest. – perepelart Mar 27 '23 at 07:34
  • @StephanKolassa Thank you for your comments. Yes, suppose I would like to test if there is a peak at x = 1.96, so I know it beforehand. I am a little bit confused by the truncated normal distribution you describe. Why does my case lead to a truncated normal distribution? – vojtam Mar 27 '23 at 08:34
  • Ah, so you even know where the peak will be? That changes matters a bit; then it might be possible to create a more powerful test. Can you be as explicit as possible about which information is known beforehand and which needs to be estimated from the data? I'm thinking of the mean, the standard deviation, the upper and lower truncation points, and any knowledge you have of the additional data (in your example, you know that they are uniform in [1.96, 1.99]). It does sound like what you want to test is whether your data come from a mixture of a truncated normal and a uniform distribution. – Stephan Kolassa Mar 27 '23 at 08:37
  • To your question about the truncation: the normal distribution is spread out over the entire real axis. Your np.clip command truncates it, so what you end up with is not normal any more. (Actually, since np.clip does not remove data outside the interval, but maps these data points to the interval boundary, what you end up with is not a truncated normal, but a mixture of a truncated normal with two point masses.) Whether you can disregard this effect or not will depend on how much you cut off. – Stephan Kolassa Mar 27 '23 at 08:43
  • @StephanKolassa Thanks for your answer. See my edit, please. – vojtam Mar 27 '23 at 08:47
  • Great, thanks, we are making progress. Do you know the mean, standard deviation and truncation points of the original data before the "additional data" comes in, or do we need to estimate these from the data? – Stephan Kolassa Mar 27 '23 at 08:49
  • @StephanKolassa We know only the truncation points of the original data, because they cannot change just because additional peaks are added. Unfortunately, we don't know the mean and standard deviation of the original data, so we need to estimate these. – vojtam Mar 27 '23 at 09:02
  • Can I check, what is the goal of testing for normality in this context? I am reluctant to recommend anything based on Shapiro-Wilk or indeed null hypothesis significance testing in general. What is the point of seeing whether a fallible test successfully rejects the null hypothesis when we already know for a fact that the null hypothesis is false (a normal distribution that is then clipped and has another peak inserted is by definition no longer a normal distribution)? If the p-value is low, it tells us nothing we didn't already know, and if it's high, then it's a type II error. – Singularity24601 Mar 27 '23 at 09:39

2 Answers


First, it seems like you should be able to identify the particular alternative hypothesis, i.e., which of the three potential intervals the "additional" data would come from, simply by looking at which of the three intervals contains the most excess data over what we would expect under the null hypothesis of a normal distribution.
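To make this screening step concrete, here is a minimal Python sketch (my sketch, matching the question's setup rather than anything in this answer). The plug-in use of the sample mean and standard deviation is an assumption, and the contamination biases those estimates slightly. Since np.clip only moves tail mass to the boundaries, the expected mass in an interior interval is just the plain normal probability.

import numpy as np
from scipy import stats

# candidate peak intervals from the question's edit
intervals = [(1.5, 1.53), (1.96, 1.99), (2.5, 2.53)]

def excess_counts(data):
    # crude plug-in estimates; the contamination biases them slightly
    mu, sd = data.mean(), data.std(ddof=1)
    # clipping only relocates tail mass to the boundaries, so the expected
    # count in an interior interval is the plain normal mass there
    return {(u, v): int(((data >= u) & (data <= v)).sum())
                    - len(data) * (stats.norm.cdf(v, mu, sd)
                                   - stats.norm.cdf(u, mu, sd))
            for (u, v) in intervals}

# e.g. exc = excess_counts(all_together); the interval with the largest
# excess suggests the natural alternative hypothesis to test against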

Next, probably the best way to continue would be a likelihood ratio test between two competing models for the data:

  1. A mixture between (a) a mixture of a truncated normal distribution and two point masses at 0 and 3, and (b) a uniform distribution on a prespecified interval.
  2. The mixture of the truncated normal and the two point masses alone, i.e., part (a) without part (b) above.

Unfortunately, the likelihood functions involved are a bit icky... maybe someone with more time could set these up. It might be best to bootstrap the test itself once we have the likelihoods.
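In case it helps, here is a rough Python sketch of how those likelihoods could be set up; this is my construction under the two models above, not something worked out in this answer. Exact 0s and 3s contribute the point-mass probabilities, interior points contribute densities, and w is the weight of the uniform "peak" component on a prespecified [a, b] (defaulting to the question's [1.96, 1.99]). x is assumed to be the pooled sample.

import numpy as np
from scipy import stats, optimize

def loglik_null(params, x, lo=0.0, hi=3.0):
    # clipped normal: point masses at lo and hi, plain normal density inside
    mu, sd = params
    if sd <= 0:
        return -np.inf
    ll = np.sum(x == lo) * stats.norm.logcdf(lo, mu, sd)
    ll += np.sum(x == hi) * stats.norm.logsf(hi, mu, sd)
    inner = x[(x > lo) & (x < hi)]
    return ll + stats.norm.logpdf(inner, mu, sd).sum()

def loglik_alt(params, x, a=1.96, b=1.99, lo=0.0, hi=3.0):
    # same clipped normal, mixed with weight w of Uniform(a, b), (a, b) inside (lo, hi)
    mu, sd, w = params
    if sd <= 0 or not 0 <= w < 1:
        return -np.inf
    ll = np.sum(x == lo) * np.log((1 - w) * stats.norm.cdf(lo, mu, sd))
    ll += np.sum(x == hi) * np.log((1 - w) * stats.norm.sf(hi, mu, sd))
    inner = x[(x > lo) & (x < hi)]
    dens = ((1 - w) * stats.norm.pdf(inner, mu, sd)
            + w * ((inner >= a) & (inner <= b)) / (b - a))
    return ll + np.log(dens).sum()

def lr_statistic(x):
    x = np.asarray(x)
    # starting values taken from the question's setup
    n0 = optimize.minimize(lambda p: -loglik_null(p, x), [1.5, 0.5],
                           method="Nelder-Mead")
    n1 = optimize.minimize(lambda p: -loglik_alt(p, x), [1.5, 0.5, 0.05],
                           method="Nelder-Mead")
    return 2 * (n0.fun - n1.fun)  # = 2 * (max loglik_alt - max loglik_null)

# bootstrap the null distribution of lr_statistic, as suggested above, rather
# than relying on a chi-square approximation: under H0 the weight w sits on
# the boundary of the parameter space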


A much simpler alternative would be to use a two-sample Kolmogorov-Smirnov test: compare your data against a simulated data sample under the null hypothesis of that mixture between a truncated normal and two point masses with no "additional data".

Here is an example. I will use R, simply because I am more fluent in that.

# a function to simulate data
simulate_obs <- function(n_orig,m_orig,sd_orig,n_add,a_add,b_add) {
    obs <- c(rnorm(n_orig,m_orig,sd_orig),runif(n_add,a_add,b_add))
    pmin(pmax(obs,0),3)
}

# simulation parameters
n_orig <- 1000
m_orig <- 1.5
sd_orig <- 0.5
n_add <- 70
a_add <- 1.96
b_add <- 1.99

# simulate your "real" data
set.seed(1)
obs_orig <- simulate_obs(n_orig, m_orig, sd_orig, n_add, a_add, b_add)

# simulate under the null hypothesis
null_sample <- simulate_obs(length(obs_orig), mean(obs_orig), sd(obs_orig), 0, 0, 3)

# compare your "real" data to the simulated data
ks.test(obs_orig, null_sample)

In the present case, this gives $p=0.23$, because the "real" data is quite consistent with coming from a "real" uncontaminated distribution.
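For readers following the question's NumPy setup, a rough Python equivalent of the R example might look as follows (my translation; the random seeds differ, so the p-value will not match exactly):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_obs(n_orig, m_orig, sd_orig, n_add, a_add, b_add):
    obs = np.concatenate([rng.normal(m_orig, sd_orig, n_orig),
                          rng.uniform(a_add, b_add, n_add)])
    return np.clip(obs, 0, 3)  # same clipping as pmin/pmax above

obs_orig = simulate_obs(1000, 1.5, 0.5, 70, 1.96, 1.99)  # "real" data
# null hypothesis: same size, fitted parameters, no added peak
null_sample = simulate_obs(len(obs_orig), obs_orig.mean(),
                           obs_orig.std(ddof=1), 0, 0, 3)
print(stats.ks_2samp(obs_orig, null_sample))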

Stephan Kolassa
  • Thanks for your answer. The idea of the likelihood ratio test is interesting; I will think about it. The KS test is not really what I need, because it just checks whether two samples follow the same distribution. In other words, it does not care about my peaks in the data. But again, thanks a lot :) – vojtam Mar 27 '23 at 09:57
  • The KS test between your data and simulated data under the null hypothesis does care about the peak, because the "reference data" explicitly does not contain the second peak. See how I define null_sample. – Stephan Kolassa Mar 27 '23 at 09:59
  • Ah, sorry, I see, thanks! – vojtam Mar 27 '23 at 10:00
  • @StephanKolassa I completely agree with your answer, just out of curiosity, why use the KS test when other tests have advantages? -- see my new question at https://stats.stackexchange.com/questions/610924/why-is-the-kolmogorov-smirnov-test-so-popular-when-other-tests-have-advantages – Number Mar 27 '23 at 20:38
  • @Number: to be honest, it's the first one that comes to mind. I agree that the Anderson-Darling test in particular would be an alternative. – Stephan Kolassa Mar 28 '23 at 06:01
  • @Number the KS test is less powerful if your hypothesis about the type of distribution is correct. – Sextus Empiricus Mar 28 '23 at 12:39
  1. Compute the probability of the set $S=[1.5, 1.53] \cup [1.96, 1.99] \cup [2.5, 2.53]$ under a truncated normal with the given parameters, say $p_0$.

  2. Run a binomial test of $H_0:\ p=p_0$ against the alternative $p>p_0$ on data that are 1 for observations in $S$ and 0 for observations not in $S$.

One can probably improve this a bit if you know that there will be only one peak; what I propose here also works (and is maybe in some sense optimal) if you have peaks at two or all three of these locations.
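Here is a minimal Python sketch of these two steps (my sketch, written under this answer's assumption that the truncated normal's parameters and truncation points are known; as the comments below point out, in the OP's setting the mean and standard deviation would actually have to be estimated):

import numpy as np
from scipy import stats

mu, sd, lo, hi = 1.5, 0.5, 0.0, 3.0           # assumed known here
S = [(1.5, 1.53), (1.96, 1.99), (2.5, 2.53)]  # the three candidate intervals

# step 1: p0 = P(X in S) under the truncated normal
null = stats.truncnorm((lo - mu) / sd, (hi - mu) / sd, loc=mu, scale=sd)
p0 = sum(null.cdf(v) - null.cdf(u) for (u, v) in S)

# step 2: one-sided binomial test on the indicator of falling in S
def peak_test(data):
    in_S = np.zeros(len(data), dtype=bool)
    for (u, v) in S:
        in_S |= (data >= u) & (data <= v)
    return stats.binomtest(int(in_S.sum()), n=len(data), p=p0,
                           alternative="greater")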

  • Thanks, but how do you compute step 1? We don't know the parameters (mean and standard deviation) of the original truncated normal distribution, right? We only know what the new distribution with a peak looks like. – vojtam Mar 28 '23 at 07:08
  • @vojtam The question says "Suppose I have a normal distribution with mean 1.5 and sigma 0.5 on interval [0, 3]", so I thought you know them? – Christian Hennig Mar 28 '23 at 16:30
  • Unfortunately not; please look at the last paragraph, "What we know beforehand". – vojtam Mar 29 '23 at 09:55
  • @vojtam OK, sorry, my answer was then based on not fully understanding what information is known and what isn't. In principle it should be possible to estimate the parameters of a normal distribution from the data excluding the "critical" intervals, using the information about what is excluded, but this is unfortunately a nonstandard problem and I don't have a solution for it right now. – Christian Hennig Mar 29 '23 at 10:33