Test uniformity after adding some non-uniform data in Python

Question

I generate N1 numbers from a uniform distribution on an interval [low, high] using NumPy, and combine them with some non-uniform numbers from the same interval, which may be repeated.

The repetition arises because I have only 54 unique predefined numbers in [low, high], and I draw the non-uniform numbers from this set. They are not generated from any uniform distribution; they are generated from a chaotic map.

Now, I test the uniformity of this resultant data:

import numpy as np
from scipy import stats as scipy_stats

data = np.concatenate((nonuniform_data, np.random.uniform(low, high, [N1])))
_stats, p = scipy_stats.kstest(data, scipy_stats.uniform(loc=low, scale=high - low).cdf)
# p -> 0.5287917616633173
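
For anyone who wants to run the snippet above, here is a minimal self-contained sketch. The interval, the sample sizes, and the `pool` of 54 values are assumptions made for illustration: the real 54 values come from a chaotic map that is not shown in this post, so a Beta(0.5, 0.5) draw stands in for them.

```python
import numpy as np
from scipy import stats as scipy_stats

rng = np.random.default_rng(0)

low, high = 0.0, 10.0   # hypothetical interval
N1 = 2450               # number of uniform draws
n_nonuniform = 50       # number of non-uniform draws

# Hypothetical stand-in for the 54 predefined values (NOT the real chaotic-map output).
pool = low + (high - low) * rng.beta(0.5, 0.5, size=54)

# Non-uniform part: sample with repetition from the 54 fixed values.
nonuniform_data = rng.choice(pool, size=n_nonuniform, replace=True)

# Mix with N1 uniform draws and test the mixture against Uniform(low, high).
data = np.concatenate((nonuniform_data, rng.uniform(low, high, N1)))
null_cdf = scipy_stats.uniform(loc=low, scale=high - low).cdf
statistic, p = scipy_stats.kstest(data, null_cdf)
print(statistic, p)
```

The exact p-value depends on the seed and on the stand-in pool, so it will not match the value quoted above.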

The p-value is always between 0 and 1, as expected. But to check the null hypothesis (that the mixed data come from a uniform distribution), this answer suggests asking "do the p-values come from a uniform distribution?"

I ran kstest (the Kolmogorov–Smirnov test) 2000 times to collect p-values and plotted them.

The plot looks quite uniform. Testing the p-values against a uniform distribution on [0, 1]:

_stats, p = scipy_stats.kstest(p_values, scipy_stats.uniform(loc=0, scale=1).cdf)
# p -> 6.80874492574208e-07
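
The repeated experiment can be sketched as follows. This is not the code I originally ran; the helper name `mixture_p_values`, the interval, and the Beta(0.5, 0.5) stand-in for the 54 chaotic-map values are all assumptions for illustration. Since the plot is not included, the sketch prints histogram bin counts instead.

```python
import numpy as np
from scipy import stats as scipy_stats


def mixture_p_values(pool, n_nonuniform, n_uniform, low, high, repeats, rng):
    """Return one KS p-value per freshly generated mixture sample."""
    null_cdf = scipy_stats.uniform(loc=low, scale=high - low).cdf
    p_values = np.empty(repeats)
    for i in range(repeats):
        data = np.concatenate((
            rng.choice(pool, size=n_nonuniform, replace=True),
            rng.uniform(low, high, n_uniform),
        ))
        p_values[i] = scipy_stats.kstest(data, null_cdf).pvalue
    return p_values


rng = np.random.default_rng(1)
low, high = 0.0, 1.0
# Hypothetical stand-in for the 54 predefined values (NOT the real chaotic-map output).
pool = low + (high - low) * rng.beta(0.5, 0.5, size=54)

p_values = mixture_p_values(pool, 50, 2450, low, high, repeats=2000, rng=rng)

# Roughly equal counts per bin would be consistent with uniform p-values.
counts, _ = np.histogram(p_values, bins=10, range=(0.0, 1.0))
print(counts)

# Second-level test: are the p-values themselves Uniform(0, 1)?
meta_stat, meta_p = scipy_stats.kstest(p_values, scipy_stats.uniform(loc=0, scale=1).cdf)
print(meta_p)
```

A histogram can look flat to the eye while the second-level test still rejects, because 2000 p-values give the test enough power to pick up small departures.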

But to guarantee the uniformity of the mixed data, the p-values themselves should follow a uniform distribution on [0, 1] every time. Hence, I changed the mixture size N1 and applied kstest to the p-values for each value of N1. Running the test for different N1 values with a significance level of 0.1 gave the following results:

$$ \begin{array}{|r|r|r|}
\hline
\text{KS test for p-values (p-value, passed)} & \text{# of non-uniform data} & \text{# of uniform data} \\ \hline
(0.5858507235230785,\ \text{True}) & 50 & 2450 \\ \hline
(0.936998704584743,\ \text{True}) & 100 & 9900 \\ \hline
(0.416989754218446,\ \text{True}) & 150 & 22350 \\ \hline
(0.05388683955413753,\ \text{False}) & 200 & 39800 \\ \hline
(0.02358711070174885,\ \text{False}) & 250 & 62250 \\ \hline
(0.1055808992463183,\ \text{True}) & 300 & 89700 \\ \hline
(0.008437370587141033,\ \text{False}) & 350 & 122150 \\ \hline
(0.0017850674918899055,\ \text{False}) & 400 & 159600 \\ \hline
(0.0004156878457119884,\ \text{False}) & 450 & 202050 \\ \hline
(0.001655399708003342,\ \text{False}) & 500 & 249500 \\ \hline
(0.00036075938599595237,\ \text{False}) & 550 & 301950 \\ \hline
(0.0014847547840858462,\ \text{False}) & 600 & 359400 \\ \hline
(0.00039739769639024827,\ \text{False}) & 650 & 421850 \\ \hline
(0.0002500280144076814,\ \text{False}) & 700 & 489300 \\ \hline
(9.236813390614254 \times 10^{-5},\ \text{False}) & 750 & 561750 \\ \hline
(1.8224958240829835 \times 10^{-5},\ \text{False}) & 800 & 639200 \\ \hline
(6.3006801226360525 \times 10^{-6},\ \text{False}) & 850 & 721650 \\ \hline
(2.7592718787751226 \times 10^{-6},\ \text{False}) & 900 & 809100 \\ \hline
(2.46888962111129 \times 10^{-7},\ \text{False}) & 950 & 901550 \\ \hline
(1.739925789055794 \times 10^{-7},\ \text{False}) & 1000 & 999000 \\ \hline
\end{array} $$
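
A sweep of this kind could look like the sketch below. It is a reconstruction under assumptions, not the code that produced the table: the Beta(0.5, 0.5) pool is a hypothetical stand-in for the 54 chaotic-map values, and `repeats` and the list of sizes are cut down so it runs quickly. The one thing taken directly from the table is the size pattern: in every row, the number of uniform points equals the square of the number of non-uniform points minus that number (for example, 50² − 50 = 2450).

```python
import numpy as np
from scipy import stats as scipy_stats

rng = np.random.default_rng(2)
low, high = 0.0, 1.0
alpha = 0.1      # significance level used for the True/False flag
repeats = 200    # reduced from 2000 to keep the sketch fast

# Hypothetical stand-in for the 54 predefined values (NOT the real chaotic-map output).
pool = low + (high - low) * rng.beta(0.5, 0.5, size=54)
null_cdf = scipy_stats.uniform(loc=low, scale=high - low).cdf

rows = []
for n_non in (50, 100, 150):
    n_uniform = n_non**2 - n_non  # size pattern read off the table
    p_values = [
        scipy_stats.kstest(
            np.concatenate((rng.choice(pool, n_non), rng.uniform(low, high, n_uniform))),
            null_cdf,
        ).pvalue
        for _ in range(repeats)
    ]
    # Second-level KS test of the p-values against Uniform(0, 1).
    meta_p = scipy_stats.kstest(p_values, "uniform").pvalue
    rows.append(((meta_p, bool(meta_p > alpha)), n_non, n_uniform))

for row in rows:
    print(row)
```

Each printed row has the same layout as a table row: the (p-value, passed) pair, then the non-uniform and uniform counts. The p-values themselves depend on the seed and the stand-in pool.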

I was wondering why the p-values do not pass the KS test, even though I increase the amount of uniform data relative to the non-uniform data in every case. Is there any rule for the ratio $\left(\frac{\text{non-uniform data}}{\text{uniform data}}\right)$ that guarantees the mixed data always pass the uniformity test, or at least appear to be generated from a uniform distribution?

It is very difficult to understand what you are doing or asking. Are you perhaps misinterpreting low p-values as being not significant? — whuber, May 08 '22 at 13:01
I was thinking the p-values should be uniform under the null hypothesis, based on here and here. Do low p-values also mean significant in my case? Sorry, I am not very familiar with statistics. I was trying to understand how much non-uniform data can be mixed with uniform data while the result still stays uniform @whuber. Do I need to add more details? Please let me know if that's the case. — falamiw, May 08 '22 at 13:23
As far as I understand this, your underlying distribution is not uniform, so on what basis do you expect the p-values of a uniformity test to be uniform? — Christian Hennig, May 08 '22 at 14:53
I draw a small number of points from the underlying distribution and mix them with a large sample from a uniform distribution. I was guessing that adding the small perturbation wouldn't change the distribution of the larger sample. @ChristianHennig — falamiw, May 08 '22 at 16:11
Now, I just want to know in what proportion I should mix them to guarantee that the mixed data are uniform @ChristianHennig — falamiw, May 08 '22 at 16:25
A small perturbation will obviously also perturb uniformity of the p-value, albeit only to a small extent. With bigger datasets even a small change of the distribution will normally lead to a clear non-uniformity of the p-value (obviously it depends on how exactly your perturbation is defined). I'm afraid there can't be a guarantee of the kind you're looking for. — Christian Hennig, May 08 '22 at 16:52
Any perturbation changes the distribution, so I'm surprised that you expect no change. — Christian Hennig, May 08 '22 at 16:53
Actually, the chaotic map itself generates (pseudo-)random numbers. I was thinking that adding a small perturbation, which itself consists of (pseudo-)random numbers on the same interval, could give the result I expected @ChristianHennig — falamiw, May 08 '22 at 18:18
