3

How can I test whether a sample was drawn from a specific discrete distribution?

For example, if I have the following distribution

1- 0.2  
2- 0.5 
3- 0.3

and I get the following sample, [2,2,2,1,1,3,2,2,1] ( the order is not important )

How can I test whether the sample was created from the distribution? Or, how can I reject the hypothesis that the sample came from the distribution?

Thanks.

EDIT

What do you think about the following python code?

In the code I create 100,000 samples.

Each sample has size 9 and is drawn from the discrete distribution above, for example Sample1 = [1,2,2,1,3,1,1,1,2]

Sample2 = [2,1,2,2,3,2,1,1,2]

Sample3 = [1,2,2,1,2,1,2,1,1] ...

Now I count how many repetitions I have of each sample

(the order does not count: [1,1,1,1,1,1,2,2,2] == [2,2,1,1,1,1,1,2,1]).

This gives me the probability of each sample. After I have the probability of each sample, I sum all the probabilities that are lower than the probability of my sample (the original sample in the question). Is this my p-value? (I got the idea from this link: http://en.wikipedia.org/wiki/Multinomial_test)

from collections import Counter
import numpy as np


NumberOfRuns = 100000
# Draw the samples; sort each one into a tuple so that order is ignored
z = [tuple(sorted(np.random.choice([1, 2, 3], size=9, p=[0.2, 0.5, 0.3])))
     for _ in range(NumberOfRuns)]

zz = Counter(z)  # Count how many there are from each option.
observed = zz[(1, 1, 1, 2, 2, 2, 2, 2, 3)]  # frequency of the original sample

# Following the direction in this link: http://en.wikipedia.org/wiki/Multinomial_test
# sum the frequencies of all samples that occur no more often than my sample
Psig = sum(count for count in zz.values() if count <= observed)
print('The sample is more common than', Psig / NumberOfRuns,
      'of all other samples; if this is above 5% you cannot reject the hypothesis')
Oren
    Have you heard of the chi-square test? – Xi'an Jan 16 '15 at 13:12
  • 2
    I strongly recommend you to read things related to hypothesis testing, starting from here: https://en.wikipedia.org/wiki/Lady_tasting_tea – chuse Jan 16 '15 at 13:15
  • 5
    I am not a statistician even though I post comments and answers here. If your discrete distribution has values $\{x_1, x_2, \ldots, x_n\}$ and your sample includes some $x \notin \{x_1, x_2, \ldots, x_n\}$, you can be sure that the sample did not come from the desired distribution. Otherwise, if all the sample values are from the set $\{x_1, x_2, \ldots, x_n\}$, there is no way that you can be sure that the sample did not come from the distribution: you might end up being fairly confident (say $99.999\%$ confident) but there is a huge gap between confident and sure. – Dilip Sarwate Jan 16 '15 at 14:33
  • In addition to following the advice in the comments above, you might check my related answer. @DilipSarwate, I think that the OP might be implying not a specific discrete distribution, but a distribution family. This is just a guess, though. – Aleksandr Blekh Jan 16 '15 at 15:27

2 Answers

3

A test won't tell you that a sample did come from a given distribution, but one might lead you to conclude that it did not. You might, however, conclude that a sample is consistent with having come from a given distribution.

If your outcomes {1,2,3} are simply category labels, so that under random sampling your sample could be regarded as multinomial, the most common test for this would be the chi-square goodness of fit test.
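As a sketch of how that might look in Python (scipy's `chisquare` function is assumed; the counts and expected values below come from the distribution and sample in the question):

```python
from collections import Counter
from scipy.stats import chisquare

sample = [2, 2, 2, 1, 1, 3, 2, 2, 1]
probs = [0.2, 0.5, 0.3]  # null probabilities for outcomes 1, 2, 3

counts = Counter(sample)
observed = [counts.get(k, 0) for k in (1, 2, 3)]   # observed counts: [3, 5, 1]
expected = [p * len(sample) for p in probs]        # expected counts: [1.8, 4.5, 2.7]

stat, pvalue = chisquare(f_obs=observed, f_exp=expected)
print("chi-square = {:.3f}, p = {:.3f}".format(stat, pvalue))
```

Note that with expected counts this small the asymptotic chi-square approximation is unreliable, which is one reason a larger sample (or an exact test) is preferable here.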

An alternative (that's perhaps not recommended as often as it should be) is the G-test.
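In scipy the G-test can be obtained from `power_divergence` with `lambda_="log-likelihood"` (a sketch, using the counts from the question's sample; not part of the original answer):

```python
from scipy.stats import power_divergence

observed = [3, 5, 1]         # counts of outcomes 1, 2, 3 in [2,2,2,1,1,3,2,2,1]
expected = [1.8, 4.5, 2.7]   # 9 * [0.2, 0.5, 0.3]

# lambda_="log-likelihood" selects the G statistic: 2 * sum(O * ln(O / E))
g, pvalue = power_divergence(observed, f_exp=expected, lambda_="log-likelihood")
print("G = {:.3f}, p = {:.3f}".format(g, pvalue))
```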

In either case, you'll typically need substantially larger samples if you hope to have much power.

In cases where the category labels are interval or ratio (particularly when there are many categories rather than just 3), you may have more interest in smooth alternatives (e.g. location-shift, scale-shift, etc. alternatives), in which case more powerful tests are available. If you're primarily interested in a location-shift-like alternative with 3 categories, you might consider this (but otherwise I'd just stick with one of the above tests, as there's little to gain).

An alternative might be an adapted version of a Kolmogorov-Smirnov or an Anderson-Darling test (they must be adapted because of the discreteness of the distribution -- the usual tests don't have the tabulated distribution). Simulation would be a reasonable way to get the p-value for a test of this particular discrete null.
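A minimal sketch of that simulation approach for a KS-type statistic under the question's discrete null (numpy is assumed; the statistic is evaluated only at the support points because the distribution is discrete):

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.5, 0.3])
cdf = np.cumsum(probs)               # null CDF at the support points 1, 2, 3
sample = np.array([2, 2, 2, 1, 1, 3, 2, 2, 1])
n = len(sample)

def ks_stat(x):
    # Largest gap between the empirical CDF and the null CDF,
    # evaluated at the support points of the discrete distribution
    ecdf = np.array([(x <= k).mean() for k in (1, 2, 3)])
    return np.abs(ecdf - cdf).max()

d_obs = ks_stat(sample)

# Simulate the null distribution of the statistic to estimate a p-value
sims = [ks_stat(rng.choice([1, 2, 3], size=n, p=probs)) for _ in range(10000)]
pvalue = np.mean([d >= d_obs for d in sims])
print("D = {:.3f}, simulated p ~ {:.3f}".format(d_obs, pvalue))
```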

Glen_b
  • Hi @glen_b can you look at my edit ? – Oren Jan 19 '15 at 15:36
  • Your edit says "What do you think about the following Python code" ... mostly what I think is that I don't know Python. – Glen_b Jan 19 '15 at 16:59
  • Sorry @Glen_b .. I have now added some text that explain what I did.. Is it correct? – Oren Jan 19 '15 at 17:10
  • Your description was unclear. However, it's quite a different question from the one you started with and should probably be a new question. – Glen_b Jan 19 '15 at 17:41
  • I think that there are too few samples to recommend the chi-square goodness-of-fit test – Finn Årup Nielsen Jan 19 '15 at 17:42
  • If the data in your question is your actual data, no test will be much use to you. I assumed that you were simply giving an example. I presume you're trying to do the 'exact test' version of the multinomial test, is that right? – Glen_b Jan 19 '15 at 17:46
  • Hi @Glen_b, thank you for the help. I am really sorry that I do not manage to explain myself correctly.. Maybe now it is more clear? This is not the actual data but the actual data is similar. I do not know what is 'exact test' – Oren Jan 19 '15 at 17:53
  • In the wikipedia page for the multinomial test that you gave a link to, search for the phrase 'exact test'. Is the statistic at that point there the one you attempted to implement? – Glen_b Jan 19 '15 at 18:01
  • Hi, Yes... did not notice the link – Oren Jan 19 '15 at 18:08
  • In your own question it says "I got the idea from this link". You didn't notice that there was a link in your question, right after you referred to it? – Glen_b Jan 19 '15 at 23:19
  • Hi @Glen_b I did not notice that this was the name of the test.. Is my test valid ? – Oren Jan 20 '15 at 09:11
  • 1
    Your description of it looks correct. – Glen_b Jan 20 '15 at 14:16
2

The exact multinomial test is feasible with this few observations. With N=9 and three categories there are only 55 possible outcomes, so you can compute the probability of each of the 55 cases directly.

from math import factorial
from collections import Counter

# https://en.wikipedia.org/wiki/Multinomial_test
def p(probs, counts): 
    result = factorial(sum(counts))
    for prob, count in zip(probs, counts):
        result *= prob ** count / factorial(count)
    return result

z = [b for a, b in sorted(Counter([2,2,2,1,1,3,2,2,1]).items())]
possibles = [(i, j, k) for i in range(10) for j in range(10) for k in range(10) if i + j + k == 9]
assert tuple(z) in possibles

P0 = p((0.2, 0.5, 0.3), z)
Psig = 0
for possible in possibles:
    P = p((0.2, 0.5, 0.3), possible)
    print("{}: {}".format(possible, P))
    if P <= P0:
        Psig += P

print("p-value: {}".format(Psig))