
Say that I have a real-valued discrete distribution $p(x)$ and $N$ samples, $x_1, \ldots, x_N$, and I want to test whether the samples came from the distribution without making any further assumptions whatsoever. Note that there are very few samples; in the application that motivated me to make this post, we have $N = 5$. Thus Kolmogorov-Smirnov and Chi-squared tests are not expected to have much power.

I had a simple idea for doing this under the assumption that one can efficiently sample from $p(x)$. Being a bit statistically naive, I'm having difficulty figuring out whether it already exists in the literature, and I'm hoping someone can point me to the right resource.

The idea in a nutshell is to compare the self-information of the sample, $\hat{I}_N$, to the distribution of the self-information $I_N$ of $N$ random samples from $p(x)$. Formally, recall that the self-information of a random variable $X$ having distribution $p$ is given by

$$ I = - \log p(X), $$

and the self-information of $N$ iid random variables $X_1, \ldots, X_N$, each having distribution $p(x)$, is given by

$$ I_N = -\sum_{k=1}^N \log p(X_k). $$

The self-information of the observed sample, $x_1,\ldots,x_N$, is denoted

$$ \hat{I}_N = -\sum_{k=1}^N \log p(x_k). $$
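
In code, computing $\hat{I}_N$ is straightforward. Here is a minimal Python sketch, assuming the pmf is available as a dictionary mapping values to probabilities (the pmf and sample below are made up purely for illustration):

```python
import math

def self_information(sample, pmf):
    """Sum of -log p(x_k) over the sample; pmf maps values to probabilities."""
    return -sum(math.log(pmf[x]) for x in sample)

# Hypothetical pmf and sample, for illustration only.
pmf = {0: 0.5, 1: 0.3, 2: 0.2}
sample = [0, 0, 1, 2, 1]  # N = 5
I_hat = self_information(sample, pmf)
```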

As a test statistic one might consider $\hat{C} = C(\hat{I}_N)$, where $C(s) = \mathbb{P}(I_N < s)$ is the cumulative distribution function of $I_N$. If $\hat{C}$ is very close to 0 or 1, then $\hat{I}_N$ is extreme relative to the distribution of $I_N$, and the sample is unlikely to have come from $p(x)$. For concreteness, we could use the standard thresholds for extreme value tests, such as 0.95 and 0.05 for the high and low ends respectively.
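
Since $C$ will rarely be available in closed form, here is a sketch of how one might estimate $\hat{C}$ by Monte Carlo, reusing `self_information`, `pmf`, `sample`, and `I_hat` from the sketch above; the trial count and seed are arbitrary choices, not part of the method:

```python
import random

def monte_carlo_C(I_hat, pmf, N, trials=1_000_000, seed=0):
    """Estimate C(I_hat) = P(I_N < I_hat) by repeatedly drawing N iid samples from pmf."""
    rng = random.Random(seed)
    values, weights = list(pmf.keys()), list(pmf.values())
    below = 0
    for _ in range(trials):
        draw = rng.choices(values, weights=weights, k=N)
        if self_information(draw, pmf) < I_hat:
            below += 1
    return below / trials

C_hat = monte_carlo_C(I_hat, pmf, N=len(sample))
reject = C_hat < 0.05 or C_hat > 0.95  # two-sided rejection at the usual thresholds
```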

As an example application of this test, say that $p(x)$ is some strange multimodal distribution, and the samples $x_1,\ldots,x_N$ all lie in the valleys between the humps. It is not clear that any standard test from the literature is suitable for such a problem, but intuitively the samples are unlikely to have come from $p(x)$ because the values $p(x_i)$ are so small, or equivalently, because the self-information values $-\log p(x_i)$ are much larger than typical. In terms of the above discussion, $\hat{C}$ will be very close to 1, and the hypothesis will be rejected.
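
To make this concrete, here is a toy version of that scenario, reusing the functions from the sketches above (the bimodal pmf is entirely made up):

```python
# Bimodal pmf on {0,...,10}: humps near 2 and 8, a low-probability valley near 5.
weights = [1, 8, 20, 8, 2, 2, 2, 8, 20, 8, 1]
valley_pmf = {x: w / sum(weights) for x, w in enumerate(weights)}

valley_sample = [4, 5, 5, 6, 4]  # all five observations fall in the valley
C_hat = monte_carlo_C(self_information(valley_sample, valley_pmf),
                      valley_pmf, N=5)
# C_hat should come out very close to 1, so the test rejects.
```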

The main problem I see with this test is that the distribution of $I_N$ may be difficult to compute exactly. In many cases, though, I would imagine that a few million (or billion) Monte Carlo samples would suffice to get a good approximation of the distribution. Analytical/asymptotic approximations could be used to speed things up, obtain theoretical results, etc. For example, the first moment of $I_N$ is $N$ times the Shannon entropy of $p(x)$, and higher moments could be computed without great difficulty.
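
That first moment also gives a cheap sanity check on any Monte Carlo approximation: the empirical mean of the simulated $I_N$ values should be close to $N$ times the Shannon entropy. A sketch, continuing the toy example above:

```python
def entropy(pmf):
    """Shannon entropy of the pmf, in nats."""
    return -sum(p * math.log(p) for p in pmf.values())

# E[I_N] = N * H(p); compare against the mean of the simulated I_N values.
expected_mean = 5 * entropy(valley_pmf)
```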

Paul
  • There are many suitable distribution tests that apply here and are powerful (most likely they are too powerful, but that's another issue altogether), including the Chi-squared test and the Kolmogorov-Smirnov test. So what "resource" do you seek: a reference to some test that looks like yours or some test that will work well in the situations you describe? – whuber Oct 21 '13 at 21:06
  • Oops, I forgot to say that $N$ is very small. My understanding is that Kolmogorov-Smirnov and Chi-squared are not suitable for small $N$, since you need enough samples to give you a reasonable cumulative distribution function or histogram. I don't know a test that applies to this case. Edited post to emphasize this point. – Paul Oct 21 '13 at 21:11
  • Also, I am not wedded to this idea -- I just thought it might be more powerful than K-S or Chi-squared at low sample size. Forming the CDF of 5 samples strikes me as a bit ham-fisted. My boss wants me to perform a test of this sort and I just want the best tool for the job, and I don't care if it's my idea or something totally unrelated. – Paul Oct 21 '13 at 21:27
  • K-S has no problems with small data sets. You are unlikely to find anything more powerful than it unless you make strong distributional assumptions. I don't understand why you need to construct a "reasonable CDF": the relevant CDF comes from accumulating $p(x)$, not from observing samples. I also don't understand what "further assumptions" you could possibly apply, given that $p(x)$ appears fully to specify your reference distribution. These, and several other disconnects, make me worry that you might not have communicated your problem accurately. Perhaps you could provide an example? – whuber Oct 21 '13 at 21:31
  • In K-S the test statistic is $\sup_t |F_N(t) - F(t)|$ where $F_N$ is the empirical CDF of the sample and $F$ is the CDF of $p$, so actually the sample CDF is relevant, and having few samples is a problem since the Kolmogorov distribution is only the asymptotic distribution of the test statistic for large $N$. Of course, almost all tests are justified by asymptotic results, but it was still a concern for me. When I said "further assumptions" I just meant I didn't want to make any distributional assumptions. Any other questions? – Paul Oct 21 '13 at 21:38
  • What you are saying, then, is that the K-S test has little power when $N=5$. But so do all other distributional tests. A good way to overcome the problem of knowing only the asymptotic distribution is with Monte Carlo simulation. In effect, then, it seems that you ought to be wondering whether there is a better test statistic than either the K-S or the $\chi^2$ to test your particular null hypothesis. – whuber Oct 21 '13 at 23:27

0 Answers