Say that I have a real-valued discrete distribution $p(x)$ and $N$ samples, $x_1, \ldots, x_N$, and I want to test whether the samples came from the distribution without making any further assumptions whatsoever. Note that there are very few samples; in the application that motivated this post, we have $N = 5$, so Kolmogorov-Smirnov and chi-squared tests are not expected to have much power.
I had a simple idea for doing this under the assumption that one can sample efficiently from $p(x)$. Being a bit statistically naive, I'm having difficulty figuring out whether this idea already exists in the literature, and I'm hoping someone can point me to the right resource.
The idea in a nutshell is to compare the self-information of the sample, $\hat{I}_N$, to the distribution of the self-information $I_N$ of $N$ random samples from $p(x)$. Formally, recall that the self-information of a random variable $X$ having distribution $p$ is given by
$$ I = - \log p(X), $$
and the self-information of $N$ iid random variables $X_1,\ldots,X_N$, each having distribution $p(x)$, is given by
$$ I_N = -\sum_{k=1}^N \log p(X_k). $$
The self-information of the sample data we have, $x_1,\ldots,x_N$, is denoted
$$ \hat{I}_N = -\sum_{k=1}^N \log p(x_k). $$
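In code this is straightforward. Here is a minimal sketch in Python, assuming $p$ is available as a dict mapping each support point to its probability (the name `self_information` is mine, not standard):

```python
import numpy as np

def self_information(pmf, xs):
    """Total self-information -sum_k log p(x_k) of the sample xs,
    where pmf is a dict mapping each support point to its probability."""
    return -sum(np.log(pmf[x]) for x in xs)
```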
As a test statistic one might consider $\hat{C} = C(\hat{I}_N)$, where $C(s) = \mathbb{P}(I_N < s)$ is the cumulative distribution function of $I_N$. If $\hat{C}$ is very close to 0 or 1, then $\hat{I}_N$ is extreme relative to the distribution of $I_N$, and the sample is unlikely to have come from $p(x)$. For concreteness, one could use conventional thresholds, such as rejecting when $\hat{C} > 0.95$ or $\hat{C} < 0.05$.
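To make the proposal concrete, here is a sketch of the test using a Monte Carlo estimate of $C$. It reuses the `self_information` helper above; `self_information_test` and `num_mc` are names I made up for illustration:

```python
def self_information_test(pmf, sample, num_mc=100_000, seed=0):
    """Estimate C_hat = P(I_N < I_hat) by Monte Carlo, where I_N is the total
    self-information of N fresh draws from pmf and I_hat is that of `sample`."""
    rng = np.random.default_rng(seed)
    probs = np.array(list(pmf.values()))
    n = len(sample)
    i_hat = self_information(pmf, sample)
    # num_mc samples of size n, drawn as indices into the support of pmf
    draws = rng.choice(len(probs), size=(num_mc, n), p=probs)
    i_null = -np.log(probs[draws]).sum(axis=1)
    c_hat = float(np.mean(i_null < i_hat))
    # Reject if c_hat > 0.95 or c_hat < 0.05 (the thresholds mentioned above)
    return i_hat, c_hat
```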
As an example application of this test, say that $p(x)$ is some strange multimodal distribution, and the samples $x_1,\ldots,x_N$ all lie in the valleys between the humps. It is not clear that any standard test from the literature is well suited to such a problem, but intuitively the samples are unlikely to have come from $p(x)$ because the values $p(x_k)$ are so small, or equivalently, because the self-information $-\log p(x_k)$ is much larger than typical. In terms of the above discussion, $\hat{C}$ will be very close to 1, and the hypothesis will be rejected.
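As a toy version of this scenario, one might try something like the following (the bimodal pmf and the valley sample are made up purely for illustration):

```python
# Hypothetical bimodal pmf on {0, ..., 10}: humps near 2 and 8, a valley around 5
support = np.arange(11)
weights = np.exp(-0.5 * (support - 2) ** 2) + np.exp(-0.5 * (support - 8) ** 2)
pmf = dict(zip(support, weights / weights.sum()))

valley_sample = [4, 5, 5, 6, 5]  # N = 5 points sitting between the humps
i_hat, c_hat = self_information_test(pmf, valley_sample)
print(c_hat)  # should come out very close to 1, so the test rejects
```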
The main problem I see with this test is that the distribution of $I_N$ may be difficult to compute. That may be true, but in many cases I would imagine that a few million (or billion) Monte Carlo samples would suffice to get a good approximation of the distribution. Analytical or asymptotic approximations could be used to speed things up, obtain theoretical results, and so on. For example, the first moment of $I_N$ is $N$ times the Shannon entropy of $p$, and higher moments can be computed without great difficulty.
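For reference, since the $X_k$ are iid, the first two moments follow immediately:
$$ \mathbb{E}[I_N] = N\,\mathbb{E}[-\log p(X)] = N H(p), \qquad \operatorname{Var}(I_N) = N \operatorname{Var}\bigl(\log p(X)\bigr), $$
so a central-limit normal approximation to the distribution of $I_N$ is one obvious shortcut, though with $N = 5$ it may be too rough to trust near the tails.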