
Given $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ drawn from unknown distributions $F(x)$ and $G(x)$ respectively, statistical tests such as the two-sample Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling tests have been devised to test the null hypothesis $\mathcal{H}: F(x) = G(x)$ using various test statistics.

Rather than hypothesis testing, however, I am more interested in quantifying the probability $\mathcal{P}(X_1,\dots,X_n, Y_1,\dots,Y_n \mid F(x) = G(x))$ without knowing $F(x)$ or $G(x)$. Is this possible?

The above tests give $\mathcal{P}(Z\ge{}z | F(x) = G(x))$ where $Z$ is the test statistic, but this is not what I want.
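To make the distinction concrete, here is a minimal sketch (assuming SciPy's `scipy.stats.ks_2samp`, which implements the two-sample Kolmogorov-Smirnov test) of what these tests actually report:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
x = rng.normal(size=100)  # sample from F
y = rng.normal(size=100)  # sample from G (here F = G by construction)

stat, p = ks_2samp(x, y)
# p is P(Z >= z | F = G): the probability, under the null, of a test
# statistic at least as extreme as the one observed. It is NOT the
# probability that F = G given the data.
print(stat, p)
```

The p-value conditions on $F = G$ and describes the test statistic; the quantity asked about above conditions the data on $F = G$ directly, which is a different object.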

I tried to find how the two-sample Kolmogorov-Smirnov test statistic was derived, but nothing useful came up: all the papers I have found either simply state the proposed test statistic or study its distribution under the null hypothesis.

Given order statistics $X_{(1)},\dots,X_{(n)}$ and $Y_{(1)},\dots,Y_{(n)}$, is there anything useful we can say about $X_{(i)} - Y_{(i)}$ irrespective of the distributions $F(x)$ and $G(x)$?
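One relevant observation (a numerical illustration, not a derivation): the order-statistic differences $X_{(i)} - Y_{(i)}$ are *horizontal* distances between the samples and depend on the scale of $F$ and $G$, whereas the Kolmogorov-Smirnov statistic is a *vertical* distance between empirical CDFs and is invariant under any common monotone transformation of both samples, which is what makes it distribution-free under the null:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.sort(rng.normal(size=n))  # order statistics X_(1)..X_(n)
y = np.sort(rng.normal(size=n))  # order statistics Y_(1)..Y_(n)

# Horizontal distance: depends on the measurement scale of F and G.
max_order_diff = np.max(np.abs(x - y))

# Vertical distance: the two-sample KS statistic, computed as the sup of
# |F_n - G_n| over the pooled sample points (the ECDFs are step functions,
# so evaluating at sample points suffices).
pooled = np.concatenate([x, y])
Fn = np.searchsorted(x, pooled, side="right") / n
Gn = np.searchsorted(y, pooled, side="right") / n
ks_stat = np.max(np.abs(Fn - Gn))

# Rescaling both samples changes max_order_diff but leaves ks_stat intact.
scale = 10.0
xs, ys = scale * x, scale * y
Fn2 = np.searchsorted(xs, scale * pooled, side="right") / n
Gn2 = np.searchsorted(ys, scale * pooled, side="right") / n
assert np.isclose(np.max(np.abs(Fn2 - Gn2)), ks_stat)
```

So any statement about $X_{(i)} - Y_{(i)}$ that holds "irrespective of $F$ and $G$" would have to survive arbitrary rescaling, which the raw differences do not.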

Firebug
    On the question: "Is it possible to calculate the probability that two samples are drawn from the same distribution?" -- the short answer is 'no'. How would you distinguish between $F=G$ and something very close to equality but still unequal? The longer answer is "maybe you kind of can do that, if you're a Bayesian". – Glen_b Dec 21 '13 at 20:56
  • @Glen_b I don't mind Bayesian methods... any hints? – David Shih Dec 21 '13 at 20:57
  • Conventionally, you need to have a prior, and some kind of likelihood to obtain a posterior. This is somewhat difficult if you can't specify F and G, but there may be some kind of nonparametric approach possible here (some kind of mixture modelling perhaps?). You might then be comparing a model where all the components were the same to one where there were one or more components in the mixture that were not the same for both. You still don't avoid the problem that you really can't tell F=G from F and G being very very close but different, and I can't think of a good way of avoiding that. – Glen_b Dec 21 '13 at 21:02
  • My question specifically asks about $\mathcal{P}(X_1,\dots,X_n, Y_1,\dots,Y_n \mid F(x) = G(x))$. If I understand Bayes' Rule correctly, this is the likelihood. So how can I calculate this? I want to calculate a probability, since I can't know for sure whether F = G. I'd like to quantify the uncertainty in my belief that F = G given the observed samples. – David Shih Dec 21 '13 at 21:04
  • You might like to take a look at the definition of likelihood. – Glen_b Dec 21 '13 at 21:07
  • You're right. Likelihood is defined based on a set of parameters, not a hypothesis. But more concretely, how would you derive this likelihood you refer to? Thanks for your help. – David Shih Dec 21 '13 at 21:58
  • You're outside my area of expertise, but the nonparametric approach I was suggesting was that you could set up some kind of mixture model for F and G. I can't tell you what kind of mixture exactly (and without that, you don't get a likelihood), because that would require me to make assumptions you won't make explicit; I sure as heck can't make them explicit for you. – Glen_b Dec 21 '13 at 22:54
  • You are trying to transform a matter of distance between two distributions, $d(F,G)$, into an event, $\{F=G\}$, and then assign some probability to it, conditional on some observed data. If you partition the event space into $\{F=G\}$ and $\{F\neq G\}$, even conditional on some data, do you think that the event $\{F=G\}$ can be assigned any other probability except zero? – Alecos Papadopoulos Dec 21 '13 at 23:23
  • Note that the question in the second paragraph is quite different from the question in the title. – Glen_b Sep 05 '17 at 00:29
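To illustrate Glen_b's "maybe you kind of can, if you're a Bayesian" remark, here is a hypothetical toy sketch under a strong *parametric* assumption (Bernoulli data with uniform Beta priors; this is Bayesian model comparison, not the nonparametric answer the question asks for). It assigns a posterior probability to "same distribution" by comparing marginal likelihoods:

```python
import numpy as np
from scipy.special import betaln

# Toy Bayesian version of "P(F = G | data)" under a parametric assumption:
# X_i ~ Bernoulli(p), Y_i ~ Bernoulli(q), with Beta(1,1) priors on p, q.
# M0: p = q (i.e. F = G);  M1: p and q independent (F != G).
def log_marginal(successes, trials):
    # Beta-Bernoulli marginal likelihood with a uniform Beta(1,1) prior.
    return betaln(1 + successes, 1 + trials - successes) - betaln(1, 1)

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])
y = np.array([0, 0, 1, 0, 1, 0, 0, 1])

log_m0 = log_marginal(x.sum() + y.sum(), x.size + y.size)
log_m1 = log_marginal(x.sum(), x.size) + log_marginal(y.sum(), y.size)

# Posterior probability of M0 (F = G) with prior odds 1:1.
post_same = 1.0 / (1.0 + np.exp(log_m1 - log_m0))
print(post_same)
```

Note this only works because the parametric family makes the hypothesis $\{p = q\}$ a lower-dimensional submodel with its own prior mass; fully nonparametrically, as Alecos Papadopoulos points out, the event $\{F=G\}$ typically carries probability zero without such structure.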

0 Answers