How do I test whether two multivariate distributions are sampled from the same underlying population?

Question

Say you are given two multivariate data sets, an old one and a new one, that are supposed to have been generated by the same process (for which you have no model), but perhaps something went awry somewhere along the line of collecting or creating the data. You wouldn't want to use the new data as, say, a validation set for the old data, or add it to the old data.

You can run a bunch of 1-d tests (one per variable), e.g. the Wilcoxon rank-sum test, and apply a multiple-testing correction, but I'm not sure that's optimal (it doesn't capture the intricacies of multivariate data, let alone the multiple-testing issues). Another way is to train a classifier and see whether it can discriminate between the two data sets (assuming the classifier is close to optimal). That does seem to work, but a) perhaps there's a better way, and b) it isn't really designed to tell you why the data sets differ (if nothing else, it will use the best predictors and possibly miss other good predictors that were subsumed by the better ones).
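For concreteness, the classifier idea can be sketched as follows. This is only a minimal illustration under assumptions of my own: the classifier is a leave-one-out 1-nearest-neighbour rule with Euclidean distance (chosen for brevity, not because it is optimal), significance comes from permuting the old/new labels, and the function names `nn_same_label_rate` and `classifier_two_sample_test` are hypothetical.

```python
import math
import random


def nn_same_label_rate(nearest, labels):
    """Fraction of points whose nearest neighbour carries the same label.

    This equals the leave-one-out accuracy of a 1-NN classifier.
    """
    same = sum(labels[i] == labels[j] for i, j in enumerate(nearest))
    return same / len(labels)


def classifier_two_sample_test(old, new, n_perm=500, seed=0):
    """Return (accuracy, p_value) for telling `old` rows from `new` rows.

    `old` and `new` are sequences of equal-length numeric tuples.
    """
    rng = random.Random(seed)
    pooled = list(old) + list(new)
    labels = [0] * len(old) + [1] * len(new)

    # Nearest neighbours depend only on the points, not on the labels,
    # so they are computed once and reused for every permutation.
    nearest = []
    for i, p in enumerate(pooled):
        others = (k for k in range(len(pooled)) if k != i)
        nearest.append(min(others, key=lambda k: math.dist(p, pooled[k])))

    observed = nn_same_label_rate(nearest, labels)

    # Under the null hypothesis the labels are exchangeable, so shuffling
    # them shows how large the accuracy gets by chance alone.
    shuffled = labels[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if nn_same_label_rate(nearest, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

An accuracy near the chance level with a large p-value gives no evidence that the two sets differ; an accuracy well above chance with a small p-value suggests they do. Like any classifier-based check, this says little about which variables drive the difference.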

score 4 · Answer 1 · edited Feb 11 '23 at 15:13

http://131.95.113.139/courses/multivariate/mantel.pdf

The linked document discusses two possible ways of doing just that if your datasets are the same size. The basic approach is to compute a distance between your two observed matrices and then use a permutation test to determine whether that distance is significant.
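The general "distance plus permutation test" recipe can be sketched as below. This is not the procedure from the linked document: as a stand-in I use the energy distance between the two samples, which is my own choice of statistic, and the names `energy_statistic` and `permutation_distance_test` are hypothetical.

```python
import math
import random


def energy_statistic(dist, idx_a, idx_b):
    """Energy distance 2*E|A-B| - E|A-A'| - E|B-B'| from pooled distances."""
    def mean_dist(rows, cols):
        total = sum(dist[i][j] for i in rows for j in cols)
        return total / (len(rows) * len(cols))

    return (2 * mean_dist(idx_a, idx_b)
            - mean_dist(idx_a, idx_a)
            - mean_dist(idx_b, idx_b))


def permutation_distance_test(old, new, n_perm=300, seed=0):
    """Return (statistic, p_value) for the distance between two samples."""
    rng = random.Random(seed)
    pooled = list(old) + list(new)
    n_old = len(old)

    # Pairwise Euclidean distances of the pooled rows, computed once.
    dist = [[math.dist(p, q) for q in pooled] for p in pooled]

    idx = list(range(len(pooled)))
    observed = energy_statistic(dist, idx[:n_old], idx[n_old:])

    # Reassign rows to the two groups at random and count how often the
    # permuted statistic is at least as large as the observed one.
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        if energy_statistic(dist, idx[:n_old], idx[n_old:]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Because the permutation only reshuffles group membership, this particular sketch does not require the two datasets to be the same size.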

If your datasets are not the same size, you can use the cross-match test, although it does not appear to be very popular. Instead of the cross-match test, you can try up- or down-sampling your data so that both sets are the same size, and then use one of the approaches mentioned in the first paper.

edited Feb 11 '23 at 15:13

User1865345

answered Jun 21 '12 at 01:29

Amit Deshwar

You mention that if we have datasets of unequal size, we should use the cross-match test. However, in the paper you mention, they use equal-sized datasets and pair observations based on distances. Have you found any evidence of this being used? Even in the release notes for cross-match, the example uses equal-sized datasets. – lukeg Jul 24 '15 at 14:40

score 2 · Answer 2 · edited Feb 11 '23 at 15:11

Look up Hotelling's $T^2$, or, if you have really high-dimensional data, look at this.
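For reference, the standard two-sample form of the statistic is given below. The notation is my own: $n_1, n_2$ are the sample sizes, $p$ is the number of variables, $\bar{x}_1, \bar{x}_2$ are the sample mean vectors, and $S_1, S_2$ are the sample covariance matrices.

$$
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{x}_1 - \bar{x}_2)^\top S_p^{-1} (\bar{x}_1 - \bar{x}_2),
\qquad
S_p = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2}
$$

$$
\frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)\,p}\; T^2 \sim F_{p,\; n_1 + n_2 - p - 1}
$$

The $F$ reference distribution holds under the null hypothesis of equal means, assuming multivariate normal data with a common covariance matrix. Note that this compares mean vectors only, so two samples that differ in spread or dependence structure but share a mean would not be flagged.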

edited Feb 11 '23 at 15:11

User1865345

answered Jul 22 '12 at 05:50

kjetil b halvorsen
