Kolmogorov–Smirnov test for multivariate data

Question

I have a set of files, each belonging to a particular class and consisting of points selected at random from a dataset. Each row in a file contains the coordinates of one point in n-space. I'd like to compare the distributions in n-space across these files, and am inspired by the K-S test for comparing histograms. From what I've read, this method doesn't extend well to multivariate data. I had previously used PCA, but all of my variance collapsed into a single noisy dimension and clustering methods were useless.

My question: is there a reason I shouldn't just compute the K-S statistic on the histogram of each of the n dimensions and use the average as a metric for the goodness of fit? Is there a better method for comparing these distributions?
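To make the proposal concrete, here is a minimal sketch of the per-dimension average, assuming NumPy and SciPy are available. The helper name `mean_marginal_ks` and the synthetic samples are made up for illustration and are not from the original data. It also shows one possible reason for caution: the average only looks at the marginals, so two samples with the same marginals but a different dependence structure can score as nearly identical.

```python
import numpy as np
from scipy.stats import ks_2samp


def mean_marginal_ks(a, b):
    """Average the two-sample K-S statistic over each coordinate.

    a, b: arrays of shape (n_points, n_dims), one point per row.
    (Hypothetical helper, named here for illustration only.)
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    stats = [ks_2samp(a[:, j], b[:, j]).statistic for j in range(a.shape[1])]
    return float(np.mean(stats))


rng = np.random.default_rng(0)
n = 2000

# Synthetic 2-D samples: standard-normal marginals in both cases,
# but correlation +0.9 in one sample and -0.9 in the other.
a = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=n)
b = rng.multivariate_normal([0.0, 0.0], [[1.0, -0.9], [-0.9, 1.0]], size=n)

# A third sample with the same covariance as `a`, shifted along the first axis.
c = rng.multivariate_normal([1.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=n)

same_marginals = mean_marginal_ks(a, b)  # small, although the joint laws differ
shifted = mean_marginal_ks(a, c)         # clearly larger: a marginal has moved

print(f"opposite correlation: {same_marginals:.3f}")
print(f"shifted mean:         {shifted:.3f}")
```

So the averaged statistic picks up differences in the individual coordinates, but it says nothing about differences that only show up in how the coordinates vary together.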

score 3 · Answer 1 · answered Apr 19 '12 at 15:09

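I'd calculate the mean $\overline x$ and the covariance matrix $C$ of the joint data set, and then do a K-S test on the univariate quantity $V(x):=(x-\overline x)^TC^{-1}(x-\overline x)$ (the squared Mahalanobis distance of $x$ from the joint mean) evaluated on each of the parts. If the K-S test gives a significant difference between the parts, there is one. If it gives no significant difference, the test is to be regarded as inconclusive, since $V$ collapses each point to a single number and distributions can differ in ways that leave the distribution of $V$ unchanged.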
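A rough sketch of this recipe, assuming NumPy and SciPy and two parts stored as arrays with one point per row. The function name `mahalanobis_ks` and the synthetic data are only for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp


def mahalanobis_ks(a, b):
    """Two-sample K-S test on V(x) = (x - mean)^T C^{-1} (x - mean).

    The mean and covariance C are estimated from the joint (pooled) data,
    then V is evaluated separately on each part.
    (Hypothetical helper, named here for illustration only.)
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled = np.vstack([a, b])
    mean = pooled.mean(axis=0)
    cov = np.cov(pooled, rowvar=False)  # n_dims x n_dims covariance matrix

    def v(x):
        d = x - mean
        # Solve C y = d^T rather than forming C^{-1} explicitly.
        return np.einsum("ij,ji->i", d, np.linalg.solve(cov, d.T))

    return ks_2samp(v(a), v(b))


rng = np.random.default_rng(1)
n, dims = 2000, 3

# Synthetic parts: two drawn from the same law, one with twice the spread.
part_a = rng.normal(0.0, 1.0, size=(n, dims))
part_b = rng.normal(0.0, 1.0, size=(n, dims))
part_c = rng.normal(0.0, 2.0, size=(n, dims))

same = mahalanobis_ks(part_a, part_b)
different = mahalanobis_ks(part_a, part_c)

print(f"same law:      D={same.statistic:.3f}, p={same.pvalue:.3g}")
print(f"wider spread:  D={different.statistic:.3f}, p={different.pvalue:.3g}")
```

A small p-value for a pair of parts indicates a difference; a large one is inconclusive, as noted above, so this works as a one-sided screen rather than a proof that two parts match.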

score 3 · Answer 2 · dmckee --- ex-moderator kitten · 2012-01-10T23:02:35.617

ROOT supports Kolmogorov tests on higher-dimensional histograms, and the notes (for the 2D version) suggest that there is an ambiguity, which they deal with by punting: calculate it both ways. I don't know if the code contains any more details, but the comments sometimes have references to papers and the like.

There are some additional interesting comments in the notes to TH1::KolmogorovTest.

edited Jan 10 '12 at 23:02

answered Jan 10 '12 at 04:11

