8

I have a set of files, each consisting of points randomly selected from a dataset and each belonging to a particular class. Each row of a file contains the coordinates of one point in n-space. I'd like to compare the distributions in n-space across these files, and am inspired by the K-S test for comparing histograms. From what I've read, this method doesn't extend well to multivariate data. I had previously used PCA, but all of my variance collapsed into a single noisy dimension and clustering methods were useless.

My question: is there a reason I shouldn't just use the average of the K-S values across the histograms for each of the n dimensions as a goodness-of-fit metric? Is there a better method for comparing these distributions?

bab

2 Answers

3

I'd calculate the mean $\overline x$ and the covariance matrix $C$ of the joint data set, and then do a K-S test on the univariate quantity $V(x):=(x-\overline x)^TC^{-1}(x-\overline x)$ evaluated on the parts. If the K-S test gives a significant difference between the parts, there is one. If it gives no significant difference, the test is to be regarded as inconclusive.
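A minimal sketch of this procedure in Python, assuming NumPy and SciPy are available and that two parts are compared at a time. The function names `mahalanobis_sq` and `compare_parts` and the synthetic Gaussian data are made up for illustration, and using the pseudo-inverse in place of $C^{-1}$ is an added assumption to tolerate a nearly singular covariance matrix:

```python
import numpy as np
from scipy.stats import ks_2samp


def mahalanobis_sq(points, mean, cov_inv):
    """Return V(x) = (x - mean)^T C^{-1} (x - mean) for every row of points."""
    centered = points - mean
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)


def compare_parts(part_a, part_b):
    """Two-sample K-S test on V(x), with mean and covariance taken from the joint data."""
    joint = np.vstack([part_a, part_b])
    mean = joint.mean(axis=0)
    cov = np.cov(joint, rowvar=False)
    # Assumption: pinv instead of a plain inverse, in case C is close to singular.
    cov_inv = np.linalg.pinv(cov)
    v_a = mahalanobis_sq(part_a, mean, cov_inv)
    v_b = mahalanobis_sq(part_b, mean, cov_inv)
    return ks_2samp(v_a, v_b)


# Synthetic stand-ins for two files: rows are points, columns are coordinates.
rng = np.random.default_rng(0)
same_a = rng.normal(size=(500, 3))
same_b = rng.normal(size=(500, 3))
wide_b = rng.normal(scale=2.0, size=(500, 3))

result_same = compare_parts(same_a, same_b)
result_wide = compare_parts(same_a, wide_b)
print("same spread:     ", result_same.statistic, result_same.pvalue)
print("different spread:", result_wide.statistic, result_wide.pvalue)
```

With more than two parts, one option is to run this pairwise, or to test each part against the rest. As stated above, only a significant result is informative: two parts that differ in a way that leaves the distribution of $V(x)$ unchanged will not be separated by this test.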

Arnold Neumaier
3

ROOT supports Kolmogorov tests on higher-dimensional histograms, and the notes (for the 2D version) suggest that there is an ambiguity, which they deal with by punting: calculate it both ways. I don't know if the code contains any more details, but the comments sometimes have references to papers and the like.

There are some additional interesting comments in the notes to TH1::KolmogorovTest.