38

I'm looking for some good terminology to describe what I'm trying to do, to make it easier to look for resources.

So, say I have two clusters of points, A and B, where each point has two values, X and Y, and I want to measure the "distance" between A and B, i.e. how likely it is that they were sampled from the same distribution (I can assume that the distributions are normal). For example, if X and Y are correlated in A but not in B, the distributions are different.

Intuitively, I would get the covariance matrix of A, then look at how likely each point in B is to fit in there, and vice versa (probably using something like the Mahalanobis distance).
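
A minimal sketch of that intuition, assuming NumPy is available. The function names `mean_mahalanobis` and `symmetric_score` are made up for illustration, and averaging the two directions is just one arbitrary way to symmetrize, not an established statistic:

```python
import numpy as np

def mean_mahalanobis(points, reference):
    """Mean squared Mahalanobis distance of `points` to a Gaussian fitted on `reference`."""
    mu = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)  # rows are observations, columns are variables
    diff = points - mu
    # Solve cov @ x = diff.T instead of inverting cov explicitly.
    solved = np.linalg.solve(cov, diff.T).T
    d2 = np.einsum("ij,ij->i", diff, solved)  # one squared distance per point
    return d2.mean()

def symmetric_score(a, b):
    """Hypothetical symmetric score: average of the B-in-A and A-in-B fits."""
    return 0.5 * (mean_mahalanobis(a, b) + mean_mahalanobis(b, a))

# Illustrative data: X and Y are correlated in A but not in B.
rng = np.random.default_rng(0)
a = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=500)
b = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=500)

score_same = symmetric_score(a, a)
score_diff = symmetric_score(a, b)
```

When a cluster is compared with itself the score sits close to the number of variables, so values well above that suggest the two clusters do not share a covariance structure.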

But that is a bit "ad-hoc", and there is probably a more rigorous way of describing this (of course, in practice I have more than two datasets with more than two variables - I'm trying to identify which of my datasets are outliers).

Thanks!

Emile
  • 1,097

5 Answers

19

Hmm, the Bhattacharyya distance seems to be what I'm looking for, though the Hellinger distance works too.
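
For the normal case mentioned in the question, the Bhattacharyya distance has a closed form. A sketch assuming NumPy, with means and covariances estimated beforehand (for example with `np.mean` and `np.cov`); the function names here are illustrative only:

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between N(mu1, cov1) and N(mu2, cov2)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov = 0.5 * (cov1 + cov2)  # average covariance
    diff = mu1 - mu2
    # Term driven by the difference in means.
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Term driven by the difference in covariances, using log-determinants.
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term_cov = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term_mean + term_cov

def hellinger_from_bhattacharyya(d_b):
    """Hellinger distance via H^2 = 1 - exp(-D_B); clipped at 0 against rounding."""
    return np.sqrt(max(0.0, 1.0 - np.exp(-d_b)))

identity = np.eye(2)
d_shift = bhattacharyya_gaussian([0, 0], identity, [1, 0], identity)
d_corr = bhattacharyya_gaussian([0, 0], identity, [0, 0], [[1.0, 0.9], [0.9, 1.0]])
```

The second term is what picks up the "correlated in A but not in B" case: `d_corr` is positive even though both means are zero.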

Emile
  • 1,097
18

There is also the Kullback-Leibler divergence, which is related to the Hellinger distance you mention above.
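
Under the normality assumption from the question, the Kullback-Leibler divergence also has a closed form. A sketch assuming NumPy; `kl_gaussian` is an illustrative name, and the formula applies only when both distributions really are multivariate normal:

```python
import numpy as np

def kl_gaussian(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) in nats."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    cov0, cov1 = np.asarray(cov0, float), np.asarray(cov1, float)
    k = mu0.shape[0]  # number of variables
    diff = mu1 - mu0
    trace_term = np.trace(np.linalg.solve(cov1, cov0))  # tr(cov1^-1 cov0)
    maha_term = diff @ np.linalg.solve(cov1, diff)
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + maha_term - k + logdet1 - logdet0)

identity = np.eye(2)
kl_shift = kl_gaussian([0, 0], identity, [1, 0], identity)
kl_forward = kl_gaussian([0, 0], identity, [0, 0], 4 * identity)
kl_backward = kl_gaussian([0, 0], 4 * identity, [0, 0], identity)
```

Note that `kl_forward` and `kl_backward` differ: the divergence is not symmetric, so a symmetrized variant is usually needed before treating it as a distance between datasets.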

Gavin Simpson
  • 47,626
  • 3
    Can one calculate the Kullback-Leibler divergence of points without making an assumption about the underlying probability density the points came from? – Andre Holzner Nov 06 '10 at 16:22
14

Heuristic

  • Minkowski-form
  • Weighted-Mean-Variance (WMV)

Nonparametric test statistics

  • χ² (Chi Square)
  • Kolmogorov-Smirnov (KS)
  • Cramér-von Mises (CvM)

Information-theory divergences

  • Kullback-Leibler (KL)
  • Jensen–Shannon divergence (metric)
  • Jeffrey-divergence (numerically stable and symmetric)

Ground distance measures

  • Histogram intersection
  • Quadratic form (QF)
  • Earth Movers Distance (EMD)
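
As a rough illustration of how a few of the histogram-based measures above behave, here is a standard-library Python sketch. It assumes both inputs are already normalized histograms over the same bins; the function names are illustrative:

```python
import math

def kl_divergence(p, q):
    """Discrete KL(p || q) in nats; requires q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and finite even with empty bins."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]  # mixture of the two histograms
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def histogram_intersection(p, q):
    """Shared mass of two normalized histograms: 1 if identical, 0 if disjoint."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

# Two histograms that overlap only in the middle bin.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]

js_pq = js_divergence(p, q)
overlap = histogram_intersection(p, q)
```

Plain KL would fail on this pair because `q` has an empty bin where `p` has mass, which is one reason the symmetric variants are often preferred for empirical histograms. The square root of the Jensen-Shannon divergence is the quantity that satisfies the metric axioms.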
skyde
  • 465
10

The most complete survey is provided in Statistical Inference Based on Divergence Measures by Leandro Pardo, Complutense University, Chapman Hall 2006.

whuber
  • 322,774
1

A few more measures of "statistical difference":

  • Permutation test (by Fisher)
  • Central Limit Theorem & Slutsky’s theorem
  • Mann-Whitney-Wilcoxon test
  • Anderson–Darling test
  • Shapiro–Wilk test
  • Hosmer–Lemeshow test
  • Kuiper's test
  • kernelized Stein discrepancy
  • Jaccard similarity
  • Also, hierarchical clustering deals with similarity measures between groups. The most popular measures of group similarity are perhaps the single linkage, complete linkage, and average linkage.
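
Of the items above, the permutation test is perhaps the easiest to sketch without any library. The version below uses only the Python standard library and takes the absolute difference in means as its test statistic, which is an assumption made for illustration; any other statistic (including a multivariate distance) could be substituted. `permutation_test` is an illustrative name:

```python
import random

def _mean(values):
    return sum(values) / len(values)

def permutation_test(x, y, n_permutations=10_000, seed=0):
    """Approximate two-sided p-value for a difference in means between x and y."""
    rng = random.Random(seed)
    observed = abs(_mean(x) - _mean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabeling of the pooled sample
        diff = abs(_mean(pooled[:n_x]) - _mean(pooled[n_x:]))
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (n_permutations + 1)

x = list(range(10))
y = [v + 100 for v in x]  # same shape, shifted far away

p_shifted = permutation_test(x, y, n_permutations=2000)
p_same = permutation_test(x, x, n_permutations=2000)
```

A small p-value suggests the two samples are unlikely to share a distribution (at least with respect to the chosen statistic), while comparing a sample with itself gives a p-value of 1.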