
Hello Stack Exchange community,

I'm somewhat unsure how to proceed in this scenario. I am running an experiment that retrieves data of interest and stores it as an array. The experimental conditions differ slightly, and for my specific application it is important to measure the heterogeneity of my data. I was asked to use a distance metric on the 2D histograms generated after the data analysis.

However, a literature search turns up a wide array of options: the Earth Mover's Distance, the Jensen-Shannon metric (the square root of the Jensen-Shannon divergence), the Bhattacharyya distance, the Minkowski distance, etc.

I understand that these distances are defined differently, quantitatively speaking, and thus have different implications, but for my case, which metric would be effective at quantifying differences between histograms (i.e., at reaching the end goal of assessing heterogeneity)?
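
For context, here is a rough sketch of how I could compute some of these candidates on flattened, normalized 2D histograms (NumPy/SciPy; the histograms below are placeholder arrays, and a proper 2D Earth Mover's Distance would need an optimal-transport solver rather than the 1D helper shown):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon, minkowski
from scipy.stats import wasserstein_distance

# Placeholder 2D histograms for two experimental conditions (same binning assumed).
rng = np.random.default_rng(0)
hist_a = rng.random((32, 32))
hist_b = rng.random((32, 32))

# Flatten and normalize so each histogram sums to 1 (treat bins as a discrete distribution).
pa = hist_a.ravel() / hist_a.sum()
pb = hist_b.ravel() / hist_b.sum()

# Jensen-Shannon metric: SciPy returns the square root of the JS divergence directly.
js_metric = jensenshannon(pa, pb, base=2)

# Minkowski distance of order 2 (Euclidean) between the normalized bin counts.
mink = minkowski(pa, pb, 2)

# Bhattacharyya distance via the Bhattacharyya coefficient.
bhattacharyya = -np.log(np.sum(np.sqrt(pa * pb)))

# 1D Earth Mover's Distance on the row marginals only; a true 2D EMD needs an
# optimal-transport solver (e.g. the POT package).
bin_positions = np.arange(hist_a.shape[0])
emd_rows = wasserstein_distance(bin_positions, bin_positions,
                                u_weights=hist_a.sum(axis=1),
                                v_weights=hist_b.sum(axis=1))

print(js_metric, mink, bhattacharyya, emd_rows)
```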

Thank you

1 Answer


There are two other metrics that reportedly work well with histograms:

  1. Chi-square distance $ \chi^2 (x,y)=\frac{1}{2} \sum _{i=1} ^{d} \frac {(x_i - y_i)^2} {x_i+y_i}$
  2. Histogram intersection kernel (similarity) $k(x,y)=\sum _{i=1} ^{d} \min (x_i,y_i)$

I used these in my PhD thesis for indoor image-based localization. The image descriptors were essentially histograms, and both measures proved useful.
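
If it helps, here is a minimal NumPy sketch of both measures on flattened histograms (the function names and the small epsilon guarding against empty bins are my own additions):

```python
import numpy as np

def chi_square_distance(x, y, eps=1e-12):
    """Chi-square distance between two histograms of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # eps avoids division by zero when a bin is empty in both histograms.
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def histogram_intersection(x, y):
    """Histogram intersection kernel (a similarity, not a distance)."""
    return np.sum(np.minimum(x, y))

# Example with two flattened, normalized 2D histograms.
rng = np.random.default_rng(1)
h1 = rng.random((16, 16)).ravel()
h2 = rng.random((16, 16)).ravel()
h1 /= h1.sum()
h2 /= h2.sum()

print(chi_square_distance(h1, h2))
print(histogram_intersection(h1, h2))  # equals 1 only when the histograms are identical
```

For normalized histograms the intersection lies in [0, 1], so 1 - k(x, y) can serve as a dissimilarity if you need a distance-like quantity.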

  • I am interested in looking at differences in data distributions for another problem, a deep learning model on imagery in my case. I want to see how the distribution of pixel values in the data used to train the model differs from that of the data the model is applied to. While trying to find the best way to do so, I stumbled upon your answer, which led me to try to figure out what image localization is and whether these metrics could apply in my case. I found a paper, but I am still unsure exactly what IL is and how these metrics helped you. Could you expand on this and on IL in general? Thanks! – user20408 Aug 09 '22 at 16:48
  • The main problem I am seeing is that these two datasets, the training set and the set the model is applied to, have different numbers of pixels/values, so it seems that neither of these metrics would work as is. – user20408 Aug 09 '22 at 16:51