I am looking into calculating distances between vectors for some data analysis. One question I have is whether I should use actual count data or convert to discrete probabilities.
For some distances, the method is clear from the underlying theory (e.g. the Hellinger Distance). However, for other distances, which approach to use is not so clear. I have different references that use one or the other approach. It seems to be quite a subjective call.
There are many examples I could provide but, for the sake of space and simplicity, let’s take the Soergel distance $(d_s)$ here. (I understand this is a generalised version of the Jaccard distance.)
$$d_s(\mathbf{x},\mathbf{y})=1-\frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$$
Firstly, let’s play with the following vectors of count data (taken from survey data): $\mathbf{x}=(5,13,17,14,7)$ and $\mathbf{y}=(12,10,15,41,19)$. Evaluating the formula, we get $d_s = 0.500$.
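Spelling out the arithmetic for the count vectors, as a check on the value quoted (the element-wise minima and maxima are read directly off $\mathbf{x}$ and $\mathbf{y}$):

$$\sum_i \min(x_i, y_i) = 5+10+15+14+7 = 51, \qquad \sum_i \max(x_i, y_i) = 12+13+17+41+19 = 102$$

$$d_s = 1-\frac{51}{102} = 0.500$$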
Now, converting the counts to discrete probabilities (or proportions, if one prefers) by dividing each vector by its own total, we have $\mathbf{\hat x}=(0.089,0.232,0.304,0.250,0.125)$ and $\mathbf{\hat y}=(0.124,0.103,0.155,0.423,0.196)$, rounded to three decimal places. Evaluating the formula again, we get $d_s = 0.435$.
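For anyone who wants to reproduce the two figures, here is a minimal Python sketch. The function names `soergel` and `to_proportions` are my own labels for illustration, not taken from any library, and the code assumes non-negative vectors of equal length whose totals are non-zero.

```python
def soergel(x, y):
    """Soergel distance: 1 - sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return 1 - num / den


def to_proportions(v):
    """Divide each entry by the vector total so the entries sum to 1."""
    total = sum(v)
    return [a / total for a in v]


x = [5, 13, 17, 14, 7]
y = [12, 10, 15, 41, 19]

# Distance on the raw counts
d_counts = soergel(x, y)

# Distance after converting each vector to proportions
d_props = soergel(to_proportions(x), to_proportions(y))

print(round(d_counts, 3), round(d_props, 3))  # 0.5 0.435

# Illustration only: doubling every count changes the count-based distance
# but not the proportion-based one.
x2 = [2 * a for a in x]
d_scaled_counts = soergel(x, x2)
d_scaled_props = soergel(to_proportions(x), to_proportions(x2))
```

The last two lines are there only to show why the two versions can disagree: comparing $\mathbf{x}$ with $2\mathbf{x}$ gives 0.5 on counts and 0 on proportions, so the count-based value also reflects the difference in totals, whereas the proportion-based value reflects only the difference in shape.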
So which is the ‘true’ Soergel distance between the vectors? Or is the respective distance ‘valid’ for each approach, which means stating the context is as critical as stating the distance?