2

I am looking into calculating distances between vectors for some data analysis. One question I have is whether I should use actual count data or convert to discrete probabilities.

For some distances, the method is clear from the underlying theory (e.g. the Hellinger Distance). However, for other distances, which approach to use is not so clear. I have different references that use one or the other approach. It seems to be quite a subjective call.

There are many examples I could provide so, for the sake of space and simplicity, let’s take the Soergel Distance $(d_s)$ here. (I understand this is a generalised version of the Jaccard Distance).

$$d_s(\mathbf{x},\mathbf{y})=1-\frac{\sum_i \min(x_i , y_i)}{\sum_i \max(x_i , y_i)}$$

Firstly, let’s play with the following vectors using count data (taken from survey data): $\mathbf{x}=(5,13,17,14,7)$ and $\mathbf{y}=(12,10,15,41,19)$. Completing the equation, we get $d_s = 0.500$
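For anyone who wants to reproduce that figure, here is a minimal sketch in plain Python. The function name `soergel_distance` and the handling of two all-zero vectors are my own assumptions, not part of any standard library.

```python
def soergel_distance(x, y):
    """Soergel distance: 1 - sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    if denominator == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 0.0
    return 1.0 - numerator / denominator


# Count vectors from the question.
x = [5, 13, 17, 14, 7]
y = [12, 10, 15, 41, 19]

d_counts = soergel_distance(x, y)
print(f"{d_counts:.3f}")  # 0.500 (sum of minima 51, sum of maxima 102)
```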

Now converting the count values to discrete probabilities (or, proportions, if one prefers), we have $\mathbf{\hat x}=(0.089,0.232,0.304,0.250,0.125)$ and $\mathbf{\hat y}=(0.124,0.103,0.155,0.423,0.196)$. Completing the equation again, we get $d_s = 0.435$
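The proportion version can be checked the same way. This is only a sketch: `to_proportions` and `soergel_distance` are illustrative helper names I made up, and each vector is simply divided by its own total (56 for $\mathbf{x}$, 97 for $\mathbf{y}$).

```python
def soergel_distance(x, y):
    """Soergel distance: 1 - sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return 1.0 - numerator / denominator


def to_proportions(v):
    """Divide each count by the vector's own total so the entries sum to 1."""
    total = sum(v)
    return [a / total for a in v]


x = [5, 13, 17, 14, 7]
y = [12, 10, 15, 41, 19]

x_hat = to_proportions(x)
y_hat = to_proportions(y)

d_props = soergel_distance(x_hat, y_hat)
print([round(p, 3) for p in x_hat])  # [0.089, 0.232, 0.304, 0.25, 0.125]
print(f"{d_props:.3f}")              # 0.435
```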

So which is the ‘true’ Soergel distance between the vectors? Or is the respective distance ‘valid’ for each approach, which means stating the context is as critical as stating the distance?

anna6931

2 Answers

1

I know nothing of Soergel distance, but clearly $d_s$ is invariant to a common multiplicative constant, that is, $d_s(x,y) = d_s(\alpha x , \alpha y)$ for any $\alpha > 0$. When you convert to proportions, however, you are not multiplying both vectors by the same constant $\alpha$: each of $x$ and $y$ is divided by its own total (56 and 97 in your example). So it is not surprising that the distance is different.
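A quick numerical check of that point, as a sketch only: the helper name `soergel_distance` and the factor `alpha = 3` are arbitrary choices of mine, and the vectors are the ones from the question.

```python
def soergel_distance(x, y):
    """Soergel distance: 1 - sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return 1.0 - numerator / denominator


x = [5, 13, 17, 14, 7]
y = [12, 10, 15, 41, 19]
alpha = 3  # arbitrary common factor, chosen only for illustration

d_original = soergel_distance(x, y)

# Same constant applied to both vectors: the distance is unchanged.
d_common = soergel_distance([alpha * a for a in x], [alpha * b for b in y])

# Different constants (1/sum(x) and 1/sum(y)): the distance changes.
d_separate = soergel_distance([a / sum(x) for a in x], [b / sum(y) for b in y])

print(f"{d_original:.3f} {d_common:.3f} {d_separate:.3f}")  # 0.500 0.500 0.435
```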

That is the math. What does it mean? Most distances are not invariant to rescaling. Think about Euclidean distance: the distance for the counts is different from the distance for the proportions. So a distance almost always depends on how the components are measured, and that is also true here. The Soergel distance for counts is different from the Soergel distance for proportions.

  • So the distance depends on the approach (i.e. using counts or proportions) - I understand that. However, how do we determine which approach is the 'most appropriate' when comparing vectors? – anna6931 Oct 21 '22 at 08:40
1

As said in the answer by Jacques, the distance between the vectors will depend on whether counts or proportions are used. I have not come across any reference that explicitly states which approach is the best or, more correctly, which is the most appropriate.

A couple of observations may be able to guide you. Firstly, the distance for counts depends strongly on (is biased by) the total count of observations in each vector. It is possible for every minimum value to come from vector $\mathbf{A}$ and every maximum value to come from vector $\mathbf{B}$. For example, let $\mathbf{A}=(5,7,3,6,10)$ and $\mathbf{B}=(7,12,9,12,14)$. [In a (multi)set theoretic sense, $A \subset B$]. The Soergel Distance for the count data between $\mathbf{A}$ and $\mathbf{B}$ is 0.426.

When converting to proportions, the Soergel Distance comes out to be 0.179. Given that this distance is bounded in $[0,1]$, this is clearly a 'notable' difference. Also, from a set theoretic view, under the proportional approach $A \not\subset B$. Furthermore, it seems counterintuitive that the count vectors, where one is a subset of the other, are 'further apart' than the proportion vectors, which only partially overlap.
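Both figures, and the change in the subset relation, can be reproduced with a short sketch. The helper names (`soergel_distance`, `to_proportions`, `is_dominated`) are mine, and I am reading $A \subset B$ as "every element of $\mathbf{A}$ is no larger than the matching element of $\mathbf{B}$".

```python
def soergel_distance(x, y):
    """Soergel distance: 1 - sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return 1.0 - numerator / denominator


def to_proportions(v):
    """Divide each count by the vector's own total."""
    total = sum(v)
    return [a / total for a in v]


def is_dominated(x, y):
    """True if x_i <= y_i for every i (my reading of the multiset 'subset')."""
    return all(a <= b for a, b in zip(x, y))


A = [5, 7, 3, 6, 10]
B = [7, 12, 9, 12, 14]
A_hat, B_hat = to_proportions(A), to_proportions(B)

print(is_dominated(A, B), f"{soergel_distance(A, B):.3f}")                  # True 0.426
print(is_dominated(A_hat, B_hat), f"{soergel_distance(A_hat, B_hat):.3f}")  # False 0.179
```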

In my view, I would choose the proportions approach, as it is less influenced by differences in the total counts of each vector. Proportions, being a proxy for probabilities, also offer better comparability. The count approach is more applicable when the total counts of each vector are 'roughly similar'. However, as to what counts as 'roughly similar', that's where subjectivity and a good dose of pragmatism come in, not to mention professional/academic judgement.
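To illustrate the influence of the totals, here is a hypothetical sketch (not real survey data): take one profile and compare it against copies of itself scaled by a factor $k$, so the shape is identical and only the total count differs. The helper names and the chosen values of $k$ are my own.

```python
def soergel_distance(x, y):
    """Soergel distance: 1 - sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return 1.0 - numerator / denominator


def to_proportions(v):
    """Divide each count by the vector's own total."""
    total = sum(v)
    return [a / total for a in v]


profile = [5, 7, 3, 6, 10]  # vector A from above, used as the base shape

results = {}
for k in (1, 2, 5, 10):  # hypothetical scale factors
    scaled = [k * a for a in profile]  # same shape, k times the total count
    d_counts = soergel_distance(profile, scaled)
    d_props = soergel_distance(to_proportions(profile), to_proportions(scaled))
    results[k] = (d_counts, d_props)
    print(k, round(d_counts, 3), round(d_props, 3))
```

For the counts the distance works out to $1 - 1/k$, so it grows with the gap in totals even though the shape never changes, while for the proportions it stays at 0.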

Mari153
  • Thanks. Overall the best of the two answers, especially around the influence that the difference in the total counts of each vector has on the result. I can see how this affects the distance now. I'm still interested to find a definitive reference on which is the best approach and why. – anna6931 Oct 24 '22 at 07:23