
I have an origin-destination (OD) matrix describing weekly flows of people between every pair of nodes (cities). The number of people traveling from city $i$ to city $j$ in a given week is $OD_{ij}$. My goal is to define a distance metric, induced by these flows, that lets me cluster the cities into groups and compare those groups with the ones obtained by clustering on geographical distance. Two cities should be close under my distance if many people travel between them.

First, I need a symmetric matrix, since distances are symmetric. I solved this easily by computing, for each pair of cities, $$ OD_{ij}^{'} = \frac{OD_{ij}+OD_{ji}}{2} $$
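A minimal sketch of this symmetrization step, in plain Python for illustration only. The helper name `symmetrize` and the three-city flow values are made up for the example and are not part of the original data.

```python
def symmetrize(od):
    """Return the matrix of averaged flows (od[i][j] + od[j][i]) / 2."""
    n = len(od)
    return [[(od[i][j] + od[j][i]) / 2 for j in range(n)] for i in range(n)]


# Toy weekly flows between three cities (hypothetical numbers).
od = [
    [0, 120, 10],
    [80, 0, 30],
    [20, 50, 0],
]

od_sym = symmetrize(od)
for row in od_sym:
    print(row)
```

Averaging is only one possible choice; summing the two directions would give the same ordering of pairs, since it just rescales every entry by a factor of 2.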

My difficulty is how to define the mobility-induced distance metric. This metric should work well with a hierarchical clustering method. I thought about $$ d_{ij} = \frac{1}{1 + OD_{ij}^{'}} $$
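As a rough illustration, the candidate $d_{ij}$ can be computed from a symmetric flow matrix and then checked against the triangle inequality. The function names, the toy flows, and the choice of setting the diagonal to zero are assumptions made for this sketch.

```python
from itertools import permutations


def flow_distance(od_sym):
    """d[i][j] = 1 / (1 + od_sym[i][j]); the diagonal is set to 0 by assumption."""
    n = len(od_sym)
    return [
        [0.0 if i == j else 1.0 / (1.0 + od_sym[i][j]) for j in range(n)]
        for i in range(n)
    ]


def triangle_violations(d, tol=1e-12):
    """List triples (i, j, k) for which d[i][j] > d[i][k] + d[k][j]."""
    n = len(d)
    return [
        (i, j, k)
        for i, j, k in permutations(range(n), 3)
        if d[i][j] > d[i][k] + d[k][j] + tol
    ]


# Hypothetical symmetric flows: cities 0 and 2 exchange nobody directly,
# but both are strongly connected to city 1.
od_sym = [
    [0, 100, 0],
    [100, 0, 100],
    [0, 100, 0],
]

d = flow_distance(od_sym)
violations = triangle_violations(d)
print(violations)
```

In this toy case the direct distance between cities 0 and 2 is 1, while the detour through city 1 costs only 2/101, so the pair (0, 2) is reported as a violation in both directions.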

This metric has the desired property of taking a low value for highly connected cities and a high value for weakly connected ones. However, it does not satisfy the triangle inequality. Moreover, it does not work well with any linkage in hierarchical clustering: when computing the distance between a newly formed cluster $A$ and every other cluster $B$, I would like to use $$ d_{AB} = \frac{1}{1 + \sum_{i\in A, j \in B}{OD_{ij}^{'}}} $$ but none of the classical linkages (complete, single, average, Ward, etc.) lets me do anything remotely similar.
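One way to get exactly this cluster-level distance is to write the agglomeration loop by hand instead of relying on a built-in linkage. The following is a minimal, unoptimized sketch in plain Python; `flow_agglomerate` and the four-city flows are hypothetical, and the brute-force search over cluster pairs is only reasonable for small matrices.

```python
def flow_agglomerate(od_sym, n_clusters):
    """Greedy agglomeration using d_AB = 1 / (1 + total flow between A and B).

    Returns the final clusters and the list of merges (A, B, d_AB).
    """
    clusters = [[i] for i in range(len(od_sym))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Total flow between every city in cluster a and every city in b.
                flow = sum(od_sym[i][j] for i in clusters[a] for j in clusters[b])
                dist = 1.0 / (1.0 + flow)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), dist))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


# Hypothetical symmetric flows: cities 0-1 and 2-3 form two tight pairs.
od_sym = [
    [0, 90, 5, 1],
    [90, 0, 4, 2],
    [5, 4, 0, 70],
    [1, 2, 70, 0],
]

clusters, merges = flow_agglomerate(od_sym, n_clusters=2)
print(clusters)
for left, right, dist in merges:
    print(left, right, round(dist, 4))
```

Since $d_{AB}$ is a decreasing function of the summed flow, minimizing it is the same as merging the pair of clusters with the largest total flow between them. That sum is the numerator of average linkage applied to similarities, without the division by $|A||B|$, so it tends to favor merging clusters that are already large; whether that is desirable here is a modeling choice.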

Does anyone have thoughts on this problem and on how to define the distance metric? I am also having difficulty finding papers on studies that do something similar. Any help would be greatly appreciated.

EFG1595
  • Though not an answer, this may help you sort out among the linkage methods and what distances they require. https://stats.stackexchange.com/a/217742/3277 – ttnphns Nov 11 '22 at 23:05
  • Why are you after a distance (dissimilarity)? Why not use the original flow as a similarity? The average, complete, and single methods are equally suitable for similarities and dissimilarities. – ttnphns Nov 11 '22 at 23:08
  • @ttnphns, thank you for the informative post about linkage methods. I am trying to transform my similarities into distances because hierarchical clustering (I am using the hclust function in R) works by aggregating the closest nodes, i.e., the ones with the lowest value in the distance matrix I pass to hclust. Thus, I need to transform my similarity matrix into a distance matrix to give to hclust. Do you have something different in mind? I am very curious about it. – EFG1595 Nov 12 '22 at 09:56
  • It is strange that hclust works only with distances (I'm not an R user, so I can't comment). Anyway, if you decide that the raw flow is a suitable similarity measure for you, you could linearly convert it into a distance by negating it and then adding a constant to make the values positive. The complete, single, and average methods will then yield the same clustering results as they would with the similarities. – ttnphns Nov 12 '22 at 14:25
  • Indeed, @ttnphns's suspicion is right; you can use a dissimilarity matrix in hclust and in any distance-based clustering method. The reason we prefer distances is that in some rare cases dissimilarities may lead to a "strange" dendrogram. How to convert a dissimilarity matrix into a distance matrix is a million-dollar question. I'd be happy to see someone come up with a recipe for it. – utobi Nov 12 '22 at 20:33
  • Over the past few days, I found some helpful tips in Johnson and Wichern's book Applied Multivariate Statistical Analysis, chapter 12. It deals with the problem of converting similarities into dissimilarities by defining a monotone function of the similarity, like the one I suggested in my question. To convert these into distances, addressing @utobi's point, I was thinking about applying Multidimensional Scaling (or better, Nonmetric Multidimensional Scaling). – EFG1595 Nov 14 '22 at 14:05
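The linear conversion suggested in the comments (negate the flow, then add a constant) can be sketched as follows. This is plain Python for illustration, not R's hclust; `merge_order`, the toy flows, and the constant `max + 1` are assumptions made for the example. The sketch runs a naive agglomeration once on the similarities and once on the converted distances, and compares the merge sequences for single, complete, and average linkage.

```python
def merge_order(m, linkage, pick):
    """Naive agglomeration on matrix m.

    `linkage` reduces the pairwise values between two clusters to one number;
    `pick` is min for distances or max for similarities.
    """
    clusters = [(i,) for i in range(len(m))]
    order = []
    while len(clusters) > 1:
        pairs = [
            (linkage([m[i][j] for i in A for j in B]), A, B)
            for idx, A in enumerate(clusters)
            for B in clusters[idx + 1:]
        ]
        _, A, B = pick(pairs, key=lambda p: p[0])
        order.append((A, B))
        clusters = [c for c in clusters if c not in (A, B)] + [A + B]
    return order


def mean(values):
    return sum(values) / len(values)


# Hypothetical symmetric flows used as similarities.
s = [
    [0, 90, 5, 1],
    [90, 0, 4, 2],
    [5, 4, 0, 70],
    [1, 2, 70, 0],
]

# Linear conversion: d = c - s with c chosen so that all off-diagonal d > 0.
c = max(max(row) for row in s) + 1
d = [[0 if i == j else c - s[i][j] for j in range(len(s))] for i in range(len(s))]

# (linkage on distances, linkage on similarities) for each method.
linkages = {"single": (min, max), "complete": (max, min), "average": (mean, mean)}
same = {
    name: merge_order(d, on_d, min) == merge_order(s, on_s, max)
    for name, (on_d, on_s) in linkages.items()
}
print(same)
```

On this toy matrix the three linkages merge the clusters in the same order on both inputs, which is what one would expect from a transformation that only reverses the ordering of the pairwise values. The example has no tied values; with ties, the result may depend on how a given implementation breaks them.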

0 Answers