I'm trying to replicate a clustering methodology described in a paper as:

"We define the heterogeneity metric within a cluster to be the average of all-pair Jaccard distances, and at each step merge two clusters if the heterogeneity of the resultant cluster is below a specified threshold."

There is a footnote: "We set the threshold to zero."

I'm using scipy. My question isn't about scipy itself, but I'll describe the steps I've taken using that API.

  1. I start with an array of input examples x
  2. I create a distance matrix y where y[i,j] = jaccard(x[i],x[j])
  3. I compute the linkage z=linkage(y, method='average')
  4. I compute the clusters fcluster(z, t=0.0, criterion='distance')
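
For reference, here is a minimal sketch of the four steps above. The toy array x is made up for illustration (it is not my real data), and I'm assuming boolean feature vectors. Note that pdist returns the condensed distance vector that linkage expects as input.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: rows 0 and 1 are identical, as are rows 2 and 3.
x = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 0, 1, 0],
    ],
    dtype=bool,
)

# Step 2: pairwise Jaccard distances in condensed (1-D) form.
y = pdist(x, metric="jaccard")

# Step 3: average-linkage hierarchy built from the condensed distances.
z = linkage(y, method="average")

# Step 4: flat clusters whose cophenetic distance does not exceed t.
labels = fcluster(z, t=0.0, criterion="distance")
print(labels)
```

On this toy input the duplicated rows end up sharing a label and the remaining row stays on its own, which is the behaviour I expected from t=0.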

What I'm uncertain about is the authors' description that clusters are merged when the heterogeneity of the resultant cluster is below a specified threshold (which was zero).

I understand that t is a threshold on the cophenetic distance. If a cluster contains two identical points (i.e. the Jaccard distance between them is zero) and no other points, then its heterogeneity is zero. Am I correct in assuming that a threshold (cophenetic distance) of zero should therefore identify all clusters that contain only points whose pairwise Jaccard distance is zero?

Or is there a different way to automatically extract clusters where heterogeneity == 0?
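
One alternative I can think of, sketched below under my own reading of the paper's definition: since distances are non-negative, an average of zero means every pair in the cluster is at distance zero, so for boolean vectors the heterogeneity-zero clusters are just the groups of identical rows. The heterogeneity helper and the toy array are my own illustration and do not come from the paper. I'm also assuming a singleton cluster counts as heterogeneity 0. This within-cluster average is not necessarily the same quantity as the merge height produced by method='average', which averages only the pairs that span the two clusters being merged.

```python
import numpy as np
from scipy.spatial.distance import pdist


def heterogeneity(points):
    """Average of all pairwise Jaccard distances within one cluster."""
    if len(points) < 2:
        return 0.0  # assumption: a singleton is treated as perfectly homogeneous
    return float(pdist(points, metric="jaccard").mean())


# Hypothetical toy data: rows 0 and 1 are identical, as are rows 2 and 3.
x = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 0, 1, 0],
    ],
    dtype=bool,
)

# Group identical rows: each distinct row pattern becomes one cluster label.
_, inverse = np.unique(x, axis=0, return_inverse=True)
labels = np.ravel(inverse)

# Check the heterogeneity of every resulting cluster.
for k in np.unique(labels):
    members = x[labels == k]
    print(k, len(members), heterogeneity(members))
```

This avoids the dendrogram entirely, but it only covers the threshold-zero case; for any positive threshold the merge order would matter again.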

The main reason for my confusion is the authors' conclusion that the accuracy of their method is around 90%, whereas I find that setting t=0 always returns singleton clusters. Furthermore, it's quite hard to find a value of t that works well in all cases.
