I'm trying to replicate a clustering methodology described in a paper as follows:
"We define the heterogeneity metric within a cluster to be the average of all-pair Jaccard distances, and at each step merge two clusters if the heterogeneity of the resultant cluster is below a specified threshold."
There is a footnote: "We set the threshold to zero."
I'm using SciPy. My question isn't about SciPy itself, but I'll describe the steps I've taken using that API.
- I start with an array of input examples, x.
- I create a distance matrix y, where y[i,j] = jaccard(x[i], x[j]).
- I compute the linkage: z = linkage(y, method='average').
- I compute the clusters: fcluster(z, t=0.0, criterion='distance').
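For concreteness, here is a minimal sketch of the steps above. The data in x is made up purely for illustration, and I'm assuming the distances are passed to linkage in condensed form (as returned by pdist), which is the form linkage expects for precomputed distances; a square 2-D array would be interpreted as raw observations instead.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Made-up boolean examples: rows 0 and 1 are identical, rows 2 and 3 are identical.
x = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 0, 1, 0],
    ],
    dtype=bool,
)

# Condensed vector of pairwise Jaccard distances (one entry per pair i < j).
y = pdist(x, metric="jaccard")

# Average-linkage hierarchy built from the condensed distances.
z = linkage(y, method="average")

# Flat clusters whose members have cophenetic distance no greater than t.
labels = fcluster(z, t=0.0, criterion="distance")
print(labels)
```

On this toy input I would expect the two pairs of identical rows to share a label and the last row to stay on its own.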
What I'm uncertain about is the authors' description that clusters are merged when the heterogeneity of the resultant cluster is below a specified threshold (which was zero).
I understand that t is the threshold on the cophenetic distance. Am I correct in assuming that if a cluster contains two identical points (i.e. the Jaccard distance between them is zero) and no other points, then its heterogeneity is zero, and hence that a threshold (cophenetic distance) of zero should identify all clusters that contain only points with a Jaccard distance of zero between them?
Or is there a different way to automatically extract clusters where heterogeneity==0?
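To make the check explicit, this is a rough sketch of how I read the paper's heterogeneity metric, computed per flat cluster. The helper name heterogeneity, the toy data, and the labels array are my own placeholders, and treating a singleton cluster as having heterogeneity zero is an assumption on my part, since the paper's definition doesn't cover a cluster with no pairs.

```python
import numpy as np
from scipy.spatial.distance import pdist


def heterogeneity(points):
    """Mean of all pairwise Jaccard distances among the rows of `points`."""
    if len(points) < 2:
        # Assumption: a singleton has no pairs, so treat its heterogeneity as 0.
        return 0.0
    return float(pdist(points, metric="jaccard").mean())


# Made-up boolean examples and hypothetical flat-cluster labels for them.
x = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 0, 1, 0],
    ],
    dtype=bool,
)
labels = np.array([1, 1, 2, 2, 3])

# Heterogeneity of each cluster, keyed by cluster label.
per_cluster = {int(c): heterogeneity(x[labels == c]) for c in np.unique(labels)}
print(per_cluster)

# Heterogeneity of a candidate merge of two non-identical rows, for comparison.
print(heterogeneity(x[[0, 4]]))
```

A cluster made only of identical rows comes out at zero here, while any cluster mixing distinct rows comes out strictly positive.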
The main reason for my confusion is the authors' conclusion that the accuracy of their method is around 90%, whereas I find that setting t=0 always returns singleton clusters. Furthermore, it's quite hard to find a t that works well in all cases.