1

How would one code the following issue:

The columns represent the k groupings from the k cluster algorithms. The rows are the N(N-1)/2 pairs of data points. If an algorithm groups a pair together (i,j) = 1 and 0 otherwise. This results in a binary matrix of size N(N-1)/2 x k. Agreement occurs if algorithms both group a pair of data points together.

How would I calculate, i.e. program, this number for all k combinations of algorithms?

https://link.springer.com/article/10.1007/BF01908075

a in this case represents the pairs of data points grouped together by algorithms 1, 2, 3, ..., k or any other combination of algorithms, for example, 2, 3 and k or 1, 2, 3 and 4. This will allow us to make an UpSet plot:

https://en.wikipedia.org/wiki/UpSet_Plot#:~:text=UpSet%20plots%20are%20a%20data,sets%20(or%20vice%20versa)

and

https://kieranhealy.org/blog/archives/2020/04/16/upset-plots/

  • Comembership confusion matrix - which count is a pair of objects - is a relation between two partitions, a 2x2 frequency table. And I could not understand your setting since I don't see a relation between partitions (columns of you data). Will you explain, in your question, in greater detail what you want to do? – ttnphns Dec 06 '22 at 19:41
  • @ttnphns a in this case represents the pairs of data points grouped together by algorithms 1, 2, 3, ..., k or any other combination of algorithms, for example, 2, 3 and k or 1, 2, 3 and 4. This will allow us to make an upSet plot. – Sean_TBI_Research Dec 06 '22 at 19:48
  • @ttnphns Hence, we compare more than two partitions. The binary matrix is preliminary to the frequency table from which we infer a. We would continue doing this until we looped over all algorithm combinations: for k = 3 we have four values for a. – Sean_TBI_Research Dec 06 '22 at 20:02
  • So, you extract some combination of columns from your binary matrix and you want to know how many rows in this extracted submatrix is full of 1s. Right? – ttnphns Dec 06 '22 at 20:02
  • @ttnphns Correct – Sean_TBI_Research Dec 06 '22 at 20:04
  • @ttnphns See also last hyperlink. – Sean_TBI_Research Dec 06 '22 at 20:12
  • I see. OK, that will be an entirely programming (not sratistical) question/task much tied with the language and functions you are going to use. The question is how to sift all the many combinations most quickly, an that might depend programming approaches to process arrays. You task is very similar to what TURF analysis in marketing is doing, so try to find some ready solutions on that side. If k is greater than 15 or 17, all combinations may take many minutes/hours. – ttnphns Dec 06 '22 at 20:12
  • @ttnphns I understand. Python is the language. What is the general formula for all combinations for k columns? I should search the web for TURF analysis? – Sean_TBI_Research Dec 06 '22 at 20:14
  • I am not examining the link. But you could send me a copy of the article if you agree. – ttnphns Dec 06 '22 at 20:14
  • @ttnphns It's a blog post, it shows the idea behind a Venn diagram for more than three circles. – Sean_TBI_Research Dec 06 '22 at 20:16
  • I think there is no formula. No "magic" formula to sort through all the combinations exist. Technical solutions may differ depending on your language, available matrix/restructuring functions, and your creativity. – ttnphns Dec 06 '22 at 20:18
  • Yes I think you should read about TURF analysis and how they implement it. – ttnphns Dec 06 '22 at 20:20
  • @ttnphns But first I would need this binary matrix, TURF analysis happens on that binary matrix. How would I create the binary matrix first? – Sean_TBI_Research Dec 06 '22 at 20:25
  • Easy. Let V be the vector of your objects' cluster membership (labels) obtained via a clustering algorithm. Propagate the vector into the square matrix M. Then get the binary matrix B equal to M=M'. That is, transpose M and compare it with M. Vectorize (i.e, unwrap) one of two triangles of B into the column C. So this is one of the columns in your binary matrix. – ttnphns Dec 06 '22 at 20:38
  • @ttnphns Your linear algebra solutions counts the 'a' from the pairconfusion matrix; is there a way to also consider the 'd'. Where 'd' stands for 'both do not group together' – Sean_TBI_Research May 12 '23 at 10:09

1 Answers1

0
cluster=c(2,1,2,0,0,3)                 # kmedioids
cluster=cbind(cluster,c(0,0,0,0,0,1))  # spectral
n=nrow(cluster)
N=n*(n-1)/2                            # aantal paren 
A=matrix(NA,N,2)
for (k in 1:ncol(cluster)){
  rowcounter=1
  for (i in 1:(n-1)){
    for (j in (i+1):n){
      A[rowcounter,k]=cluster[i,k]==cluster[j,k]
      rowcounter=rowcounter+1
    }
  }
}

But unfeasible for Python! Re-write appreciated