
I have 12 vectors of size 1x16, which are generated as a side product of my algorithm. If any of the vectors are very similar, that could indicate that my algorithm is performing badly. Roughly half of the values have an absolute value below 0.001 (see example below). The similarity measure shouldn't take these into account. It's only problematic if, say, 4 out of 6 of the other values are of roughly the same size and polarity. I'll call these kinds of vectors 'duplicates' from now on.

Consequently, I'd like to get a single measure across all of the vectors that indicates how many of them are duplicates. I've considered using the max correlation across all of the vectors. However, that would lose too much information, as I would not be able to see if, suddenly, 4 vectors were somewhat 'duplicates' instead of just one. Do any of you have a suggestion? It should:

  • Provide a single measure for the problem across all the vectors
  • Capture duplicates with high values and same polarity while ignoring similarity across small values
  • Capture more information than just the single worst duplicate

Example of weights:

[[-0.00811 -0.0245   0.05482  0.01891 -0.02844 -0.05945 -0.07583 -0.00773
  -0.0209   0.01005  0.00957 -0.00653 -0.00528  0.00099  0.06555 -0.03687]
 [ 0.03185 -0.0003  -0.04561 -0.0126  -0.0147   0.01412  0.00067  0.01512
   0.0072  -0.03175 -0.0285   0.00716  0.01019  0.00101  0.01687  0.00935]
 [ 0.01864 -0.01127 -0.04923 -0.00753 -0.01005 -0.00171  0.01121  0.01055
   0.00494 -0.01323 -0.03017 -0.02499  0.01433  0.00994  0.0174  -0.04603]
 [ 0.01941  0.00931 -0.07338 -0.02442 -0.00413  0.06842  0.03236  0.02615
  -0.00177 -0.03787 -0.04764 -0.01655  0.01878  0.00717  0.01626  0.03079]
 [ 0.02546  0.00181 -0.05653 -0.00788 -0.0141   0.04684  0.04007  0.02855
   0.00139 -0.04264 -0.03119 -0.02982  0.01441  0.02718  0.01715 -0.00244]
 [ 0.01698 -0.00071 -0.02683  0.00106 -0.00645  0.05156  0.01057 -0.00029
  -0.01263 -0.01852 -0.02875 -0.00654  0.00748  0.00069  0.00362  0.00781]
 [ 0.05699 -0.01782 -0.07301 -0.0191   0.00916  0.00065  0.01114  0.02212
   0.01481 -0.01884 -0.01653 -0.04869 -0.01386  0.01592  0.0386  -0.02921]
 [ 0.03989  0.00169 -0.07417 -0.01567 -0.01181  0.00823  0.03128  0.02953
   0.01478 -0.03537 -0.02955 -0.03212  0.00314  0.03176  0.01733 -0.00544]
 [-0.01041 -0.00762  0.00718  0.00046 -0.0203  -0.00571  0.02237  0.00219
  -0.00966 -0.01272 -0.02451 -0.02471  0.00243 -0.00099  0.0182  -0.02487]
 [ 0.04271 -0.00641 -0.055   -0.00497 -0.00509  0.01417  0.00327 -0.00436
   0.01582 -0.01443 -0.0355  -0.01084 -0.01206 -0.00011  0.02603 -0.02869]
 [ 0.0522  -0.02138 -0.03758  0.00005  0.01031 -0.02306  0.00215 -0.0019
   0.00489 -0.02156 -0.01194 -0.01748 -0.00246  0.01292  0.01723 -0.06472]
 [ 0.00382 -0.02599  0.0512   0.02528 -0.04293 -0.10073 -0.04425 -0.02168
  -0.01311  0.01385 -0.01835 -0.0544  -0.01334 -0.00722  0.10038 -0.00655]]
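To make the requirements concrete, here is a minimal sketch (Python/NumPy) of one measure along these lines. The function name `duplicate_score`, the magnitude-ratio test, and the default thresholds are my own placeholder assumptions, not something the question prescribes; only the 0.001 "ignore small values" idea comes from the question itself.

import numpy as np

def duplicate_score(W, zero_tol=0.001, ratio=0.5):
    """Single scalar in [0, 1]: the mean over vectors of each vector's
    best match against any other vector.

    Hypothetical definition of a 'match' between two vectors: the
    fraction of positions where both entries exceed `zero_tol` in
    magnitude, share the same sign, and are of comparable size
    (smaller/larger magnitude ratio >= `ratio`)."""
    W = np.asarray(W)
    n = len(W)
    best = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            a, b = W[i], W[j]
            # Only compare positions where both values are 'large'.
            large = (np.abs(a) > zero_tol) & (np.abs(b) > zero_tol)
            if not large.any():
                continue
            same_sign = np.sign(a[large]) == np.sign(b[large])
            mag_a, mag_b = np.abs(a[large]), np.abs(b[large])
            comparable = np.minimum(mag_a, mag_b) / np.maximum(mag_a, mag_b) >= ratio
            best[i] = max(best[i], np.mean(same_sign & comparable))
    # Averaging the per-vector best matches keeps information about how
    # many vectors are involved, rather than reporting only the single
    # worst pair.
    return best.mean()

Calling `duplicate_score(weights)` on the 12x16 array above would return one number per run that grows as more vectors acquire close duplicates, which is the behaviour the requirements ask for.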
  • Perhaps the mean of the Jaccard coefficient with a method for finding the right binarization cut-off would be useful? – pir Apr 13 '15 at 13:20
  • Even better, the mean of the max Jaccard coefficient found for each vector when matched against all other vectors would be useful. – pir Apr 13 '15 at 13:32
  • What about a clustering method and some central measure? – pir Apr 13 '15 at 13:41
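For reference, a minimal sketch of the mean-of-max-Jaccard idea from the comments above, assuming a signed binarization: each vector is reduced to the set of (position, sign) pairs of its entries above a magnitude cutoff, so small values are ignored and polarity is respected. The name `mean_max_jaccard` and the default cutoff of 0.01 are placeholders; the comments explicitly leave the choice of binarization cut-off open.

import numpy as np

def mean_max_jaccard(W, cutoff=0.01):
    """Mean over vectors of the max Jaccard coefficient found for each
    vector when matched against all other vectors."""
    # Represent each vector as the set of (position, sign) pairs of its
    # large entries.
    sets = [{(k, np.sign(v)) for k, v in enumerate(row) if abs(v) > cutoff}
            for row in np.asarray(W)]
    n = len(sets)
    maxima = []
    for i in range(n):
        best = 0.0
        for j in range(n):
            if i == j or not (sets[i] | sets[j]):
                continue
            jac = len(sets[i] & sets[j]) / len(sets[i] | sets[j])
            best = max(best, jac)
        maxima.append(best)
    return float(np.mean(maxima))

Taking the max per vector before averaging means the measure still rises when several vectors each find a near-duplicate, rather than reflecting only the single most similar pair.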

0 Answers