Suppose that we have the following contingency matrix:

   |  T1 |  T2  |  T3
---------------------
C1 |  0  |  53  |  10
C2 |  0  |  1   |  60
C3 |  0  |  16  |  0

Source: How to calculate purity?

As mentioned in the source, the purity is computed as:

Purity = (53 + 60 + 16) / 140 = 0.92142
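The arithmetic above can be checked with a minimal sketch. It assumes that the rows C1..C3 are the clusters found by the algorithm and the columns T1..T3 are the reference classes, which is consistent with taking the largest count in each row; the function names `purity` and `diagonal_accuracy` are hypothetical helpers, not from the source.

```python
# Contingency matrix from the question: rows C1..C3, columns T1..T3.
# Assumption: rows are the clusters, columns are the reference classes.
contingency = [
    [0, 53, 10],
    [0, 1, 60],
    [0, 16, 0],
]


def purity(matrix):
    """Sum of the largest count in each row, divided by the total count."""
    total = sum(sum(row) for row in matrix)
    return sum(max(row) for row in matrix) / total


def diagonal_accuracy(matrix):
    """Share of observations on the main diagonal (labels taken as given)."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


print(purity(contingency))             # (53 + 60 + 16) / 140
print(diagonal_accuracy(contingency))  # (0 + 1 + 0) / 140
```

With the labels taken as given, only 1 of the 140 observations lies on the diagonal, so the two measures can differ widely on the same matrix.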

This seems contradictory to me because, from what I understood:

  • If the purity is equal to 1, then the confusion matrix must be diagonal (which means the clustering accuracy is 100%). However, the purity here is close to 1, while the confusion matrix above clearly shows that the accuracy of the clustering method is low (given the total number of observations and the first column being all zeros).

My question:

  • Am I understanding the purity concept correctly? If not, could someone clarify the situation I presented from the mentioned source?

Thank you in advance for your help!

Tou Mou
  • Re "then the confusion matrix must be diagonal": that's a little different from what is explained in your referenced thread. See https://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation for instance. – whuber Aug 13 '20 at 19:25
  • @whuber, purity is a new concept I am trying to learn. As far as I know, if we draw the contingency matrix that maps the real clusters (from the human view) to the estimated classes (from the clustering algorithm), and this contingency matrix is diagonal, then of course the accuracy is 100%. – Tou Mou Aug 13 '20 at 20:25
  • @whuber, I'm looking for a simple and reproducible example of how to compute purity! But first: if purity is close to 1, does this mean that the clustering accuracy is very high, or the opposite? – Tou Mou Aug 13 '20 at 20:28
  • You appear to make a logical error in supposing the implication runs the other way: although a diagonal confusion matrix implies 100% purity, 100% purity does not imply the matrix is diagonal! The simple and reproducible example you request is posted in the thread you cite in your question. – whuber Aug 13 '20 at 21:30
  • This means that purity and accuracy aren't the same thing: I now understand that purity measures whether there are "strange" observations/points in each obtained class. That is to say, if the purity is equal to 1, then each obtained class (from a classification algorithm) matches exactly one cluster. – Tou Mou Aug 13 '20 at 23:25
  • So purity measures how well the algorithm identifies the reference clusters (the algorithm should first find a number of classes equal to the number of reference clusters), whereas the accuracy depends on the order of the class labels. If purity = 1 and accuracy < 100%, then we need to reassign the class labels. – Tou Mou Aug 13 '20 at 23:44
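
The point made in the comments (a diagonal matrix implies purity 1, but not the other way round) can be illustrated with a small sketch. The `permuted` matrix is made-up illustrative data, not from the source, and `best_relabelled_accuracy` is a hypothetical helper that simply tries every one-to-one relabelling of a square matrix, which is only practical for a handful of clusters.

```python
from itertools import permutations


def purity(matrix):
    """Sum of the largest count in each row, divided by the total count."""
    total = sum(sum(row) for row in matrix)
    return sum(max(row) for row in matrix) / total


def best_relabelled_accuracy(matrix):
    """Highest accuracy over all one-to-one row-to-column relabellings.

    Brute force over permutations; assumes a small square matrix.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    best = max(
        sum(matrix[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )
    return best / total


# Made-up example: perfectly pure clusters whose labels are merely shuffled.
permuted = [[0, 5, 0], [0, 0, 7], [4, 0, 0]]
# Matrix from the question (assumed: rows are clusters, columns are classes).
question = [[0, 53, 10], [0, 1, 60], [0, 16, 0]]

for name, m in [("permuted", permuted), ("question", question)]:
    print(name, purity(m), best_relabelled_accuracy(m))
```

For the shuffled matrix both values are 1 even though the diagonal is empty. For the matrix in the question, the best one-to-one relabelling reaches 113/140 ≈ 0.807, below the purity of 129/140, because C1 and C3 both have T2 as their largest count and purity does not require the mapping to be one-to-one.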
