
In papers about unsupervised clustering I see many references to a metric called "clustering accuracy" or "unsupervised clustering accuracy" (ACC), which is usually defined as something like: $$ \mathrm{ACC}(x, y) = \frac{1}{N} \max_{m}\sum_{i=1}^{N} \mathbb{1}(m(x_i) = y_i) $$ where $m$ ranges over all one-to-one mappings between the predicted cluster labels $x$ and the ground-truth labels $y$. I take the $\mathbb{1}(\cdot)$ to be an indicator function (essentially a Kronecker delta), but I'm not sure. Most papers I have seen also mention that the optimal mapping can be found with the Kuhn–Munkres (Hungarian) algorithm.
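For concreteness, here is a minimal sketch (my own, assuming integer-coded labels $0,\dots,K-1$) of how this quantity is typically computed: build the contingency table between predicted and true labels, then let the Kuhn–Munkres algorithm (e.g. `scipy.optimize.linear_sum_assignment`) pick the one-to-one label mapping that maximises the number of matches:

```python
# Minimal sketch of computing ACC, assuming integer cluster assignments
# and ground-truth labels of equal length. The function name is my own.
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy(y_true, y_pred):
    """Proportion of points whose cluster label matches the ground truth
    under the best one-to-one mapping between the two label sets."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n_labels = max(y_true.max(), y_pred.max()) + 1

    # Contingency table: counts[i, j] = number of points with
    # predicted label i and true label j.
    counts = np.zeros((n_labels, n_labels), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1

    # Kuhn–Munkres: maximise the total count over one-to-one label mappings.
    row_ind, col_ind = linear_sum_assignment(counts, maximize=True)
    return counts[row_ind, col_ind].sum() / y_true.size


# Labels 0 and 1 are swapped in the prediction, but the clusters are identical.
print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # 1.0
```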

But I can't find an original source for the definition of this metric. Most papers I read cite other recent papers that use ACC, rather than the source that defines it. Others don't give a reference for the formula at all (even though they do reference the formula for NMI, for example). I saw one paper that cited the book "Matching Theory", but I couldn't find the metric mentioned there either (though I searched through the PDF).

When I google it or search for it here, I only get results about clustering accuracy in general. It doesn't even appear to have a Wikipedia article, unlike other common metrics.

What I'm looking for is a source for this metric, whether it goes by some other name (which might be why I can't find it), and an explanation of it and how it differs from other metrics.

Cyo
  • Are you sure this is "unsupervised"? Is $y$ not a cluster label? You might need to provide some more details here. – Eoin Nov 09 '23 at 12:30
  • I don’t see how data used to develop clusters can be used to assess cluster “accuracy”. Instead I’d spend time verifying that the clusters are really clusters, i.e., that they are compact sets and satisfy (1) distances from cluster centers are not informative once cluster labels are defined and (2) no observation in a cluster is closer to an observation in another cluster than it is to the observations in its “own” cluster. People frequently build clusters that are not compact, and they turn out to not be so useful, and to hide heterogeneity. – Frank Harrell Nov 09 '23 at 13:34
  • Instead of answering the question, I would suggest you read a short overview of the external cluster validity indices: https://stats.stackexchange.com/q/586470/3277 – ttnphns Nov 09 '23 at 14:36
  • @FrankHarrell With a finite number of points, how can there be a lack of compactness? – Dave Nov 09 '23 at 14:39
  • Easy. Many clustering methods will find clusters even when the data represent pure noise. When such raw data have wide distributions, the found clusters are often extremely large and virtually meaningless. The fact that something is labeled as a cluster by an algorithm does not make it a cluster. – Frank Harrell Nov 09 '23 at 16:13
  • Thank you all for your discussion so far. @Eoin is correct that $y$ is the true cluster label, or another clustering that we are comparing to. However, this is called "unsupervised clustering accuracy" or "clustering accuracy" in the papers I have read, so I'm looking for a source of where it comes from. Other metrics common in clustering, such as ARI, also need the true cluster labels to be available; the difference is that they are agnostic to which label each cluster receives, as long as the elements of each true cluster end up grouped together (see the short sketch after these comments). – Cyo Nov 17 '23 at 04:59
  • @ttnphns Thanks for your comment. I'm aware of external cluster validity indices; I'm looking for the source of this particular one that I often see referenced. – Cyo Nov 17 '23 at 05:01
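As a side note on the ARI comparison above, here is a quick sketch (my own illustration, using scikit-learn's `adjusted_rand_score`) showing that ARI compares the two partitions directly and is unaffected by how the clusters happen to be labelled:

```python
# Quick check that ARI depends only on the partition, not on the
# particular labels assigned to the clusters.
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 0, 1, 1, 2]
print(adjusted_rand_score(y_true, [0, 0, 0, 1, 1, 2]))  # 1.0: identical labelling
print(adjusted_rand_score(y_true, [2, 2, 2, 0, 0, 1]))  # 1.0: labels permuted, same partition
print(adjusted_rand_score(y_true, [0, 0, 1, 1, 1, 2]))  # < 1: the partitions actually differ
```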

1 Answer


Given your definition (and the clarification in the comments), $\text{ACC}(x, y)$ is just the proportion of cases where the predicted cluster label, $m(x)$, matches the ground truth label $y$, with the added twist that you take the maximum over possible cluster label permutations to account for the possibility you've ended up with the labels the wrong way around. This is a pretty elementary idea, so I wouldn't expect a source or reference for it.
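To make the "maximum over permutations" part concrete, here is a small brute-force sketch (my own illustration, not from any particular reference): relabelling the clusters leaves ACC unchanged, and only genuinely misplaced points lower the score.

```python
# Brute-force illustration: ACC takes the best match over all label
# permutations, so an arbitrary relabelling of the clusters cannot
# change the score.
from itertools import permutations


def acc_bruteforce(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    best = 0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))  # candidate relabelling of y_pred
        matches = sum(mapping[p] == t for p, t in zip(y_pred, y_true))
        best = max(best, matches)
    return best / len(y_true)


y_true = [0, 0, 0, 1, 1, 2]
print(acc_bruteforce(y_true, [0, 0, 0, 1, 1, 2]))  # 1.0: identical labelling
print(acc_bruteforce(y_true, [2, 2, 2, 0, 0, 1]))  # 1.0: same clusters, labels permuted
print(acc_bruteforce(y_true, [0, 0, 1, 1, 1, 2]))  # < 1: one point genuinely misplaced
```

The brute-force search is only feasible for a handful of clusters; the Kuhn–Munkres algorithm mentioned in the question finds the same optimal mapping in polynomial time.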

Eoin