External validation of clustering requires labels, but why cluster at all if you have labels?

Question

There are two types of validation in clustering, using:

Internal indexes: measure the goodness of a clustering structure without reference to external information (e.g., sum of squared errors).
External indexes: compare the results of a cluster analysis to an externally known result, such as externally provided class labels (e.g., Rand index, purity).
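To make the distinction concrete, here is a minimal sketch of one internal index (sum of squared errors) and two external ones (purity, Rand index), written from their textbook definitions. The points, cluster assignments, and labels are made-up toy values, and the function names are just illustrative.

```python
from collections import Counter
from itertools import combinations


def sse(points, clusters):
    """Internal index: squared distance of each 1-D point to its cluster mean."""
    total = 0.0
    for c in set(clusters):
        members = [p for p, k in zip(points, clusters) if k == c]
        mean = sum(members) / len(members)
        total += sum((p - mean) ** 2 for p in members)
    return total


def purity(clusters, labels):
    """External index: share of points carrying their cluster's majority label."""
    majority_total = 0
    for c in set(clusters):
        counts = Counter(l for k, l in zip(clusters, labels) if k == c)
        majority_total += counts.most_common(1)[0][1]
    return majority_total / len(labels)


def rand_index(a, b):
    """External index: share of point pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Toy data (assumed for illustration only).
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
clusters = [0, 0, 0, 1, 1, 0]            # hypothetical clustering output
labels = ["a", "a", "a", "b", "b", "b"]  # hypothetical external labels

print("SSE:", sse(points, clusters))                   # needs no labels
print("purity:", purity(clusters, labels))             # needs labels
print("Rand index:", rand_index(clusters, labels))     # needs labels
```

Note that `sse` only looks at the data and the clustering, while `purity` and `rand_index` need a second labeling to compare against.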

I'm confused about the use of external validation indexes in clustering. If the class labels are known, why use clustering (i.e., unsupervised learning) instead of supervised learning (e.g., an SVM)?

External indices are also used to compare two partitions in the general case, i.e., when neither of the two is the "true" one. And, of course, as said in the answers below, they are used to test the quality of the results of a clustering or classification algorithm. – ttnphns, Nov 20 '18 at 08:29
Thank you @ttnphns, but what do you mean by the "true" partition? The way I understood it, external validation is when you validate a partition (the result of your clustering) by comparing it with the "correct"/"true" partition (externally provided class labels). – learners, Nov 20 '18 at 14:52
What ttnphns said is that this is not the only use case for these measures. You can also compute, e.g., the stability of a method by evaluating the similarity between different unsupervised results. The "evaluation" use case is indeed academic and not applicable in many real scenarios, but stability is. – Has QUIT--Anony-Mousse, Nov 23 '18 at 09:21
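The stability idea from the comments can be sketched as follows: no partition is treated as the truth, and the Rand index is simply averaged over all pairs of runs. The three runs below are hypothetical assignments, not the output of any particular algorithm, and averaging pairwise scores is only one possible way to summarize stability.

```python
from itertools import combinations


def rand_index(a, b):
    """Fraction of point pairs that both partitions treat the same way."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Hypothetical cluster assignments from three runs of one method on the
# same six points (e.g. different random seeds). None of them is "true".
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping as the first run, cluster ids swapped
    [0, 0, 1, 1, 1, 1],  # one point moved to the other cluster
]

# Compare every pair of runs; the measure is symmetric, so order is irrelevant.
scores = [rand_index(a, b) for a, b in combinations(runs, 2)]
stability = sum(scores) / len(scores)
print(scores, stability)
```

The first two runs score a perfect agreement even though their cluster ids differ, because the index only looks at which points are grouped together.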

Simone · Answer 1 · 2018-11-20T07:14:51.063

3

External validity indices are used when you propose a new clustering technique and you want to validate it or you want to compare it to existing techniques. In these cases, you get a bunch of datasets for which you know the ground truth and see if your clustering technique is able to produce clustering solutions that are similar to it.
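As a rough sketch of that workflow: the toy 1-D k-means below stands in for "your new technique", and the two small datasets with known ground truth are invented for illustration. A real benchmark would use established datasets and compare against several existing methods.

```python
from itertools import combinations


def rand_index(a, b):
    """Fraction of point pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def kmeans_1d(points, k, iters=20):
    """Toy Lloyd's algorithm on 1-D data (placeholder for the method under test)."""
    ordered = sorted(points)
    # Deterministic init: k evenly spaced order statistics.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for each point.
        assign = [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = sum(members) / len(members)
    return assign


# Hypothetical benchmark datasets: (points, ground-truth labels).
benchmarks = {
    "well_separated": ([1.0, 1.1, 0.9, 5.0, 5.1, 4.9], [0, 0, 0, 1, 1, 1]),
    "overlapping": ([1.0, 2.0, 3.0, 2.5, 3.5, 4.5], [0, 0, 0, 1, 1, 1]),
}

scores = {
    name: rand_index(kmeans_1d(points, k=2), truth)
    for name, (points, truth) in benchmarks.items()
}
for name, score in scores.items():
    print(f"{name}: Rand index = {score:.3f}")
```

The ground truth is only used for scoring; the clustering step itself never sees it.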

edited Nov 20 '18 at 07:14

answered Nov 19 '18 at 16:59


Only in this case? So is it always frowned upon to use clustering when we have class labels? Why can't we argue that clustering algorithms are less complex, since they use notions like distance, mean, median, etc., whereas in supervised learning you have to tune complex parameters, which is time-consuming, and so on? – learners Nov 20 '18 at 14:39

Class labels are only a specific way to cluster your data. You might want your clustering algorithm to reproduce that view, or you might not. If you do, you can use external validity indices.
There are also other possible views of a data set. Have a look at alternative clustering and multiview clustering.
– Simone Nov 20 '18 at 14:41

Thank you @Simone for your help. I am reassured to know that there is no rule that says "if you have labels you must use supervised learning and not clustering". – learners Nov 20 '18 at 14:56

Yes, if you have labels you can still use clustering techniques to find other views of the data. – Simone Nov 20 '18 at 15:03

score 0 · Answer 2 · answered Nov 19 '18 at 17:34

Clustering is generally done on data that has no labels.

The validation method you can use depends on the data and on the problem you are solving.

External indexes: can be used when your clustering model produces valid classes and you are able to make out those classes and hand-label the data.

Internal indexes: when the data is unknown and cannot be labeled, you can use this approach for validation.

Supervised approach: if you have data that is already labeled, then of course you can use supervised approaches.

score 0 · Answer 3 · answered May 19 '20 at 21:08

Data is just a bunch of measurements. What constitutes the "ground truth" and "external labels" is determined by people.

Take this data for example:

First image is of a wild dog, then some measurements of its characteristics
Second image is of a wild cat, then some measurements of its characteristics
Third image is of a domesticated cat, then some measurements of its characteristics

If you want to perform supervised learning, you could choose to take "wild vs domesticated" as the ground truth labels. Or you could choose "dog vs cat" as the ground truth labels. They are just labels.

If you don't necessarily want to train a machine learning model, but want to find out if there are some natural groupings, you might do clustering. Now, if you want to check whether the grouping that "naturally" arose corresponds to animal species, you could check it against an "outside label" such as "dog vs cat". Similarly, you could choose to check it against "wild vs domesticated".
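Here is a small sketch of that check, scoring one clustering against both labelings with purity. The six animals, their cluster assignments, and both label lists are invented for illustration.

```python
from collections import Counter


def purity(clusters, labels):
    """Share of points whose label is the majority label of their cluster."""
    majority_total = 0
    for c in set(clusters):
        counts = Counter(l for k, l in zip(clusters, labels) if k == c)
        majority_total += counts.most_common(1)[0][1]
    return majority_total / len(labels)


# Hypothetical result of clustering six animals by their measurements.
clusters = [0, 0, 1, 1, 1, 1]

# Two equally valid "outside labels" for the same six animals.
species = ["dog", "dog", "cat", "cat", "cat", "cat"]
habitat = ["wild", "wild", "wild", "wild", "domesticated", "domesticated"]

print("vs species:", purity(clusters, species))
print("vs habitat:", purity(clusters, habitat))
```

In this made-up case the clusters line up perfectly with species and only partly with habitat, which says something about what the grouping captured, not about which labeling is "correct".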

I argue that labels are often made-up notions in some contexts. Take the hypothetical image of a wild dog, for example, in an image classification context. There could be a mountain in the background of the wild dog, and, in a completely different image classification problem, the "ground truth label" for the same image could be "mountain" (as opposed to "ocean").
