When using "silhouette coefficient" to evaluate an unsupervised model, do we need a labelled dataset?

Question

How can the "silhouette coefficient" find the optimal number of clusters when the dataset is not labelled? Does it need a labelled dataset, or is it pure statistics?

I mean, in unsupervised learning a labelled dataset is sometimes used to verify that the model is doing well. So I would like to know whether the "silhouette coefficient" relies on a labelled dataset (even though we do not give the labels to the algorithm).

Thank you

score 0 · Answer 1 · answered Aug 19 '21 at 01:42

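You can find a definition of the silhouette coefficient on Wikipedia. I won't rehash the definition (let me know if you'd like that anyway), but from it we can see that the silhouette coefficient depends only on the data points themselves and the cluster assignments produced by the clustering algorithm: for each point, it compares the mean distance to the other points in its own cluster with the mean distance to the points in the nearest other cluster. There is therefore no need for a labelled dataset.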
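To make this concrete, here is a minimal sketch of choosing the number of clusters with the silhouette coefficient. It assumes scikit-learn and k-means, neither of which is mentioned in the question; the synthetic data, the cluster centres, and the range of k are made up for illustration. Note that the true labels returned by `make_blobs` are thrown away and never reach the score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data: four well-separated blobs (centres chosen arbitrarily).
# The ground-truth labels are discarded on purpose.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 8):
    # Cluster assignments come from the algorithm alone.
    cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # The score needs only the data and the predicted assignments.
    scores[k] = silhouette_score(X, cluster_ids)

# Pick the k with the highest mean silhouette coefficient.
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

The silhouette coefficient always lies between -1 and 1, and a higher mean value indicates tighter, better-separated clusters, so taking the k with the largest score is one common heuristic, not a guarantee of the "true" number of clusters.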

Of course, if you had a labeled dataset on hand, you could simply compute classification accuracy, F-score, or whatever classification metric you like directly, where each cluster corresponds to a predicted class. In such cases, I can't think off the top of my head why you'd use a fully unsupervised approach over a supervised one.
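If labels are available, a label-based comparison could look like the sketch below. It again assumes scikit-learn, plus SciPy for matching clusters to classes; the data is synthetic and the one-to-one matching step (Hungarian algorithm on the confusion matrix) is just one possible way to decide which cluster corresponds to which class. The adjusted Rand index is included as an alternative that does not need any matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, adjusted_rand_score, confusion_matrix

# Illustrative labelled data: four well-separated blobs.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, y_true = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)

# Cluster ids are arbitrary integers; they need not line up with class ids.
cluster_ids = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Rows are true classes, columns are cluster ids.
cm = confusion_matrix(y_true, cluster_ids)
# Find the one-to-one cluster -> class mapping that maximises agreement.
class_idx, cluster_idx = linear_sum_assignment(-cm)
mapping = {cluster: cls for cls, cluster in zip(class_idx, cluster_idx)}
y_pred = np.array([mapping[c] for c in cluster_ids])

acc = accuracy_score(y_true, y_pred)
# Permutation-invariant alternative: no mapping step required.
ari = adjusted_rand_score(y_true, cluster_ids)
print("accuracy:", acc, "adjusted Rand index:", ari)
```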
