
I have a dataset with 11 variables and 80,000 observations.

I know two techniques for finding evidence of clusters in a dataset: hierarchical clustering and k-means. I can't use hierarchical clustering because the dataset is too large, so I'm trying the k-means approach.

With both my handmade function and the sklearn KMeans function, I get 4 clusters (I used the kneed.KneeLocator function to pick the elbow). My question is not about the technique, but more about whether this makes sense:

I find 4 clusters, great! But is that evidence of the presence of clusters in the dataset? Wouldn't k-means have found an "ideal" number of clusters even if the data contained no real cluster structure?
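For reference, here is a minimal sketch of the procedure I followed; the array name `X` is just a placeholder for my data, the rest uses the sklearn and kneed functions mentioned above:

```python
# Elbow procedure, assuming the data are already loaded into a
# NumPy array X of shape (80000, 11).
from sklearn.cluster import KMeans
from kneed import KneeLocator

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

# Inertia decreases with k, so look for the knee of a convex decreasing curve.
knee = KneeLocator(list(ks), inertias, curve="convex", direction="decreasing")
print("Suggested number of clusters:", knee.elbow)  # gives 4 on my data
```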

The output: [figure omitted]

2 Answers


$k$-means, like all other clustering algorithms, is an unsupervised method. What this means is that it doesn't use labels. It's not only that you don't need the labels: you don't even need to assume that any such labels exist. The algorithm just groups "similar" things together, under some definition of what that means. There may be multiple different but equally good clustering solutions for the same dataset, and in many cases the choice between them would be rather arbitrary. So you don't need to know the number of clusters; often it is rather a question of how many clusters you choose to use.

From the "How to decide on the correct number of clusters?" thread you can learn about tools that could help you make the decision.

Tim
  • Thanks @Tim! Ok, so the answer would be "we can group the observations into various numbers of groups, and the aggregate distance assessment suggests that the most relevant number is 4. There may be multiple different but equally good clustering solutions for the dataset depending on the clustering and assessment methods. In many cases the choice between the solutions would be rather arbitrary because $k$-means, like all other clustering algorithms, is an unsupervised method." Am I right? – Cornelius Cellier Nov 20 '21 at 14:54
  • @CorneliusCellier yes. For example, you could have 2-, 3-, and 4-cluster solutions of similar quality, but choose 2 clusters because it's the most easily interpretable and thus most useful. – Tim Nov 20 '21 at 15:05
  • Okay, thank you! :) The dataset is one of options and multiple financial ratios (not my speciality), but I'll try Karolis Koncevičius's idea of visualizing with a PCA approach! Thanks a lot! – Cornelius Cellier Nov 20 '21 at 15:11

The fact that the approach you used returns 4 as the optimal number of clusters does not imply that there are 4 separate groups of observations. To test this empirically, you can generate a random dataset with 80,000 observations and 11 variables and repeat the procedure. I bet the function would still return an optimal number of clusters (maybe even 4), but since the data were generated randomly, we would know that the actual number should be none (or 1).
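A minimal sketch of this null experiment, using the same sklearn and kneed functions from your question (the exact parameters are illustrative):

```python
# Run the elbow procedure on pure noise: 80,000 observations, 11 variables,
# with no cluster structure by construction.
import numpy as np
from sklearn.cluster import KMeans
from kneed import KneeLocator

rng = np.random.default_rng(0)
X_noise = rng.standard_normal((80_000, 11))

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_noise).inertia_
            for k in ks]

knee = KneeLocator(list(ks), inertias, curve="convex", direction="decreasing")
print("'Optimal' number of clusters on random data:", knee.elbow)
```

If KneeLocator reports a knee here, that knee clearly does not reflect real structure in the data.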

Furthermore, k-means is based on Euclidean distance, and observations are assigned to the closest centroid. This means that it implicitly assumes the clusters to be of more or less equal size. So if your data contain 2 huge clusters and 10 smaller ones, the smaller ones would likely not be differentiated into separate groups.
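A small simulated illustration of this effect (the blob locations and sizes are made up for the example):

```python
# One large Gaussian blob next to two small ones: k-means with k=3 typically
# prefers splitting the big blob over separating the two small ones, because
# that reduces the total within-cluster sum of squares more.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
big = rng.normal(loc=(0, 0), scale=1.0, size=(5000, 2))
small1 = rng.normal(loc=(6, 0), scale=0.3, size=(100, 2))
small2 = rng.normal(loc=(6, 2), scale=0.3, size=(100, 2))
X = np.vstack([big, small1, small2])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# The last 200 rows are the two small blobs; check how they were assigned.
print(np.unique(labels[-200:], return_counts=True))
```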

You can try other methods for determining how many separated groups of points there are. What I would try first is principal component analysis: visualise the scatter on the first 3 components and look for densely populated areas separated by less densely populated borders.
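A minimal sketch with sklearn and matplotlib, assuming your data are in an array `X` (the variable name is mine):

```python
# Project the data onto the first 3 principal components and inspect the
# pairwise scatter plots for dense regions separated by sparse borders.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to scale
scores = PCA(n_components=3).fit_transform(X_scaled)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (i, j) in zip(axes, [(0, 1), (0, 2), (1, 2)]):
    ax.scatter(scores[:, i], scores[:, j], s=1, alpha=0.1)
    ax.set_xlabel(f"PC{i + 1}")
    ax.set_ylabel(f"PC{j + 1}")
plt.tight_layout()
plt.show()
```

With 80,000 points, the small marker size and low alpha matter; otherwise the plot saturates and density differences become invisible.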

However, the point made by @Tim still stands: clustering is often a subjective procedure, and there might not be an objective way to select the number of clusters. As an example, consider the exercise of clustering animals in a zoo. We might cluster them by number of legs, by colour, by height, by what they eat, by their natural habitat, by how long they live, and so on. All of these groupings would be different, yet all of them would also be valid. The same idea extends to the number of clusters: I might cluster animals based on whether they live in Europe, Africa, the Americas, or Australia, and you might divide these continents further into north/south/east/west, giving more clusters. Yet we would both be right in our own way.

  • Thank you so much for your answer, this is crystal clear and I think the topic can be closed! By the way, I was wondering: is it a thing to do k-means with a distance metric that takes correlation into account (I'm thinking of the Mahalanobis distance)? I'll try the PCA visualization immediately, thank you for the tip. – Cornelius Cellier Nov 20 '21 at 15:08
  • Glad it was helpful. On this site the question is closed by the person asking: to properly close the question you should accept one of the answers by clicking the "accept" check mark next to it. And if you have a separate question (i.e. about using a distance that takes correlation into account), you should search this site for answers first, and if you don't find any, then please post a separate question. – Karolis Koncevičius Nov 20 '21 at 15:11
  • Here is one that might be relevant: https://stats.stackexchange.com/questions/81481/why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metric/ – Karolis Koncevičius Nov 20 '21 at 15:18
  • Oh thank you! I should have checked, sorry, it's a bad habit. – Cornelius Cellier Nov 20 '21 at 15:23
  • @CorneliusCellier it's all good, compared to most new users you are doing great :) – Karolis Koncevičius Nov 20 '21 at 15:24