
I am working with hourly weather data containing four features: rain, wind speed, humidity, and temperature, all continuous. There are around 17,000 records. Apart from precipitation, which is highly skewed (almost 90% of values are zero), the parameters are roughly normally distributed. To cope with the skewness, I added 1 to every precipitation value and took the log, which reduced the skewness somewhat, but it remains highly skewed.

After various preprocessing steps (standardization, PCA, ...), I want to determine the optimal number of clusters for KMeans. I tested the Silhouette, Gap statistic, Elbow, and Calinski-Harabasz methods, and they suggested 2, 6, 8, and 3 clusters, respectively.
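For reference, the preprocessing and scoring pipeline described above can be sketched roughly as follows. The real weather data isn't available here, so the block generates a synthetic stand-in of the same shape (zero-inflated rain plus three roughly normal features), and the Gap statistic is omitted since scikit-learn has no built-in for it:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Synthetic stand-in for ~17,000 hourly records of
# rain, wind speed, humidity, temperature (rain is ~90% zeros).
n = 17000
rain = np.where(rng.random(n) < 0.9, 0.0, rng.exponential(2.0, n))
X = np.column_stack([rain,
                     rng.normal(10, 3, n),    # wind speed
                     rng.normal(60, 15, n),   # humidity
                     rng.normal(15, 8, n)])   # temperature

X[:, 0] = np.log1p(X[:, 0])            # log(1 + rain) to reduce skew
X = StandardScaler().fit_transform(X)  # standardize all features
X = PCA(n_components=0.95).fit_transform(X)  # keep 95% of variance

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil = silhouette_score(X, labels, sample_size=2000, random_state=0)
    ch = calinski_harabasz_score(X, labels)
    print(f"k={k}  silhouette={sil:.3f}  calinski_harabasz={ch:.1f}")
```

Comparing these scores across k (together with the inertia curve for the elbow) reproduces the kind of disagreement described in the question.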

The Silhouette and Calinski-Harabasz results are clearly not correct, because it is implausible for weather conditions to fall into only 2 or 3 types.

But what about the Elbow and Gap statistic? How can I determine the right number of clusters: 6 or 8?

EDIT

[Plots of the Silhouette, Gap statistic, Elbow, and Calinski-Harabasz criteria against the number of clusters k]

Edit 2

Although the Elbow method is more primitive than the others, its suggestion seems much closer to real-world conditions. The question arose because we always need a technique to compare the results of the Gap statistic (7 clusters) and the Elbow (8 clusters). As you know, many clustering techniques require the number of clusters as input. In this case, the Silhouette and CH results were luckily easy to dismiss, but what if Elbow = 8, Silhouette = 9, Gap = 10, and CH = 11? In a real project, would you again prescribe the same approach, "Be open to the answer being 'k-means cannot cluster this data set well'", and probably come up with another solution?

A useful outcome of this question could be sharing your experiences and approaches if you have faced a similar issue.

    There are lots of ways to select the number of clusters, none of them conclusive wrt finding the 'right' or true number. One of the historic objections to clustering is the subjective or nonempirical nature of the chosen results. Solutions can be motivated while also being easily falsified. K-means always finds a solution, even when no clusters exist, a big caveat to this method. Have you compared the frequency distributions, does the larger solution have small size partitions? That would be one good reason for dropping the larger solution. – user78229 Apr 04 '23 at 12:09
  • You are right, k-means has its drawbacks, but in this situation I have no objection to it. I am looking for an approach to finding the number of clusters, which is an underlying requirement for many clustering approaches. By the way, yes, I have compared them, and the objects are distributed across the clusters very similarly for both cluster numbers, although that cannot be a good reason to reject one of them. – Asa Ya Apr 04 '23 at 12:43
  • Fair enough. As noted, motivating your solution is the best way to present the results to any audience. How tough is yours? If you demonstrate that you've considered a range of methods and metrics, will that be enough? For instance, a log transform is only one possible transformation; others include the arcsine function, which compresses skewed data much more than the log. Other transformations include range normalization or ipsative scaling, whereby each of your weather metrics is relativized by dividing by its maximum value. – user78229 Apr 04 '23 at 12:50
  • @ttnphns I didn't quite get your comments. Sorry! – Asa Ya Apr 04 '23 at 13:01
  • Please post pictures in your question: for each of the 4 criteria, a plot of its values against a range of numbers of clusters k. It is strange to post such a question without pictures illustrating your results. – ttnphns Apr 04 '23 at 13:05
  • @MikeHunter, if I am testing various techniques to find the optimal number, it is because I need to know it! The number of clusters can affect the next steps of my research project, so it is an issue. I evaluated other preprocessing techniques too, but even if both techniques gave me the same cluster results, I would need to be sure the clustering makes sense, not via cluster-validation techniques but via content analysis. Honestly, I am thinking about content-analysis techniques, and I welcome any idea about a statistical approach to show my audience which one is better! – Asa Ya Apr 04 '23 at 13:14
  • @ttnphns it is done – Asa Ya Apr 04 '23 at 13:24
  • Nice. My overall subjective impression is that you should check (compare for interpretability, and maybe also for cross-validity on subsamples) all solutions with k = 4 through 7. The SS elbow method is more primitive than its kin Calinski-Harabasz, so you may disregard it. Your clusters, though, are not well pronounced; they are not clear-cut. – ttnphns Apr 04 '23 at 13:50
  • With your permission, I'll leave a pair of links to my answers that you might find relevant to the theme of internal cluster validation: https://stats.stackexchange.com/a/358937/3277, https://stats.stackexchange.com/a/195481/3277. – ttnphns Apr 04 '23 at 13:53
  • Since you are using k-means, which partitions around cluster centroids (means), I would recommend trying not the classic version of the Silhouette criterion but the version called the "deviation type" or "simplified type" Silhouette. – ttnphns Apr 04 '23 at 13:57
  • Cube root is a common transformation for rainfall, and less arbitrary than log(amount + 1). For example, does 1 mean 1 mm or 1 inch, as those transformations are not identical? But no matter what you do, a spike at zero will remain a spike when transformed and at one end of your distribution. Apart possibly from re-discovering the difference between raining and not raining, the exercise seems doomed to finding arbitrary clusters. – Nick Cox Apr 07 '23 at 07:43
  • I previously didn't notice that your silhouette is much higher for k=2 than for k>2. While gap and calinski don't support that. A bit strange. – ttnphns Apr 08 '23 at 08:25
  • "I want to determine the optimal number of clusters to benefit KMeans....It is clear that the results of Silhouette and Calinski are not correct. Because it is impossible for weather conditions to be 2 or 3 types." This is not clear. What sort of clustering are you performing (with which expression of datapoints exactly?), what does it have to do with weather being 2 or 3 types? – Sextus Empiricus Apr 10 '23 at 19:10
  • @SextusEmpiricus, to initialize k-means I used the approaches mentioned. Of course, I also tried AHC, but choosing the cutting level in the dendrogram is another challenge. I have lived in this city for many years, and it is completely clear that the hourly weather conditions cannot be categorized into only three clusters. – Asa Ya Apr 11 '23 at 23:03
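A minimal sketch of the "simplified" Silhouette variant mentioned in the comments, which scores points by their distances to cluster centroids rather than by mean pairwise distances (the exact formulation ttnphns has in mind may differ in details):

```python
import numpy as np

def simplified_silhouette(X, labels, centroids):
    """Simplified silhouette: a(i) is the distance to the own centroid,
    b(i) the distance to the nearest other centroid. Cheaper than the
    classic silhouette and better matched to k-means geometry."""
    # Pairwise point-to-centroid distances, shape (n_points, n_clusters).
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    idx = np.arange(len(X))
    a = d[idx, labels]                 # distance to own centroid
    d_other = d.copy()
    d_other[idx, labels] = np.inf
    b = d_other.min(axis=1)            # distance to nearest other centroid
    return np.mean((b - a) / np.maximum(a, b))

# Toy example: two clearly separated clusters should score near 1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [10.0, 0.5]])
print(simplified_silhouette(X, labels, centroids))
```

With KMeans, `labels` and `centroids` would come from `fit_predict` and the fitted model's `cluster_centers_`.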

1 Answer


Probably none of them is "correct", because of your data.

  1. There is no elbow; this is pretty much the expected behavior on random data.
  2. All Silhouette scores for k > 2 are very low, so none of those results is good.
  3. C-H seems to max out at 6, so why do you choose 4?

In particular, when the methods disagree and do not give clear indications, it usually means that simply none of the results is good!
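To illustrate point 1, here is a quick experiment on uniform noise, which contains no clusters by construction. Within-cluster inertia still decreases steadily with k, so a descending curve alone is weak evidence for any particular k:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((2000, 4))  # uniform noise in 4-D: no clusters exist

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# The curve decreases smoothly even though there is nothing to find,
# which is why an apparent "elbow" by itself proves little.
print([round(v, 1) for v in inertias])
```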

See my preprint:

Schubert, Erich. "Stop using the elbow criterion for k-means and how to choose the number of clusters instead." arXiv preprint arXiv:2212.12189 (2022). https://arxiv.org/abs/2212.12189

and pay particular attention to the section titled "The true challenges of k-means" and to Figure 4 (because the earlier results are on easy data sets). You do not need to choose k if k-means cannot solve your problem; have you considered that your data does not contain k-means-type clusters? Be open to the answer being "k-means cannot cluster this data set well".

Since you are using PCA, beware that PCA may even destroy some of the signal. Plot the data: if you cannot identify clusters in your plot, k-means probably cannot either.
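A minimal sketch of the suggested check, using random stand-in data since the real matrix isn't available here: project onto the first two principal components and inspect the scatter before trusting any choice of k:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the standardized weather matrix (5000 rows, 4 features).
X = rng.normal(size=(5000, 4))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)

plt.scatter(Z[:, 0], Z[:, 1], s=2, alpha=0.3)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.savefig("pca_scatter.png")
print(pca.explained_variance_ratio_)
```

If no grouping is visible in such a plot of the real data, that is already strong evidence against a k-means-style cluster structure.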

  • We appreciate you contributing your expertise! Please note that if you are the author of a paper cited in an answer, you must disclose your affiliation: https://meta.stackexchange.com/a/59302/320588. (Having the same username as the author is not sufficient; you must state in the answer that you are the author.) – Sycorax Apr 07 '23 at 15:01
  • "When the methods disagree and do not give clear indications, this usually means that simply none of the results is good": I wouldn't be so categorical. Only for really clear-cut, well-separated clusters will all (or a majority of) clustering criteria give their "peaks" or "elbows" at the same k. In real studies we usually deal with modestly mounded data where different criteria will disagree. It is important to select one type of criterion which is "isomorphic" to your idea of a cluster and well suited to the data, and to believe only it (plus other methods of validation). – ttnphns Apr 08 '23 at 07:48
  • The OP's Silhouette value at k=2 is indeed much bigger than for k>2 and is about 0.65 (and I previously completely overlooked it). One might expect that other criteria would support this finding. They did not. Yes, a bit strange. It would be interesting to look at the OP's data and clustering results. – ttnphns Apr 08 '23 at 08:00
  • That is probably precipitation = 0 vs. precipitation != 0, so the result will be "correct" but not useful. – Erich Schubert Apr 10 '23 at 18:42
  • @ErichSchubert thank you for sharing your paper and leaving an informative response. I read your paper. I cannot use GMM because of the precipitation skewness. I checked other approaches like AHC, Affinity Propagation, etc.; their results were similar: all of them separated temperature perfectly but not the other features. In this case, I think the Elbow suggests a more realistic cluster number than the others, because I know that in the city where I live the weather conditions fall into at least 7 clusters. – Asa Ya Apr 11 '23 at 23:15
  • Thank you for all your replies. This is a real case: although the Elbow is more primitive than the others, its suggestion seems much closer to real-world conditions. The question arose because I was looking for a technique to compare the results of the Gap statistic (7 clusters) and the Elbow (8 clusters). Anyway, many clustering techniques need the number of clusters as input. What if Elbow = 8, Silhouette = 9, and CH = 12? Would you again prescribe the same approach, "Be open to the answer being 'k-means cannot cluster this data set well'", and probably come up with another solution? – Asa Ya Apr 12 '23 at 03:14
  • If the data is skewed, k-means is not the right tool. Preprocess your data better? Make sure to visualize your data: if you cannot "see" clusters, chances are that k-means will not work either; that is quite a good rule of thumb. Maybe add visualizations of your data to your question? – Erich Schubert Jun 06 '23 at 15:20