I am working with hourly weather data. It contains four features: rain, wind speed, humidity, and temperature, all of them continuous. There are around 17,000 records. Apart from precipitation, which is highly skewed (almost 90% of the values are zero), all features are normally distributed. To reduce the skewness, I added one to every precipitation value and then took the logarithm. This improved the skewness a bit, but the feature is still highly skewed.
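To make the transformation concrete, here is a minimal sketch of the "add one, then take the log" step. It uses synthetic data, not my real records: the 90% zero share matches my data, but the exponential shape of the non-zero values and the helper name `skewness` are assumptions made only for illustration.

```python
import math
import random
import statistics


def skewness(values):
    """Biased sample skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


random.seed(0)

# Synthetic stand-in for hourly precipitation (assumption): about 90% zeros,
# the remaining values drawn from an exponential distribution with mean 2.
rain = [0.0 if random.random() < 0.9 else random.expovariate(0.5)
        for _ in range(17000)]

# log(1 + r): zeros stay at zero, large values are compressed.
rain_log = [math.log1p(r) for r in rain]

zero_share = sum(r == 0.0 for r in rain) / len(rain)
skew_raw = skewness(rain)
skew_log = skewness(rain_log)

print(f"share of zeros:       {zero_share:.3f}")
print(f"skewness before log:  {skew_raw:.2f}")
print(f"skewness after log1p: {skew_log:.2f}")
```

On data of this shape the skewness drops after the transform but stays large, because the zeros remain a single spike at zero. Even a pure 0/1 indicator with 10% ones has a skewness of about 2.67, so a monotone transform alone cannot remove it.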
After applying various preprocessing techniques (standardization, PCA, ...), I want to determine the optimal number of clusters for KMeans. I tested the Silhouette score, the Gap statistic, the Elbow method, and calinski_harabasz; the numbers of clusters they identified are 2, 6, 8, and 3, respectively.
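For reference, the comparison can be sketched roughly as follows with scikit-learn. This is not my exact pipeline: the data is a synthetic stand-in (the means and spreads are made up), PCA is left out, and the Gap statistic is omitted because scikit-learn does not ship an implementation of it. The range of k from 2 to 10 is also just an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000  # smaller than the real data set, to keep the sketch fast

# Synthetic stand-in for the four features (all parameters are assumptions).
rain = np.where(rng.random(n) < 0.9, 0.0, rng.exponential(2.0, n))
X = np.column_stack([
    np.log1p(rain),              # log-transformed precipitation
    rng.normal(4.0, 1.5, n),     # wind speed
    rng.normal(60.0, 15.0, n),   # humidity
    rng.normal(15.0, 8.0, n),    # temperature
])
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    scores[k] = {
        "inertia": model.inertia_,  # within-cluster sum of squares (Elbow)
        "silhouette": silhouette_score(X_scaled, model.labels_),
        "calinski_harabasz": calinski_harabasz_score(X_scaled, model.labels_),
    }

# Both scores are "higher is better"; the Elbow is read off the inertia curve.
best_silhouette_k = max(scores, key=lambda k: scores[k]["silhouette"])
best_ch_k = max(scores, key=lambda k: scores[k]["calinski_harabasz"])

for k, s in scores.items():
    print(f"k={k:2d}  inertia={s['inertia']:9.1f}  "
          f"silhouette={s['silhouette']:.3f}  CH={s['calinski_harabasz']:.1f}")
print("best k by silhouette:", best_silhouette_k)
print("best k by Calinski-Harabasz:", best_ch_k)
```

Each criterion is computed on the same fitted models, so any disagreement between them comes from the criteria themselves and not from different KMeans runs.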
It is clear that the Silhouette and Calinski-Harabasz results are not correct, because it is impossible for there to be only 2 or 3 types of weather conditions.
But what about the Elbow method and the Gap statistic? How can I determine the right number of clusters: 6 or 8?
EDIT
Edit 2
Although the Elbow method is more primitive than the others, its suggestion seems much closer to real-world conditions. Indeed, I raised the question because we always need a technique for comparing results such as those of the Gap statistic (7 clusters) and the Elbow method (8 clusters). As you know, many clustering techniques need the number of clusters as input. In this case, luckily, Silhouette and CH give unacceptable results. But what if Elbow = 8, Silhouette = 9, Gap = 10, and CH = 11? In a real project, would you again prescribe the same approach, "Be open to the answer being 'k-means cannot cluster this data set well'"? Probably you would come up with another solution!
The main benefit of this question could be that you share your experience and approach if you have faced a similar issue!



