I have extensively researched the application of cross-validation to unsupervised learning (it is a requirement from my project manager), but there seems to be no clear consensus on how to put it into practice.
I am currently working on a clustering project and have selected 2 algorithms I want to tune:
- K-means Clustering
- HDBSCAN.
The business need is the following: separating customers into distinct segments and identifying the characteristics of each segment.
My performance metric is DBCV (Density-Based Clustering Validation), which I implemented through the HDBSCAN validity_index.
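For reference, a minimal sketch of the scoring call (dbcv_score is just a wrapper name for illustration, X and labels are placeholders; the cast to float64 is there because validity_index can complain about non-double input):
import numpy as np
from hdbscan.validity import validity_index

def dbcv_score(X, labels, metric='euclidean'):
    # DBCV ranges from -1 to 1; higher means denser, better-separated clusters
    return validity_index(np.asarray(X, dtype=np.float64), labels, metric=metric)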
The first problem is that GridSearchCV, RandomizedSearchCV and BayesSearchCV are not compatible with scoring metrics that have no ground truth (y_true), so I have to implement the cross-validation and grid search manually.
Here is what I have for KFold cross-validation of my KMeans model, with k as the only hyperparameter:
from sklearn.model_selection import KFold
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from hdbscan.validity import validity_index

scores = []
kf = KFold(n_splits=3)

for train_index, test_index in kf.split(X_train_tr):
    # Fit on the training fold, score on the held-out fold
    X_train = X_train_tr[train_index]
    X_test = X_train_tr[test_index]
    for k in range(2, 16):
        print(k)
        km = KMeans(n_clusters=k, random_state=10, n_init=10)
        km.fit(X_train)
        cluster_labels = km.predict(X_test)
        ssd = km.inertia_
        sil = silhouette_score(X_test, cluster_labels)
        val = validity_index(X_test, cluster_labels)
        scores.append({'K': k, 'SSD': ssd, 'Silhouette': sil, 'Validity': val})
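For completeness, the per-fold scores can then be averaged per k, in the same way as the groupby used for HDBSCAN further down (kmeans_scores_df is just an illustrative name):
import pandas as pd

# Mean SSD / silhouette / DBCV over the 3 folds for each value of k
kmeans_scores_df = pd.DataFrame(scores).groupby('K').mean().reset_index()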
Manual grid search with HDBSCAN:
import hdbscan
import pandas as pd
from sklearn.model_selection import KFold, ParameterGrid

# specify the parameter combinations to search over
param_grid = list(ParameterGrid(
    {
        'min_samples': [10, 100, 300, 500, 1000],
        'min_cluster_size': [100, 1000, 3000, 5000, 10000, 20000],
        'cluster_selection_method': ['eom', 'leaf'],
        'metric': ['euclidean', 'manhattan']
    }))

scores = []
i = 0
kf = KFold(n_splits=3)

for train_index, test_index in kf.split(X_train_tr):
    X_train = X_train_tr[train_index]
    X_test = X_train_tr[test_index]
    # Performing manual grid search
    for params in param_grid:
        print(i)
        i += 1
        hdb = hdbscan.HDBSCAN(prediction_data=True,
                              gen_min_span_tree=True,
                              min_samples=params['min_samples'],
                              min_cluster_size=params['min_cluster_size'],
                              cluster_selection_method=params['cluster_selection_method'],
                              metric=params['metric']).fit(X_train)
        # Assign the held-out points to the clusters found on the training fold
        cluster_labels = hdbscan.approximate_predict(hdb, X_test)[0]
        # sil = silhouette_score(X_test, cluster_labels)
        # DBCV on the held-out fold (validity_index uses its default euclidean metric here)
        val = validity_index(X_test, cluster_labels)
        # Relative DBCV on the training fold
        val_train = hdb.relative_validity_
        scores.append({'min_samples': params['min_samples'],
                       'min_cluster_size': params['min_cluster_size'],
                       'cluster_selection_method': params['cluster_selection_method'],
                       'metric': params['metric'],
                       # 'Silhouette': sil,
                       'Validity Test': val,
                       'Validity Train': val_train})

# Average the scores over the 3 folds for each parameter combination
scores_df = pd.DataFrame(scores).groupby(
    ['min_samples', 'min_cluster_size', 'cluster_selection_method', 'metric']).mean().reset_index()
This is working, in the sense that it gives sensible results, but something in the back of my mind tells me this is not right: the point of the model is to find the best clusters for the whole dataset, not to predict clusters for new data points.
That being said, one of the business requirements is to have stable clusters, which is why I think cross-validation could help in selecting the best model and hyperparameters.
So here are the 3 choices I'm pondering:
- Should I cross-validate based on the predicted clusters on my test set? (as implemented with the code above)
- Should I cross-validate only based on the performance of the clustering on the train set? (using the different splits to make sure my model is consistent; see the stability sketch after this list)
- Should I use another way of cross-validating my results?
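To make the "consistency" point in the second option concrete, here is a rough sketch of the kind of stability check I have in mind for KMeans (k=5 is only an example value, and the adjusted Rand index is used purely as an agreement measure between the labelings produced by models fitted on different folds; this is an illustration of the idea, not something I have settled on):
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import KFold

# Rough stability check: fit the same model on each training fold, label the
# full dataset with each fitted model, and compare the labelings pairwise with
# the adjusted Rand index (1.0 = identical partitions, ~0 = random agreement).
k = 5  # example value only
kf = KFold(n_splits=3, shuffle=True, random_state=10)
labelings = []
for train_index, _ in kf.split(X_train_tr):
    km = KMeans(n_clusters=k, random_state=10, n_init=10).fit(X_train_tr[train_index])
    labelings.append(km.predict(X_train_tr))

stability = np.mean([adjusted_rand_score(a, b)
                     for a, b in combinations(labelings, 2)])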
