I have extensively researched the application of cross-validation to unsupervised learning (it is a requirement from my project manager), but there seems to be no clear consensus on how to put it into practice.
I am currently working on a clustering project and have selected 2 algorithms I want to tune:
- K-means Clustering
- HDBSCAN.
The business need is the following: separating customers into distinct segments and identifying the characteristics of each segment.
My performance metric is DBCV (Density-Based Clustering Validation), which I implemented through the HDBSCAN validity_index.
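For reference, a minimal sketch of the scoring call (dbcv_score is just a wrapper name for illustration, X and labels are placeholders; the cast to float64 is there because validity_index can complain about non-double input):
import numpy as np
from hdbscan.validity import validity_index

def dbcv_score(X, labels, metric='euclidean'):
    # DBCV ranges from -1 to 1; higher means denser, better-separated clusters
    return validity_index(np.asarray(X, dtype=np.float64), labels, metric=metric)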
The first problem is that GridSearchCV, RandomizedSearchCV and BayesSearchCV are not compatible with scoring metrics that have no ground truth (y_true), so I have to implement the cross-validation and grid search manually.
Here is what I have for KFold cross-validation of my KMeans model, with k as the only hyperparameter:
from sklearn.model_selection import KFold
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from hdbscan.validity import validity_index

scores = []
kf = KFold(n_splits=3)

for train_index, test_index in kf.split(X_train_tr):
    # Fit on the training fold, score on the held-out fold
    X_train = X_train_tr[train_index]
    X_test = X_train_tr[test_index]
    for k in range(2, 16):
        print(k)
        km = KMeans(n_clusters=k, random_state=10, n_init=10)
        km.fit(X_train)
        cluster_labels = km.predict(X_test)
        ssd = km.inertia_
        sil = silhouette_score(X_test, cluster_labels)
        val = validity_index(X_test, cluster_labels)
        scores.append({'K': k, 'SSD': ssd, 'Silhouette': sil, 'Validity': val})
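For completeness, the per-fold scores can then be averaged per k, in the same way as the groupby used for HDBSCAN further down (kmeans_scores_df is just an illustrative name):
import pandas as pd

# Mean SSD / silhouette / DBCV over the 3 folds for each value of k
kmeans_scores_df = pd.DataFrame(scores).groupby('K').mean().reset_index()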
Manual grid search with HDBSCAN:
import hdbscan
import pandas as pd
from sklearn.model_selection import KFold, ParameterGrid

# specify the parameter combinations to search over
param_grid = list(ParameterGrid(
    {
        'min_samples': [10, 100, 300, 500, 1000],
        'min_cluster_size': [100, 1000, 3000, 5000, 10000, 20000],
        'cluster_selection_method': ['eom', 'leaf'],
        'metric': ['euclidean', 'manhattan']
    }))

scores = []
i = 0
kf = KFold(n_splits=3)

for train_index, test_index in kf.split(X_train_tr):
    X_train = X_train_tr[train_index]
    X_test = X_train_tr[test_index]
    # Performing manual grid search
    for params in param_grid:
        print(i)
        i += 1
        hdb = hdbscan.HDBSCAN(prediction_data=True,
                              gen_min_span_tree=True,
                              min_samples=params['min_samples'],
                              min_cluster_size=params['min_cluster_size'],
                              cluster_selection_method=params['cluster_selection_method'],
                              metric=params['metric']).fit(X_train)
        # Assign the held-out points to the clusters found on the training fold
        cluster_labels = hdbscan.approximate_predict(hdb, X_test)[0]
        # sil = silhouette_score(X_test, cluster_labels)
        # DBCV on the held-out fold (validity_index uses its default euclidean metric here)
        val = validity_index(X_test, cluster_labels)
        # Relative DBCV on the training fold
        val_train = hdb.relative_validity_
        scores.append({'min_samples': params['min_samples'],
                       'min_cluster_size': params['min_cluster_size'],
                       'cluster_selection_method': params['cluster_selection_method'],
                       'metric': params['metric'],
                       # 'Silhouette': sil,
                       'Validity Test': val,
                       'Validity Train': val_train})

# Average the scores over the 3 folds for each parameter combination
scores_df = pd.DataFrame(scores).groupby(
    ['min_samples', 'min_cluster_size', 'cluster_selection_method', 'metric']).mean().reset_index()
This is working, in the sense that it gives sensible results, but something in the back of my mind tells me this is not right: the point of the model is to find the best clusters for the whole dataset, not to predict clusters for new data points.
That being said, one of the business requirements is to have stable clusters, which is why I think cross-validation could help in selecting the best model and hyperparameters.
So here are the 3 choices I'm pondering:
- Should I cross-validate based on the predicted clusters on my test set? (as implemented with the code above)
- Should I cross-validate only based on the performance of the clustering on the train set? (using the different splits to make sure my model is consistent; see the stability sketch after this list)
- Should I use another way of cross-validating my results?
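To make the "consistency" point in the second option concrete, here is a rough sketch of the kind of stability check I have in mind for KMeans (k=5 is only an example value, and the adjusted Rand index is used purely as an agreement measure between the labelings produced by models fitted on different folds; this is an illustration of the idea, not something I have settled on):
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import KFold

# Rough stability check: fit the same model on each training fold, label the
# full dataset with each fitted model, and compare the labelings pairwise with
# the adjusted Rand index (1.0 = identical partitions, ~0 = random agreement).
k = 5  # example value only
kf = KFold(n_splits=3, shuffle=True, random_state=10)
labelings = []
for train_index, _ in kf.split(X_train_tr):
    km = KMeans(n_clusters=k, random_state=10, n_init=10).fit(X_train_tr[train_index])
    labelings.append(km.predict(X_train_tr))

stability = np.mean([adjusted_rand_score(a, b)
                     for a, b in combinations(labelings, 2)])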
