Intro
I built a clustering algorithm for a specific problem I have. The clustering algorithm wasn't my main goal; I just needed to separate the data into clusters prior to further processing, and I wanted to see whether I could cluster the data so that my downstream processing can later run on unlabeled data.
Some Details
It turns out that it worked great. However, my dataset might be too small, and I might just be overfitting. So my question is: can you point me toward standard benchmark datasets to compare my algorithm against?
Some more details: My algorithm is designed to cluster data of the following form:
- Each sample is a block (or sequence) of binary values.
- All samples are of the same size (same number of bits).
- Samples can be quite long (up to about a thousand bits each).
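In code terms, the data described above is just a fixed-width bit matrix; a throwaway sketch (shapes and values here are purely illustrative):

```python
import numpy as np

# Illustrative only: the dataset is an (n_samples, n_bits) array of 0/1
# values, every row the same length (here 1000 bits, the upper end above).
rng = np.random.default_rng(0)
X = (rng.random((200, 1000)) < 0.5).astype(np.uint8)
print(X.shape)  # (200, 1000)
```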
I presume that each sample originates from some (unknown) distribution and that each cluster corresponds to a different distribution. In that sense, my problem is very similar to fitting a GMM and classifying by maximum likelihood. The main differences are that the distributions aren't Gaussian (the variables are discrete), and that a GMM would perform very poorly for distributions with a dimension of 1000 due to the curse of dimensionality. Somehow (I'm not yet sure exactly how), my algorithm circumvents this problem, and I wonder whether that is an artifact of my dataset.
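For reference, the closest standard baseline to the GMM analogy above is probably a mixture of independent Bernoullis (a latent class model) fit with EM — the discrete analogue of a GMM for binary vectors. A minimal numpy sketch, with function names and toy data of my own invention:

```python
import numpy as np

def bernoulli_mixture_em(X, k, n_iter=100, seed=0):
    """EM for a mixture of k independent-Bernoulli components --
    the discrete analogue of a GMM for fixed-length binary vectors."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                       # mixing weights
    theta = rng.uniform(0.25, 0.75, size=(k, d))   # per-bit probabilities
    for _ in range(n_iter):
        # E-step: log p(x | component), summing over independent bits.
        log_p = (X @ np.log(theta).T
                 + (1 - X) @ np.log(1 - theta).T
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and bit probabilities.
        nk = resp.sum(axis=0)
        pi = nk / n
        theta = np.clip((resp.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return resp.argmax(axis=1), theta

# Toy usage: two well-separated synthetic clusters of 200-bit samples.
rng = np.random.default_rng(1)
X_demo = np.vstack([
    (rng.random((50, 200)) < 0.1),
    (rng.random((50, 200)) < 0.9),
]).astype(np.uint8)
labels, _ = bernoulli_mixture_em(X_demo, 2)
```

Because each bit is modeled by a probability rather than a Gaussian density, the per-component parameter count is just k x d, which is one common way of sidestepping the covariance blow-up you would get from a full GMM at d = 1000.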
My Questions
- Are there standard algorithms that try to accomplish the same goal, or at least try to cluster binary sequences of relevant lengths, that I can benchmark my algorithm against?
- Are there any datasets of binary sequences that I could use to test my algorithm on further data?
Comments
- "run multidimensional scaling on them, and then run a GMM on the MDS output" implies creating a points x features (Euclidean dimensions) dataset out of a distance matrix, because a GMM will call for such data as input. – ttnphns Oct 20 '22 at 16:03
- "k-means can work well with binary sequences" It could. And in text analysis, as you know, K-means is often applied to normalized binary data (i.e., k-means implied to be done on cosine similarity = on chord distance). The theoretical doubt remains: are we in the right to ever compute centroids directly in the "granular" space such as defined by binary features? Can, and when, is "mean" a valid concept for categorical data, including binary as categorical? – ttnphns Oct 20 '22 at 16:28
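The MDS-then-cluster pipeline from the first comment can be sketched as follows: classical (Torgerson) MDS computed from a Hamming-distance matrix, done in plain numpy, with the final GMM step left to a library of your choice (e.g. scikit-learn's GaussianMixture). All names and the toy data are illustrative:

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Classical MDS: embed binary rows into Euclidean coordinates
    from their pairwise Hamming distances (Torgerson's method)."""
    # Pairwise Hamming distances (fraction of differing bits).
    D = (X[:, None, :] != X[None, :, :]).mean(axis=2)
    n = D.shape[0]
    # Double-center the squared distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Top eigenpairs give the embedding coordinates.
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:n_components]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Toy demo: two synthetic clusters with very different bit probabilities.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    (rng.random((50, 200)) < 0.1),
    (rng.random((50, 200)) < 0.9),
]).astype(np.uint8)
Y_demo = classical_mds(X_demo)  # (100, 2) Euclidean coordinates
```

A GMM (or k-means) would then be fit on `Y_demo`, which is the points x features input the comment says a GMM requires.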