Intro
I built a clustering algorithm for a specific problem I have. The clustering algorithm wasn't my main goal; I just needed to separate the data into clusters prior to further processing, and I wanted to see whether I could cluster the data so that my downstream processing can later run on unlabeled data.
Some Details
It turns out that it worked great. However, my dataset might be too small, and I might just be overfitting. So my question is: can you point me toward standard benchmark datasets to compare my algorithm against?
Some more details: My algorithm is designed to cluster data of the following form:
- Each sample is a block (or sequence) of binary values.
- All samples are of the same size (same number of bits).
- Samples can be quite long (up to about a thousand bits each).
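In code terms, the data described above is just a fixed-width bit matrix; a throwaway sketch (shapes and values here are purely illustrative):

```python
import numpy as np

# Illustrative only: the dataset is an (n_samples, n_bits) array of 0/1
# values, every row the same length (here 1000 bits, the upper end above).
rng = np.random.default_rng(0)
X = (rng.random((200, 1000)) < 0.5).astype(np.uint8)
print(X.shape)  # (200, 1000)
```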
I presume that each sample originates from some (unknown) distribution and that each cluster corresponds to a different distribution. In that sense, my problem is very similar to fitting a GMM and classifying by maximum likelihood. The main differences are that the distributions aren't Gaussian (the variables are discrete), and that a GMM would perform very poorly for distributions with a dimension of 1000 due to the curse of dimensionality. Somehow (I'm not yet sure exactly how), my algorithm circumvents this problem, and I wonder whether that is an artifact of my dataset.
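For reference, the closest standard baseline to the GMM analogy above is probably a mixture of independent Bernoullis (a latent class model) fit with EM — the discrete analogue of a GMM for binary vectors. A minimal numpy sketch, with function names and toy data of my own invention:

```python
import numpy as np

def bernoulli_mixture_em(X, k, n_iter=100, seed=0):
    """EM for a mixture of k independent-Bernoulli components --
    the discrete analogue of a GMM for fixed-length binary vectors."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                       # mixing weights
    theta = rng.uniform(0.25, 0.75, size=(k, d))   # per-bit probabilities
    for _ in range(n_iter):
        # E-step: log p(x | component), summing over independent bits.
        log_p = (X @ np.log(theta).T
                 + (1 - X) @ np.log(1 - theta).T
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and bit probabilities.
        nk = resp.sum(axis=0)
        pi = nk / n
        theta = np.clip((resp.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return resp.argmax(axis=1), theta

# Toy usage: two well-separated synthetic clusters of 200-bit samples.
rng = np.random.default_rng(1)
X_demo = np.vstack([
    (rng.random((50, 200)) < 0.1),
    (rng.random((50, 200)) < 0.9),
]).astype(np.uint8)
labels, _ = bernoulli_mixture_em(X_demo, 2)
```

Because each bit is modeled by a probability rather than a Gaussian density, the per-component parameter count is just k x d, which is one common way of sidestepping the covariance blow-up you would get from a full GMM at d = 1000.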
My Questions
- Are there standard algorithms that try to accomplish the same goal, or at least try to cluster binary sequences of relevant lengths, that I can benchmark my algorithm against?
- Are there any datasets of binary sequences that I could use to test my algorithm on further data?
Comments
- "run multidimensional scaling on them, and then run a GMM on the MDS output" implies creating a points x features (Euclidean dimensions) dataset out of a distance matrix, because a GMM will call for such data as input. – ttnphns Oct 20 '22 at 16:03
- "k-means can work well with binary sequences" It could. And in text analysis, as you know, K-means is often applied to normalized binary data (i.e., k-means implied to be done on cosine similarity = on chord distance). The theoretical doubt remains: are we in the right to ever compute centroids directly in the "granular" space such as defined by binary features? Can, and when, is "mean" a valid concept for categorical data, including binary as categorical? – ttnphns Oct 20 '22 at 16:28
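The MDS-then-cluster pipeline from the first comment can be sketched as follows: classical (Torgerson) MDS computed from a Hamming-distance matrix, done in plain numpy, with the final GMM step left to a library of your choice (e.g. scikit-learn's GaussianMixture). All names and the toy data are illustrative:

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Classical MDS: embed binary rows into Euclidean coordinates
    from their pairwise Hamming distances (Torgerson's method)."""
    # Pairwise Hamming distances (fraction of differing bits).
    D = (X[:, None, :] != X[None, :, :]).mean(axis=2)
    n = D.shape[0]
    # Double-center the squared distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Top eigenpairs give the embedding coordinates.
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:n_components]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Toy demo: two synthetic clusters with very different bit probabilities.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    (rng.random((50, 200)) < 0.1),
    (rng.random((50, 200)) < 0.9),
]).astype(np.uint8)
Y_demo = classical_mds(X_demo)  # (100, 2) Euclidean coordinates
```

A GMM (or k-means) would then be fit on `Y_demo`, which is the points x features input the comment says a GMM requires.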