How can Markov cluster algorithms be used to cluster strings?

Question

I have just start learning about Machine Learning and while surfing on the web, I saw that another CV user in those post has offered Markov cluster algorithms to cluster long strings. As far as I know, MCL is cluster algorithm for graphs, how could it be used to cluster strings? Should it be modified for this purpose, or naturally it can also takes strings as input?

MCL = Markov Cluster Algorithm, The answer that I've found can be accessed from here : http://stats.stackexchange.com/questions/123060/clustering-a-long-list-of-strings-words-into-similarity-groups
"Use graph clustering algorithms, such as Louvain clustering, Restricted Neighbourhood Search Clustering (RNSC), Affinity Propgation Clustering (APC), or the Markov Cluster algorithm (MCL" — stackunderflow, Apr 11 '15 at 20:06
I am that guy. MCL is used a lot for clustering proteins based on their amino acid sequence (which is just a string over alphabet of size 20). In that context, the input to mcl consists of triples ID1 ID2 LOG-SCORE, where LOG-SCORE is an application-specific measure of similarity (derived from BLAST). So MCL will not compute similarity (derived from a dissimilarity such as edit-distance) between strings for you. It requires you to feed it a list of string ID pairs together with a similarity for each of those pairs. — micans, Apr 13 '15 at 13:39
@micans thanks for reply, My problem is that, for a given chunk of genes, I have to cluster them based on their sequence similarities, and I want to do this clustering operation with MCL. Do you think it is possible? — stackunderflow, Apr 15 '15 at 08:29
It depends on what your notion of similarity is. Are these protein coding genes? lncrnas? How do you want to measure similarity? If they are protein coding then the answer is unequivocally yes (and you should derive the similiary from blastp-type E-values or bit scores). If they are not it would be good to know what type of genes they are. To be sure, with 'chunk of genes' you just mean 'a number of genes'? How many? What is the purpose of doing cluster analysis? Your notion of gene similarity and how you will compute it is crucial in this - MCL will not do that for you. — micans, Apr 15 '15 at 09:41
In case you try to use the code someone posted on July 6 for the Markov Clustering, Python doesn't recognize the function mcl. Also checkout this to understand how to tune max and min df https://stackoverflow.com/questions/27697766/understanding-min-df-and-max-df-in-scikit-countvectorizer — Gabriel Alon, Aug 15 '18 at 18:37

score 1 · Answer 1 · answered Apr 11 '15 at 22:57

You might consider the original two approaches for analyzing strings in text mining based on 1) stemming and stopping and 2) n-grams. I have had a great deal of success using n-grams on peptide strings (of amino acids, AA) and then clustering the results from n-grams for QSAR (quantitative structural activity relationship) between molecules. (Look at, e.g., SMILES strings for molecular characterization of molecules).

Would not recommend focusing on Markov anything until you understand the basics.

score 0 · Answer 2 · answered Jul 06 '17 at 20:45

Using Markov clustering to cluster by words is fairly easy, using this module

Simply run:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
X = CountVectorizer(max_df=10**-2, min_df=10**-7).fit_transform(docs)
X = TfidfTransformer(use_idf=False).fit_transform(X)
clusters = mcl(X).run().clusters()

Where docs is an array of your strings

How can Markov cluster algorithms be used to cluster strings?

2 Answers2