I have just started learning about machine learning, and while browsing the web I saw that another CV user, in this post, suggested the Markov Cluster (MCL) algorithm for clustering long strings. As far as I know, MCL is a clustering algorithm for graphs, so how could it be used to cluster strings? Does it need to be modified for this purpose, or can it take strings as input directly?
Viewed 2,314 times
2 Answers
1
You might consider the two original approaches to analyzing strings in text mining: 1) stemming and stopping (stop-word removal), and 2) n-grams. I have had a great deal of success using n-grams on peptide strings (sequences of amino acids, AA) and then clustering the resulting n-gram features for QSAR (quantitative structure-activity relationship) between molecules. (See, e.g., SMILES strings for the characterization of molecules.)
I would not recommend focusing on anything Markov until you understand the basics.
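As a rough illustration of the n-gram approach, here is a minimal sketch using only the standard library. It counts overlapping character trigrams per string and compares the resulting count vectors with cosine similarity. The helper names (`char_ngrams`, `cosine`) and the toy sequences are made up for this example and are not taken from any real data set; the resulting similarity matrix could then be passed to whatever clustering method you prefer.

```python
from collections import Counter
from math import sqrt


def char_ngrams(s, n=3):
    """Count the overlapping character n-grams of a string."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy sequences, invented purely for illustration.
peptides = ["ACDEFGHIK", "ACDEFGHLK", "WYWYWYWYW", "YWYWYWYWY"]

# One trigram profile per string, then all pairwise similarities.
profiles = [char_ngrams(p, n=3) for p in peptides]
similarity = [[cosine(a, b) for b in profiles] for a in profiles]

for name, row in zip(peptides, similarity):
    print(name, [round(v, 2) for v in row])
```

Strings that share many trigrams score close to 1, and strings with no trigram in common score 0, so the first two and the last two sequences end up forming two obvious groups.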
0
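Using Markov clustering to cluster by words is fairly easy with this module.
Simply run:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# `mcl` is assumed to be imported from the module mentioned above
# Count words, ignoring those that occur in more than 1% of the documents
X = CountVectorizer(max_df=10**-2, min_df=10**-7).fit_transform(docs)
# Scale each row to unit L2 norm, without IDF weighting
X = TfidfTransformer(use_idf=False).fit_transform(X)
clusters = mcl(X).run().clusters()
where docs is a list of your strings.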
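To connect this back to the question: MCL itself only ever sees a weighted graph, so the strings have to be turned into a pairwise similarity matrix first, and that matrix is treated as the adjacency matrix. The following is a minimal pure-Python sketch of the expansion/inflation loop under that assumption. The function name `markov_cluster`, its parameters, and the similarity values are illustrative only; this is not the API of the module above, and a real implementation would use sparse matrices and pruning.

```python
def normalize_columns(m):
    """Scale every column so it sums to 1 (a column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(similarity, expansion=2, inflation=2.0, max_iter=100, tol=1e-9):
    """Toy MCL on a square, symmetric, non-negative similarity matrix."""
    n = len(similarity)
    # Add self-loops so that every column sum is positive.
    m = normalize_columns(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iter):
        previous = m
        # Expansion: raise the matrix to a power (longer random walks).
        for _ in range(expansion - 1):
            m = matmul(m, previous)
        # Inflation: element-wise power plus renormalization favors strong edges.
        m = normalize_columns([[v ** inflation for v in row] for row in m])
        if all(abs(m[i][j] - previous[i][j]) < tol
               for i in range(n) for j in range(n)):
            break
    # Rows that still carry mass are attractors; their nonzero columns form a cluster.
    clusters = {tuple(j for j, v in enumerate(row) if v > 1e-6) for row in m}
    clusters.discard(())
    return sorted(clusters)


# Hypothetical pairwise string similarities (e.g. from n-gram overlap), made up here.
similarity = [
    [0.0, 0.7, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.9, 0.0],
]
clusters = markov_cluster(similarity)
print(clusters)  # [(0, 1), (2, 3)]
```

Each tuple holds the indices of the strings assigned to one cluster. The inflation value is the main knob: larger values tend to give smaller, tighter clusters.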
Uri Goren
"Use graph clustering algorithms, such as Louvain clustering, Restricted Neighbourhood Search Clustering (RNSC), Affinity Propagation Clustering (APC), or the Markov Cluster algorithm (MCL)"
– stackunderflow Apr 11 '15 at 20:06