As part of my studies, I’m trying to cluster co-occurrences of URLs and tags in data from Delicious. I found a promising method for this in a paper called “Emergent Semantics from Folksonomies: A Quantitative Study” (pages 6-13). It uses a Separable Mixture Model (SMM, described in the paper “Statistical Models for Co-occurrence Data”, pages 2-4) to model the data and an adapted EM algorithm to fit the model to the observed data.
I coded the algorithm in Java and ran it on a small sample of real data from Delicious. Unfortunately, the results did not seem correct: each tag had the same probability of belonging to every concept (although that probability varied from tag to tag).
Now, while this problem could have come from a simple mistake in my implementation of the adapted EM algorithm, I would also like to rule out incorrectly initialized variables. Since I didn’t know a better way to do it, I initialized all the $R_{r\alpha}$ (the variables denoting the probability that co-occurrence $r$ arose from concept $\alpha$) to the same value, $1/K$ ($K$ being the number of concepts).
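To make the initialization concrete, here is a minimal Java sketch of the flat start described above, next to a randomized start for comparison. Random starts are a common way to break symmetry between components when fitting mixture models. The class name `ResponsibilityInit`, the method names, the seed, and the `0.1` offset are all hypothetical choices for illustration. None of them come from the papers or from my actual code.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only: two ways to fill the R[r][alpha] table before the first EM iteration.
class ResponsibilityInit {

    // Flat start: every co-occurrence r gets R[r][alpha] = 1/K for all K concepts.
    static double[][] uniform(int numCooccurrences, int numConcepts) {
        double[][] R = new double[numCooccurrences][numConcepts];
        for (double[] row : R) {
            Arrays.fill(row, 1.0 / numConcepts);
        }
        return R;
    }

    // Randomized start (an assumed alternative): positive random values,
    // with each row normalized so that it sums to 1.
    static double[][] randomNormalized(int numCooccurrences, int numConcepts, long seed) {
        Random rng = new Random(seed);
        double[][] R = new double[numCooccurrences][numConcepts];
        for (int r = 0; r < numCooccurrences; r++) {
            double sum = 0.0;
            for (int a = 0; a < numConcepts; a++) {
                // The 0.1 offset (arbitrary) keeps every entry strictly positive.
                R[r][a] = 0.1 + rng.nextDouble();
                sum += R[r][a];
            }
            for (int a = 0; a < numConcepts; a++) {
                R[r][a] /= sum;
            }
        }
        return R;
    }

    public static void main(String[] args) {
        // Small example: 3 co-occurrences, 4 concepts.
        System.out.println(Arrays.deepToString(uniform(3, 4)));
        System.out.println(Arrays.deepToString(randomNormalized(3, 4, 42L)));
    }
}
```

With the flat start, every row of the table is identical across concepts. With the randomized start, the rows differ but each still sums to 1.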
My question is two-fold.
Could the flat results come from the flat variable initialization?
How should I initialize the variables of the EM algorithm in this case?