Clustering high-dimensional sparse binary data

Question

I am trying to cluster Facebook users based on their likes.

I have two problems. First, since Facebook has no dislike button, all I have are likes (1) for some items; for the rest of the items the value is unknown and not necessarily zero (which would correspond to a dislike). If I use 0 for the unknowns, I think my clusters will be biased. Any suggestions?

Second, suppose I assign 0 to the unknown items and cluster the users with a hierarchical clustering method and a binary distance measure such as Jaccard, Tanimoto, ...

How can I evaluate the clustering results? The within-cluster and between-cluster SSE is not appropriate for binary data. If I use median centers, I am afraid most of them are going to be zero, as I have a sparse feature matrix. So what would be a good way to evaluate the clusters?

I don't think clustering is going to be useful here. You might look into network analysis. — Peter Flom, Nov 28 '12 at 21:39
I wonder if correspondence analysis would be helpful for you. You can find a brief guide on Quick-R. — gung - Reinstate Monica, Nov 28 '12 at 22:09
@user1848018 Setting the unknown user-item combinations to 0 is fine. Of course, 0 then only means "no like given yet" and allows no differentiation between "does not like" and "unknown", but this is not an issue, since Jaccard et al. define the similarity of two users by the overlap of the items both users like. — steffen, Nov 29 '12 at 07:46
Are suggestions for graph or network approaches inspired by the fact that this is Facebook data? Let's pretend our data is high-dimensional, sparse, and binary, but the data and our objectives are completely unrelated to SNA. We want to cluster objects based on a large number of binary features. What's the approach, and is it possible to address the OP's concern about confusing '0' with 'NaN's? — Aman, Dec 03 '12 at 18:29
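
To make steffen's point about Jaccard concrete, here is a minimal sketch with made-up users and item IDs (the function names and data are illustrative, not from the thread). It compares the Jaccard coefficient, which ignores items that neither user liked, with the simple matching coefficient, which counts shared zeros as agreement.

```python
def jaccard(a, b):
    """Jaccard similarity of two like-sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def simple_matching(a, b, n_items):
    """Simple matching coefficient: shared 1s plus shared 0s, over all items."""
    both = len(a & b)
    neither = n_items - len(a | b)
    return (both + neither) / n_items


# Hypothetical toy data: 1000 possible items, each user likes only three.
n_items = 1000
alice = {1, 2, 3}
bob = {4, 5, 6}      # no overlap with alice
carol = {1, 2, 7}    # shares two likes with alice

print(jaccard(alice, bob))                    # 0.0
print(simple_matching(alice, bob, n_items))   # 0.994
print(jaccard(alice, carol))                  # 0.5
```

With sparse data the simple matching coefficient rates two users with nothing in common as almost identical, because the many shared zeros dominate. Jaccard only looks at items liked by at least one of the two users, so coding "unknown" as 0 does not inflate the similarity.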

score 3 · Answer 1 · answered Nov 28 '12 at 23:36

Consider using a graph based approach.

Try to find a threshold to define when users are "somewhat similar". It can be quite low. Build a graph of these somewhat similar users.

Then use a Clique detection approach to find groups in this graph.
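
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the toy users, the threshold of 0.3, the choice of Jaccard as the similarity, and the plain Bron-Kerbosch recursion for maximal cliques are all my own picks, not something the answer prescribes. The all-pairs loop is quadratic in the number of users, so it is only suitable for small examples.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two like-sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(likes, threshold):
    """Adjacency sets: link two users if their similarity >= threshold."""
    adj = {u: set() for u in likes}
    for u, v in combinations(likes, 2):
        if jaccard(likes[u], likes[v]) >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-explored vertices
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical toy data: user -> set of liked items.
likes = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "d"},
    "u3": {"a", "b", "c", "d"},
    "u4": {"x", "y"},
    "u5": {"x", "y", "z"},
    "u6": {"q"},
}

graph = similarity_graph(likes, threshold=0.3)
groups = sorted(sorted(c) for c in maximal_cliques(graph))
print(groups)  # [['u1', 'u2', 'u3'], ['u4', 'u5'], ['u6']]
```

Isolated users come out as one-element cliques. Cliques may also overlap in general, so the output is a set of groups rather than a strict partition.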

Has QUIT--Anony-Mousse

(+1) yes, graph clustering is the way to go. I recommend this survey (Graph clustering by Satu Elisa Schaeffer), which also deals with the evaluation of clusters (see 4.2). – steffen Nov 29 '12 at 07:34
I agree with the general approach (and suggest using e.g. tanimoto similarity between 'like' sets), but why specifically Clique detection? It's one of the (types of) algorithms I favour least -- the definition of a cluster is very sensitive to removal of a single edge. – micans Nov 29 '12 at 10:02
Just as a starter to point him towards graph based methods instead of vector data point clustering. Starting with Clique he will surely discover approximate cliques etc. – Has QUIT--Anony-Mousse Nov 29 '12 at 11:53
@Anony-Mousse thanks for always providing good feedback. Would a graph-based approach scale to large data? For traditional clustering I could use Mahout, which is claimed to scale well, so how about a graph-based approach? I already know R has quite a few packages; any suggestions, R-based or Java-based? – user1848018 Nov 29 '12 at 16:25
@micans, I liked your comment about the cliques, but what would be an alternative to a clique detection approach? Would it be a model-based statistical approach? – user1848018 Nov 29 '12 at 16:28
I come at it from a different angle entirely. The algorithms I would recommend are RNSC, Louvain, and MCL; these are all very scalable. Caveat: mcl was written by me, so I am biased. With an approach like this the threshold will be quite important. I would construct a graph based on a low threshold (assuming you use a similarity), and investigate how traits (such as clustering coefficient and node degree) vary when adjusting the threshold, then pick a threshold that suggests inclusiveness (e.g. not too many singletons) but limits noise. – micans Nov 29 '12 at 16:40
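
The threshold investigation micans describes could look roughly like the sketch below. The toy data, the candidate thresholds, and the helper names are assumptions for illustration; the summary reports the number of singletons, the mean node degree, and the mean local clustering coefficient at each threshold, and it does not run RNSC, Louvain, or MCL itself.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two like-sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def graph_at(likes, threshold):
    """Adjacency sets for the graph that keeps edges with similarity >= threshold."""
    adj = {u: set() for u in likes}
    for u, v in combinations(likes, 2):
        if jaccard(likes[u], likes[v]) >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def local_clustering(adj, u):
    """Fraction of pairs of u's neighbours that are themselves linked."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
    return 2.0 * links / (k * (k - 1))


def summarize(likes, threshold):
    """Graph traits to compare across thresholds."""
    adj = graph_at(likes, threshold)
    n = len(adj)
    degrees = [len(nb) for nb in adj.values()]
    return {
        "threshold": threshold,
        "singletons": sum(1 for d in degrees if d == 0),
        "mean_degree": sum(degrees) / n,
        "mean_clustering": sum(local_clustering(adj, u) for u in adj) / n,
    }


# Hypothetical toy data: user -> set of liked items.
likes = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "d"},
    "u3": {"a", "b", "c", "d"},
    "u4": {"x", "y"},
    "u5": {"x", "y", "z"},
    "u6": {"q"},
}

for t in (0.3, 0.7, 0.9):
    print(summarize(likes, t))
```

On this toy data the number of singletons grows from 1 to 3 to 6 as the threshold rises, which is the kind of trade-off between inclusiveness and noise the comment suggests watching.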

score 1 · Answer 2 · answered Nov 29 '12 at 10:29

I suggest a cluster analysis. Johann Bacher discusses the different dissimilarity coefficients in depth in his script about cluster analysis, in particular the effects of how the absence of a trait is treated. For instance: are two items correlated when both show zero? I remember that he also works with a multiple-response example from survey research which is close to your problem.

The script can be downloaded from : http://www.clusteranalyse.net/sonstiges/zaspringseminar2002/

HTH ftr

FYI, in case you followed the link and hit the back button b/c the web page is in German, Johann Bacher's posted lecture notes are in English. — Aman, Nov 29 '12 at 17:15
