
I am trying to do some k-means clustering on a very large matrix.

The matrix is approximately 500000 rows x 4000 cols, but very sparse (only a couple of "1" values per row).

The whole thing does not fit into memory, so I converted it into a sparse ARFF file. But R obviously can't read the sparse ARFF file format. I also have the data as a plain CSV file.

Is there any package available in R for loading such sparse matrices efficiently? I'd then use the regular k-means algorithm from the cluster package to proceed.

Many thanks

movingabout
  • Thanks for the answer! I got another question though :-) I am trying to run bigkmeans with a cluster number of about 2000 e.g "clust – movingabout Jun 18 '10 at 07:49
    Original at http://stackoverflow.com/questions/3177827/clustering-on-very-large-sparse-matrix – Andrew Dalke Dec 20 '11 at 20:04

4 Answers


The bigmemory package (or now family of packages -- see their website) uses k-means as a running example of extended analytics on large data. See in particular the companion package biganalytics, which contains the k-means function bigkmeans().
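In outline that might look like the sketch below (hedged: "data.csv", the backing-file names, and centers = 20 are placeholders; also note a big.matrix is dense, so the backing file for 500000 x 4000 doubles will be large, just kept on disk rather than in RAM):

library(bigmemory)
library(biganalytics)

# Stream the CSV into a file-backed big.matrix so the full
# 500000 x 4000 matrix never has to fit into RAM at once.
x <- read.big.matrix("data.csv", type = "double",
                     backingfile = "data.bin",
                     descriptorfile = "data.desc")

# bigkmeans() mirrors stats::kmeans() but accepts a big.matrix.
fit <- bigkmeans(x, centers = 20, nstart = 3)
fit$size  # number of points per cluster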

Dirk Eddelbuettel

sparcl performs sparse hierarchical clustering and sparse k-means clustering. This should be good for R-suitable (so - fitting into memory) matrices.

http://cran.r-project.org/web/packages/sparcl/sparcl.pdf
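A minimal sketch, with toy data standing in for the real matrix (caveat: "sparse" in sparcl means L1-penalised feature selection, not sparse matrix storage, so the input must be an ordinary in-memory matrix):

library(sparcl)

set.seed(1)
x <- matrix(rnorm(200 * 50), nrow = 200)  # toy stand-in for the real data

# wbounds bounds the L1 norm of the feature weights; smaller values
# drive more of the weights to exactly zero
fit <- KMeansSparseCluster(x, K = 3, wbounds = 4)
fit[[1]]$Cs  # cluster assignments
fit[[1]]$ws  # feature weights (many driven to zero)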

==

For really big matrices, I would try a solution with Apache Spark's sparse matrices and MLlib; still, I do not know how experimental it is now:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$

https://spark.apache.org/docs/latest/mllib-clustering.html
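One way to drive this from R is the SparkR frontend that ships with Spark. A rough sketch (assumes a running Spark installation, "data.csv" is a placeholder, and the exact API differs across Spark versions):

library(SparkR)
sparkR.session()

df <- read.df("data.csv", source = "csv",
              header = "false", inferSchema = "true")

# spark.kmeans() fits MLlib k-means via a formula; ~ . uses all columns.
# k = 2000 picks up the cluster count mentioned in the comments above.
model <- spark.kmeans(df, ~ ., k = 2000, maxIter = 20)
summary(model)

sparkR.session.stop()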

MichalO

Please check:

library(foreign)
?read.arff
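
For instance (hedged: "data.arff" is a placeholder, and as far as I know foreign's read.arff() handles only the dense ARFF format, so the sparse file from the question may need converting first):

library(foreign)

dat <- read.arff("data.arff")  # returns an ordinary data.frame
str(dat)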

Cheers.

joran

There's a special SparseM package for R that can hold such a matrix efficiently. If that doesn't work, I would try moving to a higher-performance language, like C.
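A hypothetical sketch of building a SparseM matrix from triplet (row, column, value) data, which suits "a couple of 1s per row" without ever materialising the dense matrix (the indices below are toy placeholders):

library(SparseM)

i <- c(1L, 2L, 3L, 3L)  # row indices
j <- c(4L, 1L, 2L, 4L)  # column indices
v <- c(1, 1, 1, 1)      # the nonzero values

coo <- new("matrix.coo", ra = v, ia = i, ja = j,
           dimension = c(3L, 4L))
m <- as.matrix.csr(coo)  # convert to compressed sparse row form

Bear in mind that stats::kmeans() and the routines in the cluster package expect dense input, so sparse storage alone only solves the loading half of the problem.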

Olga Mu