
I need to build a cosine matrix (i.e. a matrix holding the cosine value for every pair of vectors) for a set of 89,000 vectors of length 500, giving a final 89,000x89,000 matrix. My current approach is very inefficient and leads to very long processing times: for example, a set of 52,000 vectors of length 500 takes ~36 hours to build a 52,000x52,000 matrix.
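
To be concrete, each cell [i,j] of the matrix holds the cosine between row vectors i and j, which I compute with cosine() (assumed in the snippets below to be lsa::cosine); for two vectors it is just a normalised dot product:

library(lsa)                               #assuming cosine() is lsa::cosine
x <- rnorm(500)
y <- rnorm(500)
cosine(x, y)                               #the value stored in each COSmatrix[i,j]
sum(x * y) / sqrt(sum(x^2) * sum(y^2))     #the same quantity written out by hand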

My current solution uses R version 3.0.1 (2013-05-16), running on 64-bit Ubuntu 13.10 on an Intel Core i7 4960X CPU @ 3.60GHz x 12 with 64GB RAM. Despite the 64-bit system, I still run into vector-length errors thrown by R's native sub-functions (e.g. Error: ... Too many indices (>2^31-1) for extraction), and there does not seem to be a fix for that problem. My current solution therefore uses big.matrix objects from the bigmemory package, and I am also using the doParallel package to make use of all 12 processor cores on my workstation.
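
For a sense of scale, the full dense matrix has more elements than the 2^31-1 limit quoted in that error, and as an ordinary double matrix it would need roughly 59 GiB on its own, which is what pushed me towards the file-backed route:

n <- 89095
n^2                    # 7,937,919,025 elements, well past 2^31 - 1 = 2,147,483,647
n^2 * 8 / 1024^3       # ~59 GiB if held as an ordinary dense double matrix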

This is the code I am currently using:

#packages used below (cosine() is assumed to come from the lsa package)
library(bigmemory)
library(foreach)
library(doParallel)
library(lsa)
registerDoParallel(cores = 12) #register the parallel backend for %dopar% (12 cores)

setSize <- nrow(vectors_gw2014_FREQ_csMns) #i.e. =89,095

COSmatrix <- filebacked.big.matrix(
        #set dimensions and element value type
        setSize, setSize, init=0,
        type="double",
        backingpath    = './COSmatrices',
        backingfile    = "cosMAT_gw2014_VARppmi.bak",
        descriptorfile = "cosMAT_gw2014_VARppmi.dsc"
        )

#initialize progress bar (note: updates issued from inside %dopar% workers may
#not be displayed on the master console)
pb <- txtProgressBar(min = 0, max = setSize, style = 3)
feErr <- foreach(i=1:setSize) %dopar%  {
    #each worker attaches its own handle to the file-backed matrix created above
    COSmatrix <- attach.big.matrix("./COSmatrices/cosMAT_gw2014_VARppmi.dsc")
    setTxtProgressBar(pb, i)
    for (j in 1:setSize)
    {
        if (j < i)
        {
            #fill the lower triangle, then mirror the value into the upper triangle
            COSmatrix[i,j] <- cosine(   as.vector(vectors_gw2014_FREQ_csMns[i,],mode="numeric"),
                                        as.vector(vectors_gw2014_FREQ_csMns[j,],mode="numeric") )

            COSmatrix[j,i] <- COSmatrix[i,j]

        }
        else break
    }#FOR j
}#FOREACH DOPAR i
close(pb)

I suspect that the main source of the excessive processing time is the call that re-attaches the big.matrix object in every iteration of the main foreach loop:

COSmatrix <- attach.big.matrix("./COSmatrices/cosMAT_gw2014_VARppmi.dsc")

However, this seems to be necessary in order to access a big.matrix object from inside a foreach loop (i.e. the parallel construct provided by the doParallel package); without this line in the main loop, the COSmatrix object is not visible to the worker processes (see Using big.matrix in foreach loops).
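
For illustration, here is a sketch (not benchmarked) of the kind of restructuring I have been considering: hand each parallel task a whole block of rows, so that attach.big.matrix() runs once per block rather than once per row. The block size of 1,000 is arbitrary.

#sketch only: attach once per task, then process a whole block of rows
rowBlocks <- split(1:setSize, ceiling((1:setSize) / 1000)) #blocks of 1,000 row indices
feErr <- foreach(rows = rowBlocks) %dopar% {
    COSmatrix <- attach.big.matrix("./COSmatrices/cosMAT_gw2014_VARppmi.dsc")
    for (i in rows) {
        for (j in seq_len(i - 1)) { #lower triangle only, as before
            COSmatrix[i,j] <- cosine( as.vector(vectors_gw2014_FREQ_csMns[i,],mode="numeric"),
                                      as.vector(vectors_gw2014_FREQ_csMns[j,],mode="numeric") )
            COSmatrix[j,i] <- COSmatrix[i,j]
        }
    }
    NULL #nothing useful to return; the results live in the file-backed matrix
}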

I am looking for any and all suggestions for streamlining this process and cutting the processing time from days down to hours. I am open to other approaches, whether within R (e.g. alternatives to the bigmemory package) or with a completely different toolset (e.g. Python or C++ code). Please bear in mind that many (most?) of the commonly used R functions will not work with matrices of this size; I have explored many promising avenues only to run into the 32/64-bit long-vector limitation (i.e. Error: ... Too many indices (>2^31-1) for extraction; see Max Length for a Vector in R).
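
For example, one direction I have sketched but not yet benchmarked is to drop the per-element cosine() calls entirely: normalise the rows once, then fill the big.matrix a block of rows at a time using matrix products. The block size below is arbitrary, and this assumes the 89,095 x 500 input fits in RAM as an ordinary matrix (at ~340MB it should):

#unbenchmarked sketch: cosines via tcrossprod() on unit-length rows
vecs <- as.matrix(vectors_gw2014_FREQ_csMns)
vecs <- vecs / sqrt(rowSums(vecs^2))   #unit-length rows, so tcrossprod() gives cosines
blockSize <- 1000
for (start in seq(1, setSize, by = blockSize)) {
    end <- min(start + blockSize - 1, setSize)
    #each assignment writes a block of at most 1000 x 89,095 values, well under the 2^31-1 element limit
    COSmatrix[start:end, ] <- tcrossprod(vecs[start:end, , drop = FALSE], vecs)
}

If I am thinking about this correctly, that would replace roughly four billion individual cosine() calls with about 90 block multiplications, even before any parallelism.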

Cheers!

  • First of all, I suggest getting rid of the "if" statement and replacing "for (j in 1:setSize)" with "for (j in 1:i)", but to be honest that still won't be a very significant optimization. – Tomasz Posłuszny Aug 15 '14 at 23:54
  • I don't think that matrix will fit in the RAM space you have .... much less allow parallel processing with 12 processors, each of which needs space as well. – IRTFM Aug 16 '14 at 00:20
  • Just wondering, is the program CPU-bound? Looks like you could easily run out of memory on a matrix that size. Is it possible to partition the algorithm? – david.pfx Aug 16 '14 at 08:38
  • Thank you @Tomasz Posłuszny, every little bit helps. – Jeff Keith Aug 18 '14 at 20:26
  • @BondedDust, if you look closer at the code (I should have mentioned this) I am using a file-backed (onto a 250GB solid state HDD) big matrix object to address that very issue; using the file-backing circumvents memory limitations, but comes at the cost of performance. – Jeff Keith Aug 18 '14 at 20:29
  • @david.pfx, I am not sure how to answer your question; it goes beyond my expertise. – Jeff Keith Aug 18 '14 at 20:29
  • To be clear, I am less interested in having my current approach "fixed", per se, and am more interested in any solution (i.e. even those that make no use of either R or my current approach) that would improve performance. – Jeff Keith Aug 18 '14 at 20:35
  • @JeffKeith: The point is that a computational program should be CPU bound. All CPU cores should sit on 100% CPU and get hot. If not then either (a) you are memory bound and limited by the speed of your I/O system for paging, not your CPUs or (b) you are not fully parallel and just wasting cycles. – david.pfx Aug 19 '14 at 04:43

0 Answers