I actually wrote the original code in Matlab for A*B, both A and B sparse. Pre-allocation of space for the result was indeed the interesting part. We observed what Godric points out -- that knowing the number of nonzeros in AB is as costly as computing AB.
We did the initial implementation of sparse Matlab around 1990, before the Edith Cohen paper that gave the first practical, fast way to estimate the size of AB accurately.
We put together an inferior size estimator, and if we ran out of space in mid-computation, we doubled the allocation and copied the partially computed result.
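A minimal sketch of that grow-and-copy fallback, in C rather than the original Matlab internals (the `nzbuf` struct and function names are illustrative, not the actual implementation):

```c
#include <stdlib.h>

/* Growable buffer for the nonzeros of the result: start from a rough
   estimate, and when it runs out mid-computation, double the allocation
   and copy the partial result (realloc does the copy for us). */
typedef struct {
    double *val;   /* nonzero values            */
    int    *row;   /* corresponding row indices */
    size_t  nnz;   /* nonzeros stored so far    */
    size_t  cap;   /* current allocated space   */
} nzbuf;

static int nzbuf_init(nzbuf *b, size_t cap0) {
    b->val = malloc(cap0 * sizeof *b->val);
    b->row = malloc(cap0 * sizeof *b->row);
    b->nnz = 0;
    b->cap = cap0;
    return (b->val && b->row) ? 0 : -1;
}

static int nzbuf_push(nzbuf *b, int i, double x) {
    if (b->nnz == b->cap) {                    /* out of space mid-computation */
        size_t ncap = b->cap * 2;              /* double the allocation        */
        double *nv = realloc(b->val, ncap * sizeof *nv);
        int    *nr = realloc(b->row, ncap * sizeof *nr);
        if (!nv || !nr) return -1;
        b->val = nv;
        b->row = nr;
        b->cap = ncap;
    }
    b->row[b->nnz] = i;
    b->val[b->nnz] = x;
    b->nnz++;
    return 0;
}
```

The cost of the copies is amortized: doubling means each nonzero is moved O(1) times on average, so a bad initial estimate costs a constant factor, not a quadratic blowup.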
I don't know what's in Matlab now.
Another possibility would be to compute AB one column at a time. Each column can be stored temporarily in a sparse accumulator (see the sparse Matlab paper for an explanation of these), and space allocated to hold the exactly known size of the result column. The result would be in scattered compressed sparse column form -- each column in CSC but no intercolumn contiguity -- using 2 vectors of length numcols (col start, col length), rather than one, as metadata. It's a storage form that may be worth a look;
it has another strength -- you can grow a column without reallocating the whole matrix.
Comments:

– Stefano M (Jul 17 '12 at 20:29): […] `nnz(A*B)` in advance? Is it not feasible to just start with a rough estimate and then `realloc` while computing the matrix product, if necessary? A model implementation may be found in the 2nd chapter of the sparse backslash book by Tim Davis.

– Recker (Jul 17 '12 at 20:40): […] the `realloc` method, but my GPU has limited memory, and even though dynamic memory allocation is possible in CUDA, one must allocate the chunk of memory in advance to use `malloc` and `free` on the GPU. Further, I don't want to use page-locked memory (which might incur extra PCI-E transfers) that can potentially slow down the computations. As one of the answers has already mentioned, I am thinking of launching a kernel which will simulate `A*B` just for the sake of calculating `nnz(A*B)`, and then use it to allocate the exact memory for the resultant matrix.

– Stefano M (Jul 17 '12 at 21:11): […] `A*B` is the way to go... not only for allocating memory, but also for an efficient implementation of the product itself. Good luck.
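The "simulate the multiply first" idea from the comments is the classic symbolic pass: traverse the same row patterns the numeric multiply would touch, counting first touches per result column, without computing any values. A hedged sequential sketch (the function name and the column-index marking trick are illustrative, not taken from any particular library; a CUDA version would assign columns of B to thread blocks):

```c
/* Count nnz(A*B) without computing values, so the exact result size can
   be allocated up front. A and B are in CSC pattern form (colptr/rowind).
   mark : workspace of length = rows of A, initialized to -1; storing the
          current column index j avoids re-clearing it between columns. */
static long symbolic_nnz(int n_bcols,
                         const int *Acolptr, const int *Arowind,
                         const int *Bcolptr, const int *Browind,
                         int *mark)
{
    long nnz = 0;
    for (int j = 0; j < n_bcols; j++) {
        for (int q = Bcolptr[j]; q < Bcolptr[j + 1]; q++) {
            int k = Browind[q];                    /* B(k,j) is nonzero      */
            for (int p = Acolptr[k]; p < Acolptr[k + 1]; p++) {
                int i = Arowind[p];                /* A(i,k) hits C(i,j)     */
                if (mark[i] != j) {                /* first touch in column j */
                    mark[i] = j;
                    nnz++;
                }
            }
        }
    }
    return nnz;
}
```

Note the tradeoff the answer mentions: this pass does essentially the same traversal work as the multiply itself (minus the floating-point), which is why knowing nnz(AB) exactly is about as costly as computing AB.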