
I have the following situation: I have a sequence of vectors $x_1, x_2, \dots$ and for each one I want to compute the product $Ax_i$, where $A$ is fixed at the outset. Although there is no information about the structure of the $x_i$, $A$ typically has a particular pattern in which many values are repeated, and I would like to compute these products as fast as possible.

One example of $A$ looks like this:

[image: an example of the pattern of $A$]

Here the white regions are 0.

I wonder if there is some way of storing information about $A$, or modifying it somehow, that would allow me to reduce the number of operations for each product. For rows that are all 0 this is trivial – one can simply store the indices of such rows. It is also possible to record which rows are duplicated so as to reuse row computations. I have also considered reordering the rows of the matrix so as to minimize the mean difference between consecutive rows and only computing the difference at each row. This seems to run into problems for the more complicated patterns, however.

I was wondering whether there are any known methods for these kinds of problems.

Edit: another idea I had is that, since the number of unique values in the matrix is fairly low, one could decompose the product as $Ax = A_1x + A_2x + \dots + A_nx$, where each $A_i$ contains only one unique value, but I'm still not sure whether this can provide any advantage for this problem.

  • By breaking the matrix into blocks: 1) if a block has all rows the same, you can multiply the block with a particular vector once and reuse that partial dot product for every row in the block. 2) If a block has all columns the same, the optimization by D.W. works. 3) If you can batch-process a bunch of vectors, you can do matrix multiplication on the GPU. 4) If you decompose into $A_1, A_2, \ldots$ then you can write each as a constant times a binary matrix, and multiplication by a binary matrix requires only additions, so you need only one multiplication per distinct value. – Mar 15 '18 at 00:56
  • If there are $d$ distinct values in a row of the matrix, and $d$ is much less than the total number of elements in a row (which often looks to be the case here), you need only $d$ multiplications to calculate the dot product of that row with the $x$ vector, since multiplication distributes over addition: e.g., $ax_{i1} + bx_{i2} + bx_{i3} + ax_{i4} + ax_{i5} = a(x_{i1} + x_{i4} + x_{i5}) + b(x_{i2} + x_{i3})$. – j_random_hacker Mar 14 '18 at 19:32
  • 2
    I can see some blocks of columns that, in many rows, are identical (e.g., a block of columns that is all-yellow in many rows). For a given vector $x_i$, if you compute the sum of the elements of $x_i$ in that block, then you can use that to speed up things for those rows. – D.W. Mar 14 '18 at 18:44
  • 1
    To get an idea: are you currently using BLAS? – Mauro Vanzetto Mar 15 '18 at 14:32
  • @MauroVanzetto not using any linalg libraries. Everything is stored as builtin C/C++ vectors/arrays – Slug Pue Mar 15 '18 at 14:45
  • 1
    And how do you compute the product now? Let me make a practical point: using BLAS, directly or indirectly through another library, lets you exploit your hardware in a near-optimal way (something very difficult to achieve with a custom matrix-vector product). So with BLAS you may achieve a big speedup with limited effort. – Mauro Vanzetto Mar 15 '18 at 14:57
  • Why not use Cuthill–McKee or Reverse Cuthill–McKee (RCM) to permute the rows and/or columns and obtain a band matrix? In Python, use reverse_cuthill_mckee. In MATLAB, use symrcm. – Rodrigo de Azevedo Mar 15 '18 at 16:15
  • @MauroVanzetto currently this is done by plain nested looping, going row-major through the matrix – Slug Pue Mar 15 '18 at 16:18