
I have a system of conductors for which there are two dense matrices of the (complex) mutual admittances, $Y_A$ and $Y_B$, which are symmetric. Then, an equivalent nodal admittance matrix $Y_N$ is calculated by the following:

$$ Y_N = A^T \cdot Y_A \cdot A + B^T \cdot Y_B \cdot B $$

where $A$ is a directed incidence matrix (directed graph) for which each element $a_{ik}$ is given by

$$ a_{ik} = \begin{cases} 1 \quad \text{if node $k$ is the start point of conductor $i$} \\ -1 \quad\text{if node $k$ is the end point of conductor $i$} \\ 0 \quad\text{otherwise} \\ \end{cases} $$

and $B$ is an undirected incidence matrix for which the elements $b_{ik}$ are given by

$$ b_{ik} = \frac{|a_{ik}|}{2} = \begin{cases} 1 / 2 \quad \text{if node $k$ is connected to conductor $i$} \\ 0 \quad\text{otherwise} \\ \end{cases} $$

Each conductor has exactly one start point and exactly one end point, which makes $A$ and $B$ sparse, with only two non-zero entries per row. But I store them as dense matrices so I can pass them to BLAS and LAPACK routines.

The system has $m$ conductors and $n$ nodes. $Y_A$ and $Y_B$ are $m \times m$ and $A$ and $B$ are $m \times n$.
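For a tiny concrete case, $A$ and $B$ can be built directly from the per-conductor endpoint lists (a sketch; the `start`/`end` arrays are illustrative names, not part of the original code):

```python
import numpy as np

def incidence_matrices(start, end, n):
    """Build the directed incidence matrix A and the undirected B = |A| / 2.

    start[i] and end[i] are the node indices of conductor i; n is the
    number of nodes. Each row of A has exactly two non-zeros: +1 and -1.
    """
    m = len(start)
    A = np.zeros((m, n))
    A[np.arange(m), start] = 1.0   # node k is the start point of conductor i
    A[np.arange(m), end] = -1.0    # node k is the end point of conductor i
    B = np.abs(A) / 2.0            # b_ik = |a_ik| / 2
    return A, B
```

For example, three conductors 0→1, 1→2, 2→0 give a $3 \times 3$ matrix $A$ whose rows each sum to zero, and $B$ whose rows each sum to one.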


I am using BLAS to do these calculations, calling CSYMM and CGEMM twice each. The following pseudo-code summarizes the steps.

calculate A and B, storing them as dense complex matrices
for i = 1:N
    calculate YA and YB
    C := YA * A + 0 * C  # CSYMM
    YN := A^T * C + 0 * YN  # CGEMM
    C := YB * B + 0 * C  # CSYMM
    YN := B^T * C + 1 * YN  # CGEMM
end
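The four calls inside the loop can be sketched in Python with SciPy's BLAS wrappers (`csymm` and `cgemm` from `scipy.linalg.blas`), which is a convenient way to sanity-check the update sequence on small matrices:

```python
import numpy as np
from scipy.linalg.blas import csymm, cgemm

def nodal_admittance(Ya, Yb, A, B):
    """Y_N = A^T Ya A + B^T Yb B, with Ya and Yb symmetric.

    All inputs are assumed to be complex64 arrays.
    """
    C = csymm(1.0, Ya, A)                  # C  := Ya * A      (CSYMM)
    Yn = cgemm(1.0, A, C, trans_a=1)       # Yn := A^T * C     (CGEMM)
    C = csymm(1.0, Yb, B)                  # C  := Yb * B      (CSYMM)
    Yn = cgemm(1.0, B, C, beta=1.0, c=Yn,  # Yn := B^T * C + Yn (CGEMM)
               trans_a=1, overwrite_c=1)
    return Yn
```

Note that `trans_a=1` requests the plain transpose (not the conjugate transpose), matching the $A^T$ and $B^T$ in the formula.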

The system I want to simulate has about 50000 conductors and nodes. I am storing 6 dense matrices, $A, B, C, Y_N, Y_A, Y_B$, each with 2.5 billion complex-float elements. That requires 120 GB of memory and I would like to reduce that.

At first I thought about storing $A$ and $B$ as dense bit matrices (inspired by this question on StackOverflow), getting the source code of cgemm and csymm from the Netlib site, and rewriting them to use the bit matrices, doing the bit-to-complex conversion inside the loop. That would save me almost 40 GB of memory and seems like a good solution, but I am wondering: is there an algorithm, or some graph theory, that would let me avoid storing the intermediate complex matrix $C$, saving another 20 GB of memory?
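Going one step further than bit matrices, $A$ could be stored as just two node indices per conductor (a few hundred kilobytes instead of gigabytes) and dense column blocks materialized on demand, e.g. for a blocked GEMM. A sketch, where `materialize_block` is a hypothetical helper, not an existing routine:

```python
import numpy as np

def materialize_block(start, end, cols, dtype=np.complex64):
    """Materialize the dense columns `cols` (a slice) of A on the fly
    from the endpoint lists, instead of keeping the whole m x n matrix.

    start/end hold the two node indices per conductor: ~400 kB for
    50000 conductors, versus ~20 GB for the dense complex matrix.
    """
    m = len(start)
    lo, hi = cols.start, cols.stop
    blk = np.zeros((m, hi - lo), dtype=dtype)
    for i in range(m):
        if lo <= start[i] < hi:
            blk[i, start[i] - lo] = 1    # start point of conductor i
        if lo <= end[i] < hi:
            blk[i, end[i] - lo] = -1     # end point of conductor i
    return blk
```

The block for $B$ follows as `np.abs(blk) / 2`, so $B$ needs no storage at all.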

  • Are $Y_A, Y_B$ symmetric? How sparse are $A$, $B$? What are their sizes? – Federico Poloni May 30 '22 at 21:17
  • Also, why are you storing $A$ and $B$ as dense matrices, if they are sparse? – Federico Poloni May 30 '22 at 21:18
  • This is an inefficient way to store A and B. Each conductor has only one start and one end, right? Out of the 2,500,000,000 matrix elements in A, only 100,000 of them aren't zero. You should consider just storing the indices of the ones that aren't zero. Since you know every conductor has a start and an end, all you have to store is 2 node indices per conductor, 400kB of memory for A, and B can be calculated from A. It doesn't solve your overall problem though. – user253751 May 31 '22 at 08:34
  • $Y_A$ and $Y_B$ are symmetric. $A$ and $B$ are stored as dense so I can pass them to BLAS routines. I've put this information in an edit. The system have $n$ conductors and $m$ nodes. $Y$ are $m \times m$ and $A$ and $B$ are $m \times n$. – Pedro H. N. Vieira May 31 '22 at 15:01
  • If I understand the problem statement correctly, the statement "Each conductor has exactly one start point and exactly one end point" translates to "the connection graph is a directed graph, consisting of one or more circles". This would allow for a renumbering of the graph nodes, such that almost all non-zero elements of A and B are next to the diagonal. E.g. 0->1->2->3->0 would just have a[0,3] and a[3,0] away from the diagonal. – MSalters May 31 '22 at 15:02
  • @MSalters I don't understand exactly what you are saying (I'm not very familiar with graph theory), but it sounds promising. Where can I read more about what you are proposing? – Pedro H. N. Vieira May 31 '22 at 15:05
  • @PedroH.N.Vieira: I think you might be able to just experiment a bit yourself. Create a small example A, say with 16 entries. You can then follow the conductors: 0 connects to X, and X connects either back to 0 (small circle, size 2) or to Y. Y again connects either back to 0 (circle size 3, 0->X->Y->0) or to a new element Z etc. You can now renumber {0,X,Y,Z} to {0', 1', 2', 3'}. With this renumbering, in general you break the problem into multiple independent sub-problems, each of which doesn't even require matrix multiplications anymore. This problem should fit in a megabyte. – MSalters May 31 '22 at 15:15
  • is n smaller or larger than m? – Thijs Steel May 31 '22 at 15:46
  • Nothing can be said about $n$ and $m$. It depends on the system being simulated. Usually, but not always, $n > m$. edit: I realized that, in my comment above, I swapped $m$ and $n$, and can't edit it now. $n$ is the number of nodes. $m$ is the number of conductors. – Pedro H. N. Vieira May 31 '22 at 16:01
  • @MSalters the question doesn't seem to state that each node can only be the starting point (or end point) of one conductor, though? – justhalf Jun 01 '22 at 02:32
  • Have you considered using something like C++ with Eigen that may let you get operations done on the fly inside optimized loops (e.g. generating A and B on the fly from 1-byte or 2-bit elements instead of 8-byte complex float)? Or with AVX-512, two separate arrays of bitmasks (one for non-zero, one for sign; B needs only the non-zero one). Those could be used to generate a +1.0 or -1.0 on the fly (sign mask) for merge-masked FMAs using the other mask. (Although doing that naively would mean needing a bitwise transpose, so probably better to just use sparse algorithms.) – Peter Cordes Jun 01 '22 at 04:16
  • Anyway, choosing to use dense BLAS routines for this means a huge amount of the work your CPUs are doing is just adding zeros, as well as wasting vast amounts of memory because you need the whole complex-float matrix available as an input, instead of materializing parts of it on the fly cheaply from a compressed representation. That's cheap enough (especially if you keep separate byte elements, not sub-byte packing) that it's fine for matmul to be doing that on the fly, probably coming out ahead if memory bandwidth and/or cache footprint were bigger bottlenecks than SIMD ALUs. – Peter Cordes Jun 01 '22 at 04:20

2 Answers


BLAS may not have a function to compute what you are asking for, but the product $$ Y_N = A^TY_AA + B^T Y_B B $$ means that the entries $(Y_N)_{ij}$ are defined by $$ (Y_N)_{ij} = \sum_{k,l} (A^T)_{ik}(Y_A)_{kl}A_{lj} + \sum_{k,l} (B^T)_{ik}(Y_B)_{kl}B_{lj} $$ which is equivalent to $$ (Y_N)_{ij} = \sum_{k,l} \left\{ a_{ki}(Y_A)_{kl}a_{lj} + b_{ki}(Y_B)_{kl}b_{lj} \right\}. $$ Given your formula for how $A$ and $B$ are related, this is in turn equal to $$ (Y_N)_{ij} = \sum_{k,l} \left\{ a_{ki}(Y_A)_{kl}a_{lj} + \frac 14 |a_{ki}|(Y_B)_{kl}|a_{lj}| \right\}. $$

You can implement this operation using four nested loops over $i,j,k,l$, and if you make the loops over $i,j$ the outer ones, you don't need any intermediate storage -- in fact, you don't need to store the $B$ matrix either based on the formula. You really don't need to store the $A$ matrix either; all you need to know is whether $a_{ij}$ is zero, plus one, or minus one. That can be stored with far fewer bits than as a floating point number.
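Since each row of $A$ holds exactly one $+1$ and one $-1$, the quadruple sum collapses further: each pair of conductors $(k,l)$ contributes to only four entries of $Y_N$. A minimal sketch of that idea (the `start`/`end` index arrays are assumed names, not from the answer):

```python
import numpy as np

def assemble_yn(start, end, Ya, Yb):
    """Y_N = A^T Ya A + B^T Yb B without forming A, B or the intermediate C.

    start[k], end[k] are the node indices of conductor k; the entry
    Ya[k, l] is scattered into Y_N with the signs of a_ki * a_lj, and
    Yb[k, l] / 4 is added to the same four positions (b = |a| / 2).
    """
    m = len(start)
    n = 1 + max(max(start), max(end))
    Yn = np.zeros((n, n), dtype=Ya.dtype)
    for k in range(m):
        sk, ek = start[k], end[k]
        for l in range(m):
            sl, el = start[l], end[l]
            ya = Ya[k, l]           # a_ki Ya_kl a_lj, with a in {+1, -1}
            yb = 0.25 * Yb[k, l]    # |a_ki| / 2 * Yb_kl * |a_lj| / 2
            Yn[sk, sl] += ya + yb
            Yn[sk, el] += -ya + yb
            Yn[ek, sl] += -ya + yb
            Yn[ek, el] += ya + yb
    return Yn
```

This is $O(m^2)$ work instead of $O(m^2 n^2)$ for the naive quadruple loop, and the only extra storage beyond $Y_N$ is the two index arrays.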

If done naively, this quadruple loop may not be as fast as what BLAS does. But it saves memory, which is, after all, what you were after. On the other hand, reducing memory requirements on modern processors often translates into faster computations, so it may go either way.

Wolfgang Bangerth
  • You should be able to cache-block the quadruple loop somehow. Getting it to auto-vectorize with SIMD might be a problem, so perhaps best to use a sparse representation as well as compressed -1 / 0 / +1 int8_t, so you can skip to the next non-zero a_{ki} and/or next non-zero a_{lj} efficiently, traversing the sparse representation directly. If it's not going to vectorize with SIMD in the first place, then take full advantage of being scalar. – Peter Cordes Jun 01 '22 at 04:29
  • And maybe it actually could auto-vectorize with gather loads from Y, if you arrange your A and Ya / Yb storage so the things you want to iterate over / index by in the inner-most loop are contiguous / linear respectively. Actually, it looks like only a_ki varies with i inside the expression you're summing, so some unrolling or cache blocking over that could be very good, reusing the current values of all the other things for multiple i values in the inner-most loop. But that would mean scattered stores to YN to add to the total there. Anyway, there's a lot you could do with this form. – Peter Cordes Jun 01 '22 at 04:37
  • You can precompute the products (aki * alj) here, since they are commutative to avoid storing the intermediate value for the inner loops. – xryl669 Jun 01 '22 at 13:34
  • And given the problem description, the matrices A and B are really an artifact of the approach of using BLAS to perform the computation. A typical sparse representation would be much closer to a list of the start and end nodes for each conductor, and by using such a list directly, one can reduce the computation to a triple loop instead of a quadruple one. – John Bollinger Jun 01 '22 at 14:23

I agree with the comments that your best bet is likely to use a sparse multiplication so that you don't need to store A and B as dense.

If you want something easy that uses high-level LAPACK/BLAS routines, consider the following algorithm:

  1. Calculate Cholesky decompositions $Y_a = L_aL_a^H$ and $Y_b = L_bL_b^H$ (you can use $Y_n$ as workspace).
  2. $A := L_a^HA$ and $B := L_b^HB$ (ctrmm)
  3. $Y_n := A^HA$ (cherk)
  4. $Y_n := B^HB + Y_n$ (cherk)
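A minimal NumPy/SciPy sketch of steps 1-4, under the answer's assumption that $Y_a$ and $Y_b$ are Hermitian positive definite (note that $A$ and $B$ are real-valued, so $A^H = A^T$ and this matches the question's formula):

```python
import numpy as np
from scipy.linalg import cholesky

def yn_via_cholesky(Ya, Yb, A, B):
    """Y_N = A^H Ya A + B^H Yb B via Ya = La La^H and Yb = Lb Lb^H.

    Assumes Ya and Yb are Hermitian positive definite; otherwise an
    LDL^T factorization would be needed, as noted in the downsides.
    """
    La = cholesky(Ya, lower=True)   # step 1: Ya = La La^H
    Lb = cholesky(Yb, lower=True)
    A2 = La.conj().T @ A            # step 2 (ctrmm): A := La^H A
    B2 = Lb.conj().T @ B
    Yn = A2.conj().T @ A2           # step 3 (cherk): Yn := A^H A
    Yn += B2.conj().T @ B2          # step 4 (cherk): Yn += B^H B
    return Yn
```

In a real implementation the plain `@` products would be replaced by the in-place ctrmm/cherk calls named in the steps, so that $A$ and $B$ are overwritten and no extra $m \times n$ workspace is allocated.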

Some downsides to this approach:

  • Assumes the matrices are positive definite; if not, you need to use an LDL$^T$ factorization instead
  • Likely introduces extra rounding error
  • Cholesky decomposition may be prohibitively expensive to calculate (doesn't scale as well as gemm)

Some upsides to this approach:

  • Less memory: you can even store the triangular parts of $Y_a$ and $Y_b$ in the same $m \times m$ matrix if you store the diagonal separately.
  • Likely faster than the dense for-loop approach. I think the for loop will have many cache misses. (And a quadruple for loop would be $O(n^4)$, pretty expensive for matrices that large)
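The triangle packing from the memory upside could look like this (a hypothetical helper pair, not code from the answer): since both matrices are symmetric, one strict lower triangle and one strict upper triangle of a single $m \times m$ array hold everything except the two diagonals.

```python
import numpy as np

def pack(Ya, Yb):
    """Pack two symmetric m x m matrices into one m x m array.

    The strict lower triangle of P holds Ya, the strict upper triangle
    holds Yb; the two diagonals are returned separately.
    """
    P = np.tril(Ya, -1) + np.triu(Yb, 1)
    return P, np.diag(Ya).copy(), np.diag(Yb).copy()

def unpack(P, da, db):
    """Recover the full symmetric Ya and Yb from the packed storage."""
    L = np.tril(P, -1)                     # strict lower = Ya off-diagonal
    U = np.triu(P, 1)                      # strict upper = Yb off-diagonal
    return L + L.T + np.diag(da), U + U.T + np.diag(db)
```

This halves the storage for the pair $(Y_a, Y_b)$: one $m \times m$ array plus two length-$m$ diagonals instead of two full matrices.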
Thijs Steel
  • The $Y$ matrices are dense (every element is non-zero), indefinite, complex, but symmetric: $y_{ik} = y_{ki}$. I haven't tested using sparse-dense matrix multiplication yet. I'll give it a try. – Pedro H. N. Vieira May 31 '22 at 16:39
  • @PedroH.N.Vieira So they are complex symmetric, not Hermitian $y_{ik} = \overline{y_{ki}}$? – Federico Poloni May 31 '22 at 22:02
  • Just symmetric, not Hermitian. Those elements are calculated from the electromagnetic coupling from one conductor to another (induced fields in one by charges in the other). It doesn't matter which one is considered the "source". – Pedro H. N. Vieira May 31 '22 at 22:20
  • Ah, in that case you could split the matrix into its real and imaginary parts. Then you have two "real" symmetric matrices, but if they are indefinite Cholesky won't work anyway. – Thijs Steel Jun 01 '22 at 11:49