What is the best way to compute: $$ Y = D X $$ where $D \in \mathbb{R}^{m\times m}$ is diagonal and $X \in \mathbb{C}^{m \times n}$ is general. I am mostly interested in these two cases:
- $m >> n$, $m > 10^7$
- $n >> m$, $m < 10^4$
Options
I can think of four not-obviously-flawed ways of doing this: loops, forall, loop over zgbmv, loop over zdscal.
Loop
do i = 1,n
do j = 1,m
Y(j,i) = D(j) * X(j,i)
enddo
enddo
- Pros: easy to read, reads D, X, Y in order
- Cons: doesn't re-use D
Forall
forall (i = 1:n, j = 1:m) Y(j,i) = D(j) * X(j,i)
- Pros: concise, gives compiler freedom?
- Cons: gives compiler freedom?
- Notes: previous dicsussions of forall in these comments and this post
zgbmv
Dz = cmplx(D)
do i = 1,n
call zgbmv('N', m, m, 0, 0, one, D, 1, X(1,i), 1, zero, Y(1,i), 1)
enddo
- Pros: similar to loop, but could contain BLAS magic
- Cons: doesn't re-use D, double size of D by casting to complex
zdscal
Y = X
do i = 1,m
call zdscal(n,D(i),Y(i,1),m)
enddo
- Pros: re-uses D, could contain BLAS magic
- Cons: strided reads of Y, requires copy if not in-place
Thoughts
The two major trade-offs seem to be in order reads of X vs re-use of D and use of fortran libraries vs treating D as a real instead of casting to complex. A custom implementation could get the best of both worlds in both cases, but I'm leery of architecture-specific parameters. Best case would be a way to express the operation natively (e.g. loops or forall) and tell the compiler to do the rest.
forall. Usedo concurrentinstead. In practice I've seenforallperformance be absolutely atrocious.I agree with Bill - just benchmark all 4 cases, then answer your own question with some data. If I had to guess, I'd bet on the hand-written loop. I also prefer eliminating loops anywhere possible:
– Aurelius Jan 31 '14 at 18:14do i = 1,n; Y(1:m,i) = D(1:m) * X(1:m,i); enddo;Y(1:m,i) = D(1:m) * X(1:m,i)in your loop body expand as foralls? – Max Hutchinson Feb 03 '14 at 14:52