What is the best way to multiply by a diagonal matrix (in Fortran)?

Question

What is the best way to compute $$ Y = D X $$ where $D \in \mathbb{R}^{m\times m}$ is diagonal and $X \in \mathbb{C}^{m \times n}$ is general? I am mostly interested in these two cases:

$m \gg n$, $m > 10^7$
$n \gg m$, $m < 10^4$

Options

I can think of four not-obviously-flawed ways of doing this: loops, forall, loop over zgbmv, loop over zdscal.

Loop

do i = 1,n
  do j = 1,m
    Y(j,i) = D(j) * X(j,i)
  enddo
enddo

Pros: easy to read, reads D, X, Y in order
Cons: doesn't re-use D

Forall

forall (i = 1:n, j = 1:m) Y(j,i) = D(j) * X(j,i)

Pros: concise, gives compiler freedom?
Cons: gives compiler freedom?
Notes: previous discussions of forall in these comments and this post

zgbmv

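 ! Dz(m), one and zero are double-precision complex; zgbmv needs a complex band matrix
 Dz = cmplx(D, kind=kind(Dz))
 do i = 1,n
   call zgbmv('N', m, m, 0, 0, one, Dz, 1, X(1,i), 1, zero, Y(1,i), 1)
 enddo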

Pros: similar to loop, but could contain BLAS magic
Cons: doesn't re-use D, doubles the size of D by casting to complex

zdscal

 Y = X
 do i = 1,m
   call zdscal(n,D(i),Y(i,1),m)
 enddo

Pros: re-uses D, could contain BLAS magic
Cons: strided reads of Y, requires copy if not in-place

Thoughts

The two major trade-offs seem to be in-order reads of X vs. re-use of D, and use of Fortran libraries vs. treating D as real instead of casting it to complex. A custom implementation could get the best of both worlds in both cases, but I'm leery of architecture-specific parameters. The best case would be a way to express the operation natively (e.g. loops or forall) and tell the compiler to do the rest.
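For concreteness, a custom implementation of this kind might look like the row-blocked loop sketched below. This is only an illustration: the module name diag_blocked, the routine diag_mult_blocked, and the block size bs are hypothetical, and bs is exactly the sort of architecture-specific parameter I would rather avoid. It has not been benchmarked.

```fortran
module diag_blocked
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Y = D*X, with the rows processed in blocks of bs (hypothetical tuning
  ! parameter, must be >= 1). Each block of D is re-used across all n
  ! columns, while X and Y are still accessed contiguously within a block.
  subroutine diag_mult_blocked(m, n, bs, D, X, Y)
    integer, intent(in) :: m, n, bs
    real(dp), intent(in) :: D(m)
    complex(dp), intent(in) :: X(m, n)
    complex(dp), intent(out) :: Y(m, n)
    integer :: i, j, j0, j1

    do j0 = 1, m, bs                 ! first row of the current block
      j1 = min(j0 + bs - 1, m)       ! last row of the current block
      do i = 1, n
        do j = j0, j1
          Y(j, i) = D(j) * X(j, i)
        end do
      end do
    end do
  end subroutine diag_mult_blocked
end module diag_blocked
```

With bs >= m this reduces to the plain nested loop above; a smaller bs trades shorter contiguous runs for keeping a slice of D in cache.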

It looks like you've written most of the code, what do your performance numbers say? — Bill Barth, Jan 31 '14 at 18:00
According to Dr. Fortran himself you shouldn't ever use forall. Use do concurrent instead. In practice I've seen forall performance be absolutely atrocious.
I agree with Bill - just benchmark all 4 cases, then answer your own question with some data. If I had to guess, I'd bet on the hand-written loop. I also prefer eliminating loops anywhere possible:

do i = 1,n; Y(1:m,i) = D(1:m) * X(1:m,i); enddo; — Aurelius, Jan 31 '14 at 18:14
Of course the answer depends on a lot of different factors: single or multithreaded execution, CPU/GPU architecture, L1-LN cache sizes, ... Just benchmark and pick the best result, being prepared to obtain completely different results on different machines or compilers. This said, if code robustness is your main concern I would go for the nested explicit loop. — Stefano M, Feb 02 '14 at 23:48
@StefanoM, Bill, and Aurelius: I'm really looking for the best general solution or 'there is no general solution, you have to tune for each platform'. It looks like loops are the way to go, but I've only sampled 2 points in configuration space. — Max Hutchinson, Feb 03 '14 at 14:49
@Aurelius does the Y(1:m,i) = D(1:m) * X(1:m,i) in your loop body expand as a forall? — Max Hutchinson, Feb 03 '14 at 14:52

score 10 · Accepted Answer · answered Feb 03 '14 at 14:43

tl;dr Use loops

My numbers indicate that ifort is smart enough to treat the loop, forall, and do concurrent identically, and it achieves what I'd expect to be about 'peak' in each of those cases. gfortran, on the other hand, does a bad job (10x or more slower) with forall and do concurrent, especially as N gets large. With either compiler, forall and do concurrent seem to perform identically to each other.
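The question does not show a do concurrent variant, so here is a minimal sketch of how it sits next to the explicit loop. The module and routine names (diag_variants, diag_mult_loop, diag_mult_concurrent) are made up for illustration, and this is not necessarily the exact code that was timed.

```fortran
module diag_variants
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Explicit nested loop: the inner index runs over rows, so X and Y are
  ! traversed in column-major (memory) order.
  subroutine diag_mult_loop(D, X, Y)
    real(dp), intent(in) :: D(:)
    complex(dp), intent(in) :: X(:, :)
    complex(dp), intent(out) :: Y(:, :)
    integer :: i, j
    do i = 1, size(X, 2)
      do j = 1, size(X, 1)
        Y(j, i) = D(j) * X(j, i)
      end do
    end do
  end subroutine diag_mult_loop

  ! Fortran 2008 do concurrent: states that the iterations are independent
  ! and leaves the iteration order to the compiler.
  subroutine diag_mult_concurrent(D, X, Y)
    real(dp), intent(in) :: D(:)
    complex(dp), intent(in) :: X(:, :)
    complex(dp), intent(out) :: Y(:, :)
    integer :: i, j
    do concurrent (i = 1:size(X, 2), j = 1:size(X, 1))
      Y(j, i) = D(j) * X(j, i)
    end do
  end subroutine diag_mult_concurrent
end module diag_variants
```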

I'm using MKL for BLAS with both ifort and gfortran. zgbmv is a steady 3x slower than 'peak'. zdscal is close to peak for small problems, especially small N, but quickly falls behind. It is never worse than gfortran's forall and do concurrent, though.

On systems like mine (standard workstation configuration), it looks like loops are both the most robust and highest performance for all N,M. do concurrent has no advantage over forall: both are bad.

You can find result tables and code here. Are results similar for IBM or PGI?

Interestingly, the performance as a multiple of the copy time Y = X doesn't depend on N. I would have thought re-use of D would improve performance at higher N, similar to GEMMs outperforming GEMVs.
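A rough sketch of how such a "multiple of the copy time" could be measured is below. The names (diag_timing, loop_over_copy_ratio, nrep) and the best-of-nrep approach are assumptions made for this illustration, not a description of the linked benchmark code. An aggressive optimizer may remove stores it considers dead, so treat the numbers with care.

```fortran
module diag_timing
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Returns (best time of the Y = D*X loop) / (best time of the copy Y = X),
  ! both measured in system_clock ticks over nrep repetitions.
  function loop_over_copy_ratio(m, n, nrep) result(ratio)
    integer, intent(in) :: m, n, nrep
    real(dp) :: ratio
    real(dp), allocatable :: D(:), re(:, :), im(:, :)
    complex(dp), allocatable :: X(:, :), Y(:, :)
    integer(int64) :: t0, t1, best_copy, best_loop
    real(dp) :: chk
    integer :: i, j, r

    allocate(D(m), re(m, n), im(m, n), X(m, n), Y(m, n))
    call random_number(D)
    call random_number(re)
    call random_number(im)
    X = cmplx(re, im, kind=dp)
    deallocate(re, im)

    best_copy = huge(best_copy)
    best_loop = huge(best_loop)
    chk = 0.0_dp
    do r = 1, nrep
      call system_clock(t0)
      Y = X                               ! baseline: plain copy
      call system_clock(t1)
      best_copy = min(best_copy, t1 - t0)
      chk = chk + abs(Y(m, n))            ! use Y so the copy is not dead code

      call system_clock(t0)
      do i = 1, n                         ! the nested loop from the question
        do j = 1, m
          Y(j, i) = D(j) * X(j, i)
        end do
      end do
      call system_clock(t1)
      best_loop = min(best_loop, t1 - t0)
      chk = chk + abs(Y(m, n))
    end do
    if (chk < 0.0_dp) print *, chk        ! never true; keeps chk observable

    ! Guard against a zero tick count on timers with coarse resolution.
    ratio = real(best_loop, dp) / real(max(best_copy, 1_int64), dp)
  end function loop_over_copy_ratio
end module diag_timing
```

Calling loop_over_copy_ratio for a range of n at fixed m (and the reverse) gives the kind of table where dependence on N, or the lack of it, would show up.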

Notes

$ ifort --version
ifort (IFORT) 14.0.1 20131008

$ gfortran --version
GNU Fortran (Debian 4.7.2-5) 4.7.2