I've just (to my embarrassment) encountered a BLAS-like extension of the matrix-matrix product routine gemm in Intel MKL: gemm3m. This routine (in its complex flavours cgemm3m and zgemm3m) performs matrix-matrix multiplication of complex-valued matrices using fewer real arithmetic operations than the conventional algorithm.
The gemm3m documentation claims that it

> ...reduces the time spent in matrix operations by 25%, resulting in significant savings in compute time for large matrices.
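For context, the savings come from a Karatsuba-style identity. The usual "3M" formulation (my own sketch of the textbook scheme, not a quote from the MKL docs) forms the complex product with three real matrix multiplications instead of four:

$$
T_1 = A_1 B_1,\qquad T_2 = A_2 B_2,\qquad T_3 = (A_1 + A_2)(B_1 + B_2),
$$
$$
C_1 = T_1 - T_2,\qquad C_2 = T_3 - T_1 - T_2.
$$

One checks directly that $T_3 - T_1 - T_2 = A_1 B_2 + A_2 B_1$, the imaginary part of the product.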
Looking at the provided error analysis in the Application Notes, I don't see anything "criminal":
$$
\hat{C}=\hat{C}_1+i\hat{C}_2=\text{fl}\big((A_1+iA_2)(B_1+iB_2)\big),\qquad C=C_1+iC_2=AB,
$$
$$
\|\hat{C}_1-C_1\|\leq 2(n+1)\,u\,\|A\|_\infty\|B\|_\infty+\mathcal O(u^2),
$$
$$
\|\hat{C}_2-C_2\|\leq 4(n+4)\,u\,\|A\|_\infty\|B\|_\infty+\mathcal O(u^2),
$$
where $A,B,C\in\mathbb C^{n\times n}$ are complex matrices, $A_{1,2},B_{1,2},C_{1,2}\in\mathbb R^{n\times n}$ are their real and imaginary parts, respectively, and $i=\sqrt{-1}$; $\hat{C}\in\mathbb C^{n\times n}$ and $\hat{C}_{1,2}\in\mathbb R^{n\times n}$ are the results of the floating-point operations on $A$ and $B$ according to the gemm3m matrix-matrix multiplication algorithm. Here $|u|<\epsilon_\text{mach}$, provided the floating-point arithmetic is IEEE-754 and no underflow/overflow happens.
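To see these normwise bounds in action, here is a minimal NumPy sketch of the 3M scheme (my own reconstruction of the textbook algorithm, not MKL's actual implementation), compared against the conventional product:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

A1, A2 = A.real, A.imag
B1, B2 = B.real, B.imag

# 3M: three real matrix products instead of four
T1 = A1 @ B1
T2 = A2 @ B2
T3 = (A1 + A2) @ (B1 + B2)
C3m = (T1 - T2) + 1j * (T3 - T1 - T2)

# conventional (4M) product for comparison
C = A @ B

# normwise relative error of 3M vs the conventional product
err = np.linalg.norm(C3m - C, np.inf) / (
    np.linalg.norm(A, np.inf) * np.linalg.norm(B, np.inf)
)
print(err)
```

For matrices of this size the printed error is of order $n\,u$, consistent with the bounds above: normwise, 3M is about as accurate as the conventional algorithm.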
So, is there any catch to using zgemm3m instead of regular zgemm? Is there a situation where I should avoid zgemm3m?
> If `a=0.1; b=13e-10; c=0.3; d=31e-10;`, then in double precision (with Octave) `a*d+b*c = 7.000000000000001e-10`, whereas `(a+b)*(c+d)-a*c-b*d = 6.999999983771085e-10`, which is less accurate. – wim Jul 31 '19 at 09:19
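The comment's scalar example can be checked against an exact rational reference. A small Python sketch (mine, for illustration) comparing the direct imaginary part `a*d + b*c` with the 3M-style `(a+b)*(c+d) - a*c - b*d`:

```python
from fractions import Fraction

a, b, c, d = 0.1, 13e-10, 0.3, 31e-10

direct = a * d + b * c                        # imaginary part, conventional (4M) style
three_m = (a + b) * (c + d) - a * c - b * d   # imaginary part, 3M style

# exact reference computed from the rational values of the stored doubles
exact = Fraction(a) * Fraction(d) + Fraction(b) * Fraction(c)

err_direct = abs(Fraction(direct) - exact)
err_3m = abs(Fraction(three_m) - exact)
print(direct, three_m)
```

The subtraction in the 3M form cancels the large terms `(a+b)*(c+d)` and `a*c`, so the small imaginary part loses many significant digits even though the *normwise* error bound above is still satisfied: the bound is relative to `||A|| ||B||`, not to the magnitude of the imaginary part itself.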