10

I need to compute a lot of $3\times3$ matrix inverses (for Newton iteration polar decomposition), with very small number of degenerate cases ($<0.1\%$).

Explicit inverse (via matrix minors divided by determinant) seems to work, and is about ~32~40 fused flops (depending on how I compute reciprocal of the determinant). Not considering the det scale factor, it's only 18 fused flops (each of the 9 elements is of the form ab-cd, 2 fused flops).

Question:

  • Is there a way to compute inverse of $3\times 3$ using fewer than 18 (with arbitrary scale) or 32 (with proper scale, considering reciprocal 1 op) fused flops?
  • Is there an economical way (using ~50 f-flops) to compute a backwards-stable left inverse of a $3\times 3$ matrix?

I'm using single-precision floats (iOS game). The backwards stability is interesting new concept for me and I want to experiment. Here's the article that provoked the thought.

Gilles
  • 253
  • 1
  • 2
  • 10
  • What about using Cayley-Hamilton theorem for the inverse? – nicoguaro Mar 07 '16 at 17:41
  • 1
    If this is such a bottleneck for you, could another algorithm for polar decomposition be faster in this case? Through SVD, for example? Or accelerating Newton's method as in 3.3 of http://eprints.ma.man.ac.uk/694/01/covered/MIMS_ep2007_9.pdf? – Kirill Mar 07 '16 at 18:51

1 Answers1

5

I will try to give my thought on the first question regarding fast $3\times 3$ inverse. Consider

$$ A=\left[ \begin{array}{ccc} a & d & g\\ b & e & h\\ c & f & i \end{array}\right] $$

Since the matrices are small and very general (do not feature any known structure, zeroes, relative scales of the elements), I think it would be impossible to give an algorithm for arbitrary scale (without $1/\det(A)$) inverse that is faster than 18 fused flops, as each out of 9 elements requires 2 fused flops, and all products are unique, provided no prior info on $A$'s entries $a,\ldots,i$. $$ A^{-1}\det(A)=\text{adj}(A)= \left[\begin{array}{ccc} ei-fh & di-fg & ge-dh\\ bi-ch & ai-cg & ah-bg\\ ce-bf & af-cd & ae-bd \end{array}\right] $$ Here, $\text{adj}(A)$ denotes the adjugate (transpose of cofactors), which essentially is an inverse with "arbitrary scale" (provided the inverse exists).

However, some calculation can be reused for calculation of the $\det(A)$. If I expand it over the first column (5 more choices are there): $$ \begin{aligned} \det(A)&=a(ei-fh)+b(fg-di)+c(dh-ge)\\ &=a\underbrace{(ei-fh)}_*-b\underbrace{(di-fg)}_*-c\underbrace{(ge-dh)}_* \end{aligned} $$ Notice, that (*) has been already computed during evaluation of $\text{adj}(A)$. So, the reciprocal of determinant can be computed in 4 additional fused flops (if $1/\det(A)$ reciprocal is considered as 1 flop).

Now, each 9 elements of the $\text{adj}(A)$ should be scaled by already obtained reciprocal of the determinant, adding another 9 fused flops.

So,

  1. Calculate $\text{adj}(A)$ in 18 fused flops
  2. Calculate $\det(A)$ in 3 fused flops using entries of already computed $\text{adj}(A)$
  3. Find $\frac{1}{\det(A)}$ (assuming 1 flop).
  4. Scale each elelemt of already computed $\text{adj}(A)$ by $\frac{1}{\det(A)}$ in another 9 fused flops.

Resulting in 18+3+1+9=31 fused flops. You did not describe your way of computing the determinant, but I guess 1 additional flop can be saved. Or it can be used to perform the check $|\det(A)|>\epsilon$ in step 3, where $\epsilon$ is the tolerance for degenerate (not invertible) case, resulting in 32 fused flops (assuming if is 1 flop).

I don't think there is a faster way to compute the inverse of a $3\times 3$ general matrix as all remaining calculations are unique. Using Cayley-Hamilton should not help from the speed perspective, as in general, it will require calculation of $A^2$ for a $3\times 3$ matrix besides some other operations.

NB:

  • this answer does not deal with the numerical stability
  • the possible potential for vectorization and optimizing memory access pattern is also not discussed
Anton Menshov
  • 8,672
  • 7
  • 38
  • 94
  • It appears that all of the off-diagonal entries of the expression for the adjugate here have the wrong sign. Of course that doesn’t affect discussion about number of operations. – JWWalker Apr 25 '23 at 04:12