17

I am wondering why fast matrix multiplication algorithms are impractical, especially for Boolean matrix multiplication.

I have read that fast matrix multiplication algorithms are impractical because of large constant factors, and that these constant factors come from the algebraic techniques involved.

I do not understand where these constant factors come from. Some references also say that fast matrix multiplication may become practical once an $O(n^{3-\varepsilon})$ combinatorial algorithm exists. What is the large constant factor, and what is a combinatorial algorithm, for Boolean matrix multiplication?

  • 9
    Have you thought about looking at these asymptotically "fast" algorithms to see what is involved? That would probably help. The notation $O(f(n))$ does not really tell you much about a practical computer implementation. – Ryan Budney Apr 28 '22 at 21:16
  • 6
    Related: [How fast can we *really* multiply matrices?](https://mathoverflow.net/questions/101531/) – Peter Mortensen Apr 29 '22 at 22:48
  • 1
    A non-mathematical answer: I have heard that hardware and libraries are developed so that they run various standard benchmarks efficiently, and those benchmarks tend not to use sophisticated algorithms. – Pteromys Apr 30 '22 at 15:47

4 Answers

20

Matrix multiplication based on Strassen's algorithm runs in $O(n^{\log_2 7}) \approx O(n^{2.807})$ operations and is quite practical. As far as I am aware, for any exponent $\omega<\log_2 7$ the corresponding algorithm is impractical, indeed because of huge constants.

Added Oct. 9, 2022:

Apparently, AlphaTensor by DeepMind has found (many) ways to multiply $4\times4$ matrices in $47$ multiplications, so I guess the "practicality exponent" is now down to $\log_4 47 \approx 2.777$.
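
For concreteness, the two exponents can be computed directly (a minimal Python sketch):

```python
import math

# Strassen: 7 multiplications suffice for a 2x2 block product,
# giving exponent log_2(7).
strassen_exponent = math.log(7, 2)        # ~ 2.8074

# AlphaTensor's scheme: 47 multiplications for a 4x4 block product,
# giving exponent log_4(47).
alphatensor_exponent = math.log(47, 4)    # ~ 2.7774

print(f"Strassen:    n^{strassen_exponent:.4f}")
print(f"AlphaTensor: n^{alphatensor_exponent:.4f}")
```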

Henri Cohen
  • 2
    Indeed. According to https://www.researchgate.net/profile/Jianyu-Huang/publication/315365781_Strassen's_Algorithm_Reloaded/links/5c230918a6fdccfc70690b6a/Strassens-Algorithm-Reloaded.pdf Strassen's method can be good even for quite small matrices. – Brendan McKay Apr 29 '22 at 04:04
  • 1
    How huge is it? I cannot find an exact value for it anywhere. – Jiawei Ren Apr 29 '22 at 05:13
  • 1
    @JiaweiRen There is no exact value. For the numerical bounds on the current best matrix multiplication algorithm, you can look at section 6.5 of this paper. – BrockenDuck Apr 29 '22 at 08:17
  • 2
    @JiaweiRen The Jianyu-Huang article cited above mentions "as low as 500" as the turnaround point. That may be considered small in some domains (such as scientific computing) but for general purpose programming a 500x500 array is far from what most would call a "small" matrix. – RBarryYoung May 01 '22 at 16:56
  • 3
    Henri, from what I have read, the DeepMind $4\times 4$ multiplication is only valid in characteristic $2$. Apparently, this is supposed to be a practical general matrix multiplication asymptotically faster than Strassen's. – Aurel Oct 09 '22 at 09:33
  • @RBarryYoung: When I think of numerical linear algebra, computing with graph adjacency matrices, solving discretized differential equations, and calculating weighted autocorrelations are applications that come to mind—so 500 by 500 is indeed small. I don't think the need for stuff like this is limited to scientific computing. Surely there are marketing businesses that want to find autocorrelations of daily time series more than two years long, and logistics companies that want to multiply adjacency matrices of graphs with more than 500 vertices! – Vectornaut Dec 22 '23 at 17:45
19

I acknowledge that the question concerns Boolean matrix multiplication. However, a good deal of the opposition to fast matrix multiplication algorithms is due to stability issues that can arise when using floating point arithmetic.

Useful algorithms are stable, accurate and fast. Blinding speed is utterly irrelevant if the algorithm is unstable or inaccurate for valid input.

The standard algorithm for computing the matrix-matrix product $C = AB$ using IEEE floating point arithmetic is forward stable in the following sense. If $\hat{C}$ denotes the computed value, then $$|C - \hat{C}| \leq \gamma_{2n-1} |A||B|, \quad \gamma_k := \frac{ku}{1-ku}.$$ This inequality should be understood in the componentwise sense, i.e. $$|c_{ij} - \hat{c}_{ij}| \leq \gamma_{2n-1} |f_{ij}|, \quad F = |A||B|.$$ Here $u$ is the unit roundoff and $n$ is the number of columns of $A$. It is assumed that $nu<1$ and that the calculation runs to completion without exceeding the representable range (overflow).

How is this relevant in the context of fast algorithms? Any polynomial time algorithm for multiplying $n$-by-$n$ matrices which is stable in the sense described above must use at least $n^3$ multiplications. This is the relevant reference:

W. Miller, “Computational complexity and numerical stability,” SIAM Journal on Computing, vol. 4, no. 2, pp. 97–107, 1975.

It follows that Strassen's algorithm cannot be forward stable in the sense stated above. It does, however, satisfy the normwise bound $$ \|C -\hat{C} \| \leq c(n,k)\, u\, \|A\| \|B\| + O(u^2), \quad u \rightarrow 0, \quad u > 0,$$ where $k$ is the block size at which the recursion switches to conventional matrix-matrix multiplication.

The relevant reference is

N. J. Higham, Accuracy and stability of numerical algorithms. SIAM, 2002.

Do we need forward stability in the componentwise sense? This very much depends on the underlying real-life application. If we ignore the question of accuracy and stability, then we just might endanger the people who depend on our software.
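
To make the componentwise bound above concrete, here is a minimal numerical sketch (my own illustration, under the assumption that the underlying BLAS uses a conventional $O(n^3)$ algorithm): it multiplies two random matrices in single precision, uses a double-precision product as the reference, and checks the error against $\gamma_{2n-1}|A||B|$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)).astype(np.float32)
B = rng.standard_normal((n, n)).astype(np.float32)

C_hat = A @ B                                        # computed in float32
C_ref = A.astype(np.float64) @ B.astype(np.float64)  # double-precision reference

u = np.finfo(np.float32).eps / 2        # unit roundoff for float32
k = 2 * n - 1
gamma = k * u / (1 - k * u)             # gamma_{2n-1} = ku / (1 - ku)
bound = gamma * (np.abs(A.astype(np.float64)) @ np.abs(B.astype(np.float64)))

err = np.abs(C_ref - C_hat)
# Every componentwise ratio should be at most 1 (in practice far below it).
print("max error / bound ratio:", (err / bound).max())
```

With a Strassen-type multiply in place of the conventional one, only the weaker normwise bound is guaranteed.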

Carl Christian
  • 4
    To paraphrase MJD, getting the wrong answer in O(1) time is easy! – hobbs May 01 '22 at 02:37
  • 1
    Are you sure about the standard algorithm being forward stable? Higham's Accuracy and Stability of Numerical Algorithms claims only a worse bound, with $|A|\,|B|$ on the right-hand side instead of $|AB|$. – Federico Poloni Oct 07 '22 at 20:22
  • 2
    @FedericoPoloni You are quite right. I forgot to take the absolute value of the right-hand side when I estimated $|s-\hat{s}| \leq \gamma_{2n-1} |x|^T|y|$, where $s=x^Ty$. My thanks for catching this. – Carl Christian Oct 07 '22 at 21:54
7

Addressing the Boolean part.

Usually, fast matrix multiplication relies heavily on the element type being a ring; in particular, on every element having an additive inverse. For example, Strassen's algorithm contains subtractions. This works fine with, for example, the reals, the integers, the rationals, and finite fields. In particular, you could easily do fast matrix multiplication over $\mathbb{F}_2$, that is, with elements that are bits and addition defined modulo two (so $1+1=0$).

However, in Boolean matrix multiplication the addition of elements is the Boolean disjunction: $1+1=1$ instead of zero. This innocent change means that subtraction no longer works: from $x+1=1$ you cannot know whether $x=0$ or $x=1$. Thus Strassen's algorithm, unmodified, does not work with Booleans.

Yes, this is a slight "impracticality", but it is well known that it can be circumvented relatively easily, by embedding the Booleans into a suitable ring. You can use $b$-bit integers modulo $2^b$, with $b$ chosen large enough that overflows (wraparounds) cannot happen in the fast matrix multiplication. Since each entry of the integer product is a count between $0$ and $n$, it suffices to take $b$ on the order of the logarithm of the matrix size, as the sketch below illustrates.
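
Here is a minimal sketch of the embedding (my own illustration; NumPy's integer matmul stands in for whatever fast ring-based multiplication, such as Strassen's, one would actually use):

```python
import numpy as np

def boolean_matmul(A, B):
    """Boolean matrix product via embedding into the integers.

    Entry (i, j) of the integer product counts the indices k with
    A[i, k] = B[k, j] = 1. The count lies in [0, n], so any integer
    type wider than log2(n) bits cannot overflow; thresholding at 1
    recovers the Boolean disjunction (OR of the AND terms).
    """
    C = A.astype(np.int64) @ B.astype(np.int64)  # ring multiplication
    return C >= 1

A = np.array([[1, 1], [0, 1]], dtype=bool)
B = np.array([[1, 0], [1, 1]], dtype=bool)
print(boolean_matmul(A, B))  # [[True True], [True True]]
```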

In fact there have been implementations of Strassen-like algorithms for Booleans. One example is Karppa & Kaski (2019), Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures (arxiv).

Jukka Kohonen
5

As an example, Strassen's algorithm can multiply two $2n \times 2n$ matrices by doing $7$ multiplications and $18$ additions of $n \times n$ matrices. So you save one block multiplication at the cost of $18$ block additions. You also have to find temporary memory to store intermediate results, so all in all there is quite a bit of overhead. On the other hand, a brute-force $2n \times 2n$ multiplication will typically achieve more operations per second than an $n \times n$ one. So it takes a reasonably large matrix size to make Strassen faster than brute force.
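
For concreteness, here is a minimal sketch of a single level of that recursion (my own illustration; a practical implementation would recurse on the seven block products and fall back to plain multiplication below some tuned threshold):

```python
import numpy as np

def strassen_step(A, B):
    """One level of Strassen: 7 block multiplications instead of 8,
    at the cost of 18 block additions/subtractions."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # The seven block products (each would be a recursive call in practice).
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Reassemble the four blocks of C = A @ B.
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(strassen_step(A, B), A @ B)
```

The temporaries $M_1,\dots,M_7$ and the intermediate sums are exactly the extra memory mentioned above, and the subtractions are what fails over the Boolean semiring, as discussed in the previous answer.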

For faster methods, the overhead is much higher. So high that a matrix must be huge to make the "faster" methods actually faster.

But there is more to consider: Strassen's algorithm may use up to 12.5% fewer operations for very large $n$, but there are plenty of things that have a much larger impact, like processor caches, reducing the latency of operations, combining multiply and add, vectorising, and multi-threading. These things become harder to do when the algorithm gets more complicated. So not only do you need to perform fewer operations, you also need to sustain the same number of operations per second.

  • 2
    Surely Strassen's algorithm can save more than $12.5\%$ if $n$ is so large that nested Strassen is viable? – Will Sawin Apr 29 '22 at 13:13
  • 1
    @WillSawin That is not possible because Strassen's algorithm is already a nested algorithm, which is how we get to the O(...) bound described. You continue to use Strassen's algorithm on the sub blocks until it reaches a threshold where it is more efficient to switch back to brute force. The literature I've seen shows that Strassen's starts to be a better choice around 1024x1024, so said 1024x1024 matrix would use this breakdown once before switching to brute force. A 2048x2048 might use it twice before switching. – Cort Ammon Apr 29 '22 at 17:54
  • 1
    @CortAmmon With that definition of Strassen's algorithm, which I agree is better, my question can be shortened to "Surely Strassen's algorithm can save more than $12.5\%$ if $n$ is large?" - can you really give a negative answer? – Will Sawin Apr 30 '22 at 00:36
  • @WillSawin: The brute force algorithm is equivalent to doing 8 multiplications, and Strassen manages it in 7, so you save 1/8 of the work (the additions are linear and therefore dominated by the multiplications). – Kevin Apr 30 '22 at 01:39
  • 1
    @Kevin Sure, but each multiplication can itself be achieved by doing 7 multiplications instead of the brute force 8, so you save another $1/8$ (asymptotically). The question is only how much of the asymptotic savings can be achieved at practical values. I would be very surprised if the answer is exactly equal to the maximum theoretical savings from doing only one iteration, given that multiple iterations are viable in practice! – Will Sawin Apr 30 '22 at 01:56
  • @WillSawin: The brute force algorithm is also recursive (or at least, it is equivalent to a recursive version). You save 1/8 at each level, but that's equivalent to saving 1/8 overall. – Kevin Apr 30 '22 at 02:02
  • 8
    @Kevin This is absurd - if that were true, the asymptotic savings of Strassen's algorithm would itself be by a factor of $7/8$, and not by a factor of $n^{ \log (7/8) / \log 2}$, as it is. – Will Sawin Apr 30 '22 at 02:09
  • 2
    A 2-layer Strassen with element multiplication and addition taking the same time (not crazy, say, over a finite field) exceeds a factor of 7/8 at 256x256. – Charles Apr 30 '22 at 02:53
  • 2
    You save 1/8 at each level, but also the levels are exponentially smaller because of the savings from the above levels. – kaya3 Apr 30 '22 at 09:25
  • 3
    @WillSawin From that phrasing, I would agree. A single application of Strassen's algorithm saves 12.5%. Iterating it twice saves more. The asymptotic limit given in big-O notation is always the limit of this process. The tricky bit is that each iteration applies only to a matrix that's 2x bigger in each direction. So these 12.5% boons appear further and further apart as n gets large. – Cort Ammon Apr 30 '22 at 19:00
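
To put rough numbers on this exchange: counting block multiplications only, $k$ levels of Strassen replace $8^k$ brute-force block products with $7^k$, so the cost ratio is $(7/8)^k$ (a back-of-the-envelope sketch, ignoring the cost of the additions):

```python
# Cost of k levels of Strassen relative to brute force, counting only
# block multiplications: each level replaces 8 products with 7.
for k in range(1, 6):
    print(f"{k} level(s): {(7 / 8) ** k:.3f} of the brute-force multiplications")
# One level saves 12.5%; three levels already save about 33%.
```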