
My project involves large-scale simulation and estimation. For each simulation I need to solve 600,000 systems of nonlinear equations. Currently I am using Newton's method to find the solutions, which involves inverting 600,000 Jacobian matrices at each iteration. With 100 iterations for convergence, my MATLAB implementation with a C mex file takes 250 seconds for one simulation on one core. This speed makes estimation infeasible.

Could you tell me the fastest way to invert a large number of small matrices? All the matrices are between 2×2 and 6×6. My gut feeling is that CUDA might be the only feasible option for estimation. I am currently translating the code to MKL Fortran, and I have no experience with CUDA at all. So please give me some advice; I need to decide soon which platform to implement.

GoHokies
THwang
    Since you already know the matrix dimensions, can you just code up the analytical solution? Ex. https://www.dr-lex.be/random/matrix-inv.html – 0b1100001 Mar 14 '18 at 17:32
  • Are you convinced that the code is spending a lot of time inverting those small matrices in the Newton loop (have you profiled it)? If you search for "invert small matrices" on this site you will find some good answers to questions very similar to your own - like this one – GoHokies Mar 14 '18 at 18:02
  • Also, it always pays to think twice before explicitly inverting a matrix. I'm assuming your Jacobians change between iterations, so (at least for size 4x4 and above) it may actually be faster to use LU instead of an explicit inverse. YMMV. – GoHokies Mar 14 '18 at 18:06
  • Agree with @GoHokies, explicit inversion may not be needed if it's an intermediate step for downstream calculations. The fastest way to invert may be to not do it at all. – Nuclear Hoagie Mar 14 '18 at 18:16
  • If your objective is to solve a system of linear equations, are there any cases where an explicit inverse WOULD be better than LU decomposition? (I can't think of any.) – Bill Greene Mar 14 '18 at 18:37
  • @GoHokies, thank you for your advice but I am already using mldivide in matlab and dgesv in fortran, which utilize LU...still not fast enough. Any idea whether CUDA might be feasible? – THwang Mar 14 '18 at 18:55
  • @maverick, thank you so much. According to some answers here, an analytical solution might be faster. Do you have any clue how fast CUDA could go, since the estimation requires 20,000 evaluations? – THwang Mar 14 '18 at 19:01
  • I suggest you provide more details of your implementation. For example, exactly what parts are you doing in matlab and what parts in your mex function? Do you mean that you REPLACED an implementation using mldivide with one that uses dgesv? – Bill Greene Mar 14 '18 at 19:03
  • @BillGreene, in my prototyping code I used mldivide() for inverting the matrices, which uses LU decomposition to solve linear equations; the C mex is generated by MATLAB Coder (I do not know how to program in C), and the time for one simulation is 240 seconds. My Fortran code uses dgesv, which is 6 times faster than MATLAB with the C mex. I still want to improve the speed for estimation. Thank you! – THwang Mar 14 '18 at 19:07
  • I suggest you don't use those explicit inverse formulas that @maverick posted. Cramer's rule is not backward stable so your implementation won't be very robust. – GoHokies Mar 14 '18 at 19:11
  • How many of those 240sec per simulation are spent inside the mldivide or dgesv routines? – GoHokies Mar 14 '18 at 19:18
  • @GoHokies, nearly all the time is spent on inverting those matrices: one simulation inverts 600,000 matrices on each of 100 iterations. My code uses Newton to solve most of the systems of equations, then uses fsolve (or a NAG Fortran library function) for those that Newton could not solve. The MATLAB code takes 250 s in total, 240 s of which are spent inverting matrices. – THwang Mar 14 '18 at 19:26
  • @GoHokies didn't know about Cramer's rule being unstable, good to know. Thanks. – 0b1100001 Mar 14 '18 at 20:46
  • What kind of nonlinear equations do you have? Quadratic? Polynomial? – Rodrigo de Azevedo Mar 15 '18 at 19:40
  • @RodrigodeAzevedo, system of multinomial logit functions without closed form solution – THwang Mar 17 '18 at 17:57
  • If your 600 000 matrices differ by a few elements, you can use the Sherman-Morrison formula: https://en.wikipedia.org/wiki/Sherman%E2%80%93Morrison_formula – Anthony Scemama May 07 '18 at 21:58

1 Answer


I know this is not a definitive answer, but it may give you an idea of how to proceed with CUDA. At this stage it is difficult to give precise advice, because that would require detailed knowledge of your actual code.

As already written in the comments, it is generally better not to invert the matrix but to solve the linear system instead (for details about why, see this question).

There is a chance that CUDA can help you, but you should consider a few aspects first. In general, the real bottleneck in a CUDA application is memory transfer: sometimes it is better to recompute values than to move them, you must consider the memory available on the device, and in some cases the transfers cost more time than simply staying on the CPU. Another point to watch is branching (if conditions) along the flow you want to parallelize: when threads in a warp diverge, the two branches are executed in sequence, with a big performance penalty. Finally, what precision do you need? This matters for the choice of hardware, and it can also depend on the algorithm you use.

With this in mind, you can start to study which parts of your algorithm would benefit most from parallelization. For example, is it better to assign one linear system to each CUDA kernel, or to parallelize only some tasks? One idea, not too radical a change to your code, is to parallelize the solution of the linear systems (i.e., the matrix inversions). Another, if you use feval, is to try parallel feval in MATLAB. You could also consider variants of Newton's method to obtain a speedup; this is not tied to the use of a GPU.


Some links:

  • The CUDA solver library.
  • A PDF with a CUDA example for a particular case of Newton's method.
  • Article (PDF) with a variant of Newton's method in CUDA: "Parallel Interval Newton Method on CUDA" by Philip-Daniel Beck and Marco Nehmeier.
Mauro Vanzetto