
In recent years, several libraries/software projects have appeared that offer some form or another of general-purpose data-driven shared-memory parallelism.

The main idea is that, instead of writing explicitly threaded code, programmers implement their algorithms as interdependent tasks, which are then scheduled dynamically by a general-purpose middleware on a shared-memory machine.
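To make the programming model concrete, here is a minimal sketch using OpenMP tasks (one of the systems listed below); the recursive Fibonacci is only a toy example, and the code assumes a compiler with OpenMP support (compile e.g. with -fopenmp):

    #include <cstdio>

    // Toy example: each recursive call becomes a task that the OpenMP
    // runtime schedules dynamically; taskwait waits for the child tasks.
    int fib(int n) {
      if (n < 2) return n;
      int x, y;
      #pragma omp task shared(x)
      x = fib(n - 1);
      #pragma omp task shared(y)
      y = fib(n - 2);
      #pragma omp taskwait
      return x + y;
    }

    int main() {
      int result = 0;
      #pragma omp parallel
      #pragma omp single      // one thread creates the tasks, all threads execute them
      result = fib(30);
      std::printf("fib(30) = %d\n", result);
      return 0;
    }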

Examples of such libraries are:

  • QUARK: Originally designed for the MAGMA parallel linear algebra library; it seems to have been used for a parallel Fast Multipole Method too.

  • Cilk: Originally an MIT-based project, now supported by Intel, implemented as language/compiler extensions to C; used in the Cilkchess computer chess program and experimentally in FFTW (a short sketch of its spawn/sync style follows this list).

  • SMP superscalar: Developed at the Barcelona Supercomputing Center, similar to Cilk in many respects, based on #pragma extensions.

  • StarPU: A similar library based on "codelets", which can be compiled for and scheduled on several different architectures, including GPUs.

  • OpenMP tasks: As of version 3.0, OpenMP introduced "tasks" that can be scheduled asynchronously (see Section 2.7 of the specification).

  • Intel's Threading Building Blocks: Uses C++ classes to create and launch asynchronous tasks, see Section 11 of the Tutorial.

  • OpenCL: Supports task-based parallelism on multi-cores.
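For comparison, the same toy example written with Cilk's keyword extensions might look roughly as follows; this is only a sketch and assumes a Cilk-capable compiler (e.g. OpenCilk, or the older Intel Cilk Plus branches of gcc/icc):

    #include <cilk/cilk.h>
    #include <cstdio>

    // cilk_spawn forks a task that the runtime may run on another worker;
    // cilk_sync waits for all tasks spawned in the current function.
    int fib(int n) {
      if (n < 2) return n;
      int x = cilk_spawn fib(n - 1);
      int y = fib(n - 2);
      cilk_sync;
      return x + y;
    }

    int main() {
      std::printf("fib(30) = %d\n", fib(30));
      return 0;
    }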

While there is a lot of literature describing the inner workings of these libraries/language extensions and their application to specific problems, I have come across only very few examples of them actually being used in practice in scientific computing applications.

So here's the question: Does anybody know of scientific computing codes using any of these libraries/language extensions, or similar, for shared-memory parallelism?

Pedro
  • Are you looking for task-based parallelism? Is there a reason you skipped OpenCL and Intel TBB? I have to admit I can't tell exactly what you're looking for here. – Aron Ahmadia Sep 13 '12 at 19:00
  • @AronAhmadia: Ignorance, mainly... :) I've added TBB and OpenCL to the list, but the question is still the same: Have these, i.e. their task-based components, been used in any significant piece of software for scientific computing? – Pedro Sep 14 '12 at 08:17
  • How do we feel about turning this question and its answers into a community-wiki vs. trying to scope it further? – Aron Ahmadia Sep 14 '12 at 11:49
  • @AronAhmadia: I'm a bit worried that if I leave the question format, this will quickly degenerate into long discussions on the advantages/disadvantages of task-based and/or shared-memory programming in general. I would, however, be in favour of switching it after it's gotten a few more answers. – Pedro Sep 15 '12 at 11:36
  • The title is not appropriate. This question is about task parallelism, not shared memory. – Jeff Hammond Jul 23 '13 at 06:51

2 Answers


deal.II uses the Threading Building Blocks throughout the library, and by and large we're reasonably happy with it. We've looked at a few alternatives, in particular OpenMP since everyone seems to be using that for simpler codes, but found them lacking. In particular, OpenMP has the huge disadvantage that its task model does not allow you to get a handle to a task you have started, and consequently it is difficult to access the state of a task (e.g. to wait for it to finish) or to retrieve the return values of functions you run as separate tasks. OpenMP is primarily good at parallelizing innermost loops, but parallel efficiency comes from parallelizing the outermost, complex loops; OpenMP is not the tool for that, whereas TBB is reasonably good at it.
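As a minimal illustration of the kind of handle meant here, tbb::task_group is one TBB mechanism that provides it; this sketch is not taken from deal.II, and the work function is made up purely for illustration:

    #include <tbb/task_group.h>
    #include <cstdio>

    // Hypothetical work function, used only to have something to run.
    double assemble_part(int begin, int end) {
      double sum = 0;
      for (int i = begin; i < end; ++i) sum += 1.0 / (i + 1);
      return sum;
    }

    int main() {
      double a = 0, b = 0;
      tbb::task_group tasks;                                   // the "handle" object
      tasks.run([&] { a = assemble_part(0, 500000); });        // launched asynchronously
      tasks.run([&] { b = assemble_part(500000, 1000000); });
      // ... other work could be done here while the tasks run ...
      tasks.wait();                                            // wait for both tasks to finish
      std::printf("result = %f\n", a + b);                     // results via captured variables
      return 0;
    }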

Wolfgang Bangerth
  • Thanks for pointing this out, I had not looked at deal.II! Is there any publication or piece of documentation in which deal.II's use of the TBB is described in detail? – Pedro Sep 15 '12 at 11:24
  • No publication, but this may help: http://www.dealii.org/developer/doxygen/deal.II/group__threads.html – Wolfgang Bangerth Sep 15 '12 at 21:05

In my opinion, these systems have been relatively unsuccessful, primarily for the following reasons.

  • The naive perspective that parallel computation is about parallelizing the computation (e.g. flops) more than exposing memory locality and removing synchronization points. Even though some problems, such as dense matrix algorithms, are still FP-limited, that only occurs after careful consideration of the memory subsystem and most computational kernels (especially in the PDE world) are more memory-sensitive. Work queues tend to trade memory locality for better naive balance of flops and more atomic memory operations (due to synchronization through the queue).
  • Reliance on over-decomposition for dynamic load balance at the expense of strong scalability. Tasks generally have overlapping data dependencies (ghost values). As the size of the interior shrinks, the ghost/interior ratio increases (a small numerical sketch of this follows the list). Even when this does not imply redundant work, it implies increased memory movement. Significant reductions in memory bandwidth requirements can be had by approaches such as cooperative prefetch, by which multiple threads share an L1 or L2 cache by software-prefetching for their neighbor (which implicitly keeps the group of threads approximately coherent). This is exactly the opposite of over-decomposition.
  • Unpredictable performance, mostly due to the memory-related issues above.
  • Lack of library-friendly components. This can almost be summarized as not having an analogue of an MPI_Comm which allows different libraries to perform rich operations without colliding, as well as to pass the context between libraries and recover necessary attributes. The abstraction provided by the "communicator" is important for library composition regardless of whether shared or distributed memory is used.
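As a back-of-the-envelope sketch of the second point, consider a cubic subdomain of side n cells with a ghost width g; the numbers below are arbitrary and only meant to show how the ghost/interior ratio grows as tasks are made smaller:

    #include <cstdio>
    #include <cmath>

    // Illustrative only: ghost-to-interior volume ratio for a cubic
    // subdomain of side n cells with an assumed ghost (halo) width g.
    int main() {
      const int g = 2;                        // assumed ghost width
      for (int n : {64, 32, 16, 8}) {         // shrinking subdomain sizes
        const double interior = std::pow(n, 3);
        const double ghost    = std::pow(n + 2 * g, 3) - interior;
        std::printf("n = %2d   ghost/interior = %.2f\n", n, ghost / interior);
      }
      return 0;
    }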
Jed Brown
  • I may be misunderstanding your answer, but the first point is exactly the opposite of what Buttari, Kurzak, Dongarra and others have shown with MAGMA, a task-based shared-memory library for dense linear algebra... Also, in your second point you refer to overlapping data, i.e. ghost values, and the surface-to-volume ratio, but these are a hold-over from distributed-memory domain decomposition schemes. I myself work with such methods for particle-based codes, and I get much better performance than MPI-based parallel implementations. – Pedro Sep 14 '12 at 08:22
  • The question, in any case, was a different one... Do you know of any scientific computing software projects that use these approaches? – Pedro Sep 14 '12 at 08:23
  • There are a handful of projects using these systems, but I don't think the approach can be considered "successful". 2. The dependencies are still overlapping in shared memory. Look at the way tcmalloc or the Linux kernel makes threads more independent to avoid bottlenecks such as synchronization through atomics. Shared address space does not imply you should operate as though you had uniform memory or that you should consider atomics to be inexpensive. – Jed Brown Sep 14 '12 at 13:30
  • I don't know what "fair comparison" you intend to be citing, but PLASMA only gets about 25% of peak FPU (e.g. slide 5 of http://hpcgarage.org/cscads2012/Luszczek-UTK-PowerTools.pdf), which would be unpublishably bad for the same operation in distributed memory, where at least 70% of peak would be expected. Dense linear algebra is an FPU-bound case that I specifically cited as a possible exception, but despite the enormous matrix sizes, PLASMA is obviously far from being FPU-bound. – Jed Brown Sep 14 '12 at 13:35
  • Pedro, most physics has a long-range component, so particles are coupled with an update which is subject to the surface-to-volume effect above (PPPM, vortex particle, etc.) – Matt Knepley Sep 14 '12 at 13:54
  • @JedBrown: I don't know what "successful" or "fair comparison" you're quoting, as I never used those terms. Again, with regards to MAGMA, it's still roughly twice as fast as Intel's MKL. And again, I'd like to reiterate that the question is not whether such approaches are good or not, but if there are software packages out there in which they have been used. – Pedro Sep 15 '12 at 11:19
  • @MattKnepley: I know, but the particle-mesh and mesh-particle interpolations, which take a significant chunk of the computation time, are also just short-ranged interactions. The rest is then usually just FFTs or sparse linear algebra. There are, of course, tree-codes, but this also already appears to have been done with QUARK. – Pedro Sep 15 '12 at 11:22
  • @Pedro You claimed that my point 1 was refuted by MAGMA (or PLASMA?), which would normally imply a comparison or demonstrable optimality. Comparing to MKL is contrived because MKL does not target that environment. Vendor BLAS implementations systematically under-performed prior to Goto demonstrating how to use L2 effectively. If back then, we applied the same reasoning you seem to be using here, we would have come to some false conclusions. As for the FMM paper, most of Rio's work has been with ultra-scalable distributed memory implementations, which the quark-based paper does not compare to. – Jed Brown Sep 15 '12 at 12:46
  • @JedBrown: This is straying from the original question, but here is a comparison of PLASMA with LAPACK (using MKL/ESSL's threaded BLAS), ScaLAPACK (using MPI) and Intel/IBM's own threaded math libraries on two different 16/32 core Intel/Power architectures. PLASMA beats them in almost all cases. Unless you can point me to a direct comparison with a different parallel library that says otherwise, I won't be convinced that task-based parallelism is a bad idea (in this case). – Pedro Sep 15 '12 at 13:19
  • @Pedro Someone publishing that they can solve problem X using method M should not be interpreted as the authors claiming that M is the best way to solve problem X. ;-) The Elemental paper reaches a higher fraction of peak using 8x more cores on a smaller problem (that uses less than 1% of memory). From their experience, the larger problem sizes easily deliver high fractions of peak, but due to cubic complexity, burn through allocations. We may get access to run Elemental on the Altix so we can provide your desired direct comparison. – Jed Brown Sep 15 '12 at 15:27
  • @JedBrown: No, but a peer-reviewed paper showing that method M can solve problem X faster than methods N, P, and Q, can be interpreted as M being better than N, P, and Q for that problem. It's hard to compare the Elemental results with those in the PLASMA paper, since the number of CPUs and the problem size differ quite extremely. For the Cholesky decomposition at 10k rows, the only result in common, Elemental is as fast as ScaLAPACK on 8192 cores. PLASMA is 2.5x and 1.5x faster on 16/32 cores respectively. – Pedro Sep 15 '12 at 15:47