
In recent years, several libraries/software projects have appeared that offer some form or another of general-purpose data-driven shared-memory parallelism.

The main idea is that, instead of writing explicitly threaded code, programmers implement their algorithms as interdependent tasks, which are then scheduled dynamically by a general-purpose middleware on a shared-memory machine.
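To make the programming model concrete, here is a minimal sketch using OpenMP tasks (one of the systems listed below); the recursive Fibonacci is only a toy example, and the code assumes a compiler with OpenMP support (compile e.g. with -fopenmp):

    #include <cstdio>

    // Toy example: each recursive call becomes a task that the OpenMP
    // runtime schedules dynamically; taskwait waits for the child tasks.
    int fib(int n) {
      if (n < 2) return n;
      int x, y;
      #pragma omp task shared(x)
      x = fib(n - 1);
      #pragma omp task shared(y)
      y = fib(n - 2);
      #pragma omp taskwait
      return x + y;
    }

    int main() {
      int result = 0;
      #pragma omp parallel
      #pragma omp single      // one thread creates the tasks, all threads execute them
      result = fib(30);
      std::printf("fib(30) = %d\n", result);
      return 0;
    }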

Examples of such libraries are:

  • QUARK: Originally designed for the MAGMA parallel linear algebra library; it seems to have been used for a parallel Fast Multipole Method too.

  • Cilk: Originally an MIT-based project, now supported by Intel, implemented as language/compiler extensions to C; used in the Cilkchess computer chess program and experimentally in FFTW (a short sketch of its spawn/sync style follows this list).

  • SMP superscalar: Developed at the Barcelona Supercomputing Center, similar to Cilk in many respects, based on #pragma extensions.

  • StarPU: A similar library based on "codelets", which can be compiled for and scheduled on several different architectures, including GPUs.

  • OpenMP tasks: As of version 3.0, OpenMP introduced "tasks" that can be scheduled asynchronously (see Section 2.7 of the specification).

  • Intel's Threading Building Blocks: Uses C++ classes to create and launch asynchronous tasks, see Section 11 of the Tutorial.

  • OpenCL: Supports task-based parallelism on multi-cores.
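For comparison, the same toy example written with Cilk's keyword extensions might look roughly as follows; this is only a sketch and assumes a Cilk-capable compiler (e.g. OpenCilk, or the older Intel Cilk Plus branches of gcc/icc):

    #include <cilk/cilk.h>
    #include <cstdio>

    // cilk_spawn forks a task that the runtime may run on another worker;
    // cilk_sync waits for all tasks spawned in the current function.
    int fib(int n) {
      if (n < 2) return n;
      int x = cilk_spawn fib(n - 1);
      int y = fib(n - 2);
      cilk_sync;
      return x + y;
    }

    int main() {
      std::printf("fib(30) = %d\n", fib(30));
      return 0;
    }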

While there is a lot of literature describing the inner workings of these libraries/language extensions and their application to specific problems, I have come across only very few examples of them actually being used in practice in scientific computing applications.

So here's the question: Does anybody know of scientific computing codes using any of these libraries/language extensions, or similar, for shared-memory parallelism?

Pedro
  • Are you looking for task-based parallelism? Is there a reason you skipped OpenCL and Intel TBB? I have to admit I can't tell exactly what you're looking for here. – Aron Ahmadia Sep 13 '12 at 19:00
  • @AronAhmadia: Ignorance, mainly... :) I've added TBB and OpenCL to the list, but the question is still the same: Have these, i.e. their task-based components, been used in any significant piece of software for scientific computing? – Pedro Sep 14 '12 at 08:17
  • How do we feel about turning this question and its answers into a community-wiki vs. trying to scope it further? – Aron Ahmadia Sep 14 '12 at 11:49
  • @AronAhmadia: I'm a bit worried that if I leave the question format, this will quickly degenerate into long discussions on the advantages/disadvantages of task-based and/or shared-memory programming in general. I would, however, be in favour of switching it after it's gotten a few more answers. – Pedro Sep 15 '12 at 11:36
  • The title is not appropriate. This question is about task parallelism, not shared memory. – Jeff Hammond Jul 23 '13 at 06:51

2 Answers


deal.II uses the Threading Building Blocks throughout the library, and by and large we're reasonably happy with it. We've looked at a few alternatives, in particular OpenMP since everyone seems to be using that for simpler codes, but found them lacking. In particular, OpenMP has the huge disadvantage that its task model does not allow you to get a handle to a task you have started, and consequently it is difficult to access the state of a task (e.g. to wait for it to finish) or to retrieve the return values of functions you run as separate tasks. OpenMP is primarily good at parallelizing innermost loops, but parallel efficiency comes from parallelizing the outermost, complex loops; OpenMP is not the tool for that, whereas TBB is reasonably good at it.
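As a minimal illustration of the kind of handle meant here, tbb::task_group is one TBB mechanism that provides it; this sketch is not taken from deal.II, and the work function is made up purely for illustration:

    #include <tbb/task_group.h>
    #include <cstdio>

    // Hypothetical work function, used only to have something to run.
    double assemble_part(int begin, int end) {
      double sum = 0;
      for (int i = begin; i < end; ++i) sum += 1.0 / (i + 1);
      return sum;
    }

    int main() {
      double a = 0, b = 0;
      tbb::task_group tasks;                                   // the "handle" object
      tasks.run([&] { a = assemble_part(0, 500000); });        // launched asynchronously
      tasks.run([&] { b = assemble_part(500000, 1000000); });
      // ... other work could be done here while the tasks run ...
      tasks.wait();                                            // wait for both tasks to finish
      std::printf("result = %f\n", a + b);                     // results via captured variables
      return 0;
    }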

Wolfgang Bangerth
  • Thanks for pointing this out, I had not looked at deal.II! Is there any publication or piece of documentation in which deal.II's use of the TBB is described in detail? – Pedro Sep 15 '12 at 11:24
  • No publication, but this may help: http://www.dealii.org/developer/doxygen/deal.II/group__threads.html – Wolfgang Bangerth Sep 15 '12 at 21:05

In my opinion, these systems have been relatively unsuccessful, primarily for the following reasons.

  • The naive perspective that parallel computation is about parallelizing the computation (e.g. flops) more than exposing memory locality and removing synchronization points. Even though some problems, such as dense matrix algorithms, are still FP-limited, that only occurs after careful consideration of the memory subsystem and most computational kernels (especially in the PDE world) are more memory-sensitive. Work queues tend to trade memory locality for better naive balance of flops and more atomic memory operations (due to synchronization through the queue).
  • Reliance on over-decomposition for dynamic load balance at the expense of strong scalability. Tasks generally have overlapping data dependencies (ghost values). As the size of the interior shrinks, the ghost/interior ratio increases (a small numerical sketch of this follows the list). Even when this does not imply redundant work, it implies increased memory movement. Significant reductions in memory bandwidth requirements can be had by approaches such as cooperative prefetch, by which multiple threads share an L1 or L2 cache by software-prefetching for their neighbor (which implicitly keeps the group of threads approximately coherent). This is exactly the opposite of over-decomposition.
  • Unpredictable performance, mostly due to the memory-related issues above.
  • Lack of library-friendly components. This can almost be summarized as not having an analogue of an MPI_Comm which allows different libraries to perform rich operations without colliding, as well as to pass the context between libraries and recover necessary attributes. The abstraction provided by the "communicator" is important for library composition regardless of whether shared or distributed memory is used.
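As a back-of-the-envelope sketch of the second point, consider a cubic subdomain of side n cells with a ghost width g; the numbers below are arbitrary and only meant to show how the ghost/interior ratio grows as tasks are made smaller:

    #include <cstdio>
    #include <cmath>

    // Illustrative only: ghost-to-interior volume ratio for a cubic
    // subdomain of side n cells with an assumed ghost (halo) width g.
    int main() {
      const int g = 2;                        // assumed ghost width
      for (int n : {64, 32, 16, 8}) {         // shrinking subdomain sizes
        const double interior = std::pow(n, 3);
        const double ghost    = std::pow(n + 2 * g, 3) - interior;
        std::printf("n = %2d   ghost/interior = %.2f\n", n, ghost / interior);
      }
      return 0;
    }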
Jed Brown
  • I may be misunderstanding your answer, but the first point is exactly the opposite of what Buttari, Kurzak, Dongarra and others have shown with MAGMA, a task-based shared-memory library for dense linear algebra... Also, in your second point you refer to overlapping data, i.e. ghost values, and the surface-to-volume ratio, but these are a hold-over from distributed-memory domain decomposition schemes. I myself work with such methods for particle-based codes, and I get much better performance than MPI-based parallel implementations. – Pedro Sep 14 '12 at 08:22
  • The question, in any case, was a different one... Do you know of any scientific computing software projects that use these approaches? – Pedro Sep 14 '12 at 08:23
  • There are a handful of projects using these systems, but I don't think the approach can be considered "successful". 2. The dependencies are still overlapping in shared memory. Look at the way tcmalloc or the Linux kernel makes threads more independent to avoid bottlenecks such as synchronization through atomics. Shared address space does not imply you should operate as though you had uniform memory or that you should consider atomics to be inexpensive. – Jed Brown Sep 14 '12 at 13:30
  • I don't know what "fair comparison" you intend to be citing, but PLASMA only gets about 25% of peak FPU (e.g. slide 5 of http://hpcgarage.org/cscads2012/Luszczek-UTK-PowerTools.pdf), which would be unpublishably bad for the same operation in distributed memory, where at least 70% of peak would be expected. Dense linear algebra is an FPU-bound case that I specifically cited as a possible exception, but despite the enormous matrix sizes, PLASMA is obviously far from being FPU-bound. – Jed Brown Sep 14 '12 at 13:35
  • Pedro, most physics has a long-range component, so particles are coupled with an update which is subject to the surface-to-volume effect above (PPPM, vortex particle, etc.) – Matt Knepley Sep 14 '12 at 13:54
  • @JedBrown: I don't know what "successful" or "fair comparison" you're quoting, as I never used those terms. Again, with regards to MAGMA, it's still roughly twice as fast as Intel's MKL. And again, I'd like to reiterate that the question is not whether such approaches are good or not, but if there are software packages out there in which they have been used. – Pedro Sep 15 '12 at 11:19
  • @MattKnepley: I know, but the particle-mesh and mesh-particle interpolations, which take a significant chunk of the computation time, are also just short-ranged interactions. The rest is then usually just FFTs or sparse linear algebra. There are, of course, tree-codes, but this also already appears to have been done with QUARK. – Pedro Sep 15 '12 at 11:22
  • @Pedro You claimed that my point 1 was refuted by MAGMA (or PLASMA?), which would normally imply a comparison or demonstrable optimality. Comparing to MKL is contrived because MKL does not target that environment. Vendor BLAS implementations systematically under-performed prior to Goto demonstrating how to use L2 effectively. If back then, we applied the same reasoning you seem to be using here, we would have come to some false conclusions. As for the FMM paper, most of Rio's work has been with ultra-scalable distributed memory implementations, which the quark-based paper does not compare to. – Jed Brown Sep 15 '12 at 12:46
  • @JedBrown: This is straying from the original question, but here is a comparison of PLASMA with LAPACK (using MKL/ESSL's threaded BLAS), ScaLAPACK (using MPI) and Intel/IBM's own threaded math libraries on two different 16/32 core Intel/Power architectures. PLASMA beats them in almost all cases. Unless you can point me to a direct comparison with a different parallel library that says otherwise, I won't be convinced that task-based parallelism is a bad idea (in this case). – Pedro Sep 15 '12 at 13:19
  • @Pedro Someone publishing that they can solve problem X using method M should not be interpreted as the authors claiming that M is the best way to solve problem X. ;-) The Elemental paper reaches a higher fraction of peak using 8x more cores on a smaller problem (that uses less than 1% of memory). From their experience, the larger problem sizes easily deliver high fractions of peak, but due to cubic complexity, burn through allocations. We may get access to run Elemental on the Altix so we can provide your desired direct comparison. – Jed Brown Sep 15 '12 at 15:27
  • @JedBrown: No, but a peer-reviewed paper showing that method M can solve problem X faster than methods N, P, and Q, can be interpreted as M being better than N, P, and Q for that problem. It's hard to compare the Elemental results with those in the PLASMA paper, since the number of CPUs and the problem size differ quite extremely. For the Cholesky decomposition at 10k rows, the only result in common, Elemental is as fast as ScaLAPACK on 8192 cores. PLASMA is 2.5x and 1.5x faster on 16/32 cores respectively. – Pedro Sep 15 '12 at 15:47