
When reading about discontinuous Galerkin (DG) methods, one often finds the argument that these methods allow high-order accuracy while maintaining a compact stencil (a cell communicates only with its direct neighbors), and that this is beneficial for parallel computations.

I can understand why a wider stencil would be bad for parallelization with domain decomposition: it would require more than one layer of overlap and thereby increase the communication cost. But how big is this penalty in practice?
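To make the cost of extra overlap concrete, here is a back-of-the-envelope sketch in C++; the block size and halo widths are purely illustrative, not taken from any particular code. For a structured block of N^3 cells per rank, a halo of width w holds (N+2w)^3 - N^3 ghost cells, so the communicated volume grows roughly linearly in w:

```cpp
// Back-of-the-envelope sketch: ghost-cell count for a structured
// N^3 block per MPI rank as a function of halo width w.
// w = 1 corresponds to a compact stencil (direct neighbors only);
// a wider stencil needs w > 1 layers of overlap.
#include <cstdio>

// Ghost cells to be received: padded block minus owned block.
long long ghost_cells(long long N, long long w) {
    const long long p = N + 2 * w;   // padded extent per direction
    return p * p * p - N * N * N;
}

int main() {
    const long long N = 64;          // illustrative block size per rank
    for (long long w = 1; w <= 4; ++w) {
        const long long g = ghost_cells(N, w);
        std::printf("halo width %lld: %lld ghost cells (%.1f%% of owned)\n",
                    w, g, 100.0 * g / double(N * N * N));
    }
}
```

For N = 64 this gives roughly 10% extra data at w = 1 and about 20% at w = 2, so the question stands: does this show up as a measurable penalty in real runs?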

– chris

1 Answer


I think that from a practical perspective, it's not an important point. Sure, you have less data to send around, and to fewer neighbors, but I don't think I've ever seen anyone quantify the impact in any meaningful way.

Papers about DG tend to repeat the same arguments in favor of DG methods over and over, without attribution to a source and without quantitative backup for the claims. This includes the argument you cite here. Not all of these claims would stand up to critical review if someone actually compared DG codes against corresponding continuous Galerkin (CG) implementations.

– Wolfgang Bangerth
  • For critical review, it can be worthwhile to compare DG, continuous Galerkin (CG), and high-order finite differences (FD) on different architectures. A common conclusion is that each method can be optimized for a particular architecture in its own way. For example, high-order (wide-stencil) FD is well suited to OpenMP threading and to efficient cache use on CPUs, since the stencil is wide in only one direction at a time, so memory access along that direction stays contiguous (see the sketch after this thread). – Jesse Chan Mar 22 '15 at 16:22
  • @JesseChan: Sure, but then all performance measures would be vacuous. I think a fair comparison would involve a run-of-the-mill current x86_64 cluster with a few hundred processor cores. This is available to most research groups and industrial researchers today, and will continue to be for a while. – Wolfgang Bangerth Mar 22 '15 at 23:09
  • I see. That is certainly middle of the road and would provide a relatively even baseline performance estimate. – Jesse Chan Mar 23 '15 at 03:54
  • So what's the problem in quantifying the difference on such a cluster? Both continuous and discontinuous Galerkin have been around for a while, and no one has compared the two in terms of parallel performance? – chris Mar 23 '15 at 07:43
  • I believe @WolfgangBangerth has commented previously on the blanket statement that "DG is more parallelizable": http://scicomp.stackexchange.com/questions/11067/is-discontinuous-galerkin-really-any-more-parallelizable-than-continuous-galerki (as you can see, the answer to that question is fairly contested even in the answer thread). – Jesse Chan Mar 23 '15 at 17:34
  • The issue with blanket statements like that is that there are special cases for which you choose a specific method. For example, in explicit time-stepping, if you can deal with a tensor-product grid, CG in the form of the Spectral Element Method is often the favorite. On more general meshes, DG can be more scalable (see https://www3.nd.edu/~coast/reports_papers/2009-JSC-kbdwm.pdf), especially at high order. For time-implicit or steady problems, you have to consider the solver or preconditioner as well. If you can specify metrics (e.g., matrix assembly), it may be easier to compare the two. – Jesse Chan Mar 23 '15 at 17:38
  • @chris: Well, yes, it shouldn't be that hard to do. But I would think that the onus of doing such a comparison should be on the proponents of the more complicated (and newer) method, i.e., those who write papers on DG methods that cite this statement. Unfortunately, that part of our community never seems to have taken up this challenge (or at least I don't know of a reference). – Wolfgang Bangerth Mar 23 '15 at 18:29
  • @JesseChan: thanks for the link to the JSC paper, section 3.2 is quite convincing. – chris Mar 26 '15 at 13:13
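To illustrate the access pattern described in the first comment (a stencil that is wide, but only in one direction), here is a minimal C++ sketch; the function name, grid layout, and coefficients are illustrative assumptions, not taken from any code discussed above:

```cpp
// Minimal sketch (illustrative names and sizes): a standard fourth-order
// finite-difference stencil for u_xx that is wide only in x, applied to
// a row-major nx-by-ny grid. Each row is contiguous in memory, so the
// inner loop is unit-stride and rows parallelize cleanly with OpenMP.
#include <cstddef>
#include <vector>

void d2dx2(const std::vector<double>& u, std::vector<double>& out,
           std::size_t nx, std::size_t ny, double h) {
    const double c = 1.0 / (12.0 * h * h);
    #pragma omp parallel for
    for (std::size_t j = 0; j < ny; ++j) {        // one row per iteration
        const double* row = &u[j * nx];
        double* res = &out[j * nx];
        for (std::size_t i = 2; i + 2 < nx; ++i)  // unit stride in x
            res[i] = c * (-row[i - 2] + 16.0 * row[i - 1] - 30.0 * row[i]
                          + 16.0 * row[i + 1] - row[i + 2]);
    }
}
```

The same stencil applied in y strides by nx between loads, which is where the cache argument weakens; compile with -fopenmp (GCC/Clang) to enable the pragma.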