I'm looking for a library to exploit parallelism within a single heterogeneous computing node (possibly using accelerators such as the Xeon Phi or NVIDIA GPGPUs) in a C++ FV/DG code that uses hierarchical octree-like grids. It should
- support multiple back-ends (e.g. OpenCL, CUDA, OpenMP, OpenACC, ...)
- hopefully be generic enough to support back-ends from the future,
- be easy to install/configure,
- be easy to use.
Linear algebra would be nice, but at a minimum the library should be able to apply a simple transform with a user-defined kernel on a computing device:
auto vd = device_vector<double>{ 11., 22., 33., 44. };
transform(begin(vd), end(vd), begin(vd), [](double vd_i){ return 2. * vd_i; });
host_vector<double> vh = vd; // no-op if the device is the CPU
for (auto&& vh_i : vh) { cout << vh_i << "\n"; } // 22, 44, 66, 88
I've looked at Intel TBB, OpenMP, OpenACC, AMD's Bolt, and NVIDIA's Thrust.
Thrust seems to be the best fit for my application because:
- it provides multiple backends: CUDA, TBB, and OpenMP (but no OpenCL),
- it has a familiar STL-like interface: host/device containers, iterators, and algorithms,
- the documentation seems nice.
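For what it's worth, Thrust selects its backend at compile time via the THRUST_DEVICE_SYSTEM macro, so the same source should be able to target each backend without code changes. A sketch of the compiler invocations (file names are illustrative; the exact flags and library paths depend on your toolchain):

```shell
# CUDA backend (the default when compiling with nvcc)
nvcc -O2 kernel.cu -o app_cuda

# OpenMP backend with a host compiler
g++ -O2 -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP kernel.cpp -o app_omp

# TBB backend
g++ -O2 -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_TBB kernel.cpp -ltbb -o app_tbb
```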
However, I have no experience at all (and don't know anyone who does) building a hybrid MPI-Thrust application.
So, to my questions:
- Is there any other library worth looking into that might fit my needs better?
- Does anyone have experience with hybrid MPI-Thrust applications and can comment on how good a fit Thrust is for such a thing?