The Haswell architectures comes up with several new instructions. One of them is PEXT (parallel bits extract) whose functionality is explained by this image (source here):
It takes a value r2 and a mask r3 and puts the extracted bits of r2 into r1.
My question is the following: what would be the equivalent code of an optimized templated function in pure standard C++11, that would be likely to be optimized to this instruction by compilers in the future.