Standard C++11 code equivalent to the PEXT Haswell instruction (and likely to be optimized by compiler)

Question

The Haswell architecture comes with several new instructions. One of them is PEXT (parallel bits extract), whose functionality is explained by this image (source here):

[Image: pext]

It takes a value r2 and a mask r3, and packs the bits of r2 selected by the mask into the low-order bits of r1.
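For reference, the semantics can be sketched as a naive bit-by-bit loop. This is only an illustration of what the instruction computes, not a candidate for the optimized function asked about below; the name pext_reference and the fixed 32-bit width are assumptions made for the example.

```cpp
#include <cstdint>

// Hypothetical reference implementation of the PEXT semantics (illustration only).
// Walks the mask from bit 0 upward and packs the selected bits of src
// into the low-order bits of the result, preserving their relative order.
inline std::uint32_t pext_reference(std::uint32_t src, std::uint32_t mask) {
    std::uint32_t result = 0;
    unsigned out = 0;  // next free bit position in result
    for (unsigned i = 0; i < 32; ++i) {
        if ((mask >> i) & 1u) {
            result |= ((src >> i) & 1u) << out;
            ++out;
        }
    }
    return result;
}
```

With src = 0xB4 (10110100) and mask = 0x64 (01100100), the selected bits are 0, 1, 1 from high to low, so pext_reference returns 0x3.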

My question is the following: what would be the equivalent code, as an optimized templated function in pure standard C++11, that compilers would be likely to optimize to this instruction in the future?

I don't think compilers will ever try to work out if a function is doing exactly that and compile it to a single instruction. If there was some built-in way to do it, sure, but everybody will have different implementations. So I'm not sure your question is answerable as it is. — Joseph Mansfield, Jan 15 '14 at 17:32
I can't think of a reason why you would need to do that. And why a template? — graham.reeds, Jan 15 '14 at 17:34
Wish I could downvote comments. Graham, just because you don't have a need to twiddle bits around doesn't invalidate the question. When you are writing low-level code (compression? graphics?), fast bit juggling can be hugely valuable. — StilesCrisis, Jan 15 '14 at 20:25
Duplicate of this question in disguise? http://stackoverflow.com/questions/21141848/ Is this still meant for a supercomputer doing this billions of iterations? — user541686, Jan 16 '14 at 10:50
@Mehrdad your comment is invalid. Vincent, if you want this instruction, you could use it directly via inline assembly, instead of working around it in C++. This way, you can be certain that the compiler does what you want, and it should be much clearer to read. What compiler are you using? — Andreas Grapentin, Jan 16 '14 at 10:56
Some people (like me) need to support multiple computing architectures with their bit fiddling code... For people like us, it would really be nice to be able to know the coding pattern that a compiler would automatically recognize as corresponding to pext... Especially if that coding pattern would be equally recognizable for optimization on architectures other than x86_64. — Michael Back, Mar 29 '21 at 10:46

score 4 · Accepted Answer · answered Jan 16 '14 at 10:44

Here is some code from Matthew Fioravante's stdcxx-bitops GitHub repo, which was floated to the std-proposals mailing list as a preliminary proposal to add a constexpr bitwise operations library to C++.

#ifndef HAS_CXX14_CONSTEXPR
#define HAS_CXX14_CONSTEXPR 0
#endif

#if HAS_CXX14_CONSTEXPR
#define constexpr14 constexpr
#else
#define constexpr14
#endif

//Parallel Bits Extract
//x    HGFEDCBA
//mask 01100100
//res  00000GFC
//x86_64 BMI2: PEXT
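template <typename Integral>
constexpr14 Integral extract_bits(Integral x, Integral mask) {
  Integral res = 0;
  // bb is the next output bit; it doubles (shifts left by one) per mask bit.
  for(Integral bb = 1; mask != 0; bb += bb) {
    // mask & -mask isolates the lowest set bit of mask.
    if(x & mask & -mask) {
      res |= bb;
    }
    // Clear the lowest set bit of mask.
    mask &= (mask - 1);
  }
  return res;
}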
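Whether a given compiler turns this loop into a single pext is not something the code can guarantee, so it may be safer to dispatch explicitly. The following is a minimal sketch under some assumptions: it relies on the `__BMI2__` macro that GCC and Clang define when BMI2 is enabled (for example with -mbmi2), uses the `_pext_u32` intrinsic from `<immintrin.h>` in that case, and otherwise falls back to the same loop as extract_bits, fixed to 32 bits. The names pext_portable, pext32, morton_x and morton_y are hypothetical.

```cpp
#include <cstdint>
#if defined(__BMI2__)
#include <immintrin.h>  // _pext_u32, available when BMI2 is enabled at compile time
#endif

// Portable fallback: same algorithm as extract_bits, fixed to 32-bit unsigned.
inline std::uint32_t pext_portable(std::uint32_t x, std::uint32_t mask) {
    std::uint32_t res = 0;
    for (std::uint32_t bb = 1; mask != 0; bb += bb) {
        // mask & (0u - mask) isolates the lowest set bit of mask.
        if (x & mask & (0u - mask)) {
            res |= bb;
        }
        mask &= mask - 1;  // clear the lowest set bit
    }
    return res;
}

// Uses the hardware instruction when the build enables BMI2, else the loop.
inline std::uint32_t pext32(std::uint32_t x, std::uint32_t mask) {
#if defined(__BMI2__)
    return _pext_u32(x, mask);
#else
    return pext_portable(x, mask);
#endif
}

// Example use: 2D Morton decoding, assuming x occupies the even bits
// and y the odd bits of the code.
inline std::uint32_t morton_x(std::uint32_t code) { return pext32(code, 0x55555555u); }
inline std::uint32_t morton_y(std::uint32_t code) { return pext32(code, 0xAAAAAAAAu); }
```

For instance, the Morton code 0x27 interleaves x = 3 and y = 5, so morton_x(0x27) yields 3 and morton_y(0x27) yields 5. The dispatch is decided at compile time; a binary that must run on CPUs both with and without BMI2 would need a runtime check instead.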

This is interesting if it optimizes to pext on x86-64... Does it also optimize to "something cool" on other architectures? (My main application is Morton decoding.) Also, does this optimize as well in C on gcc or clang (without the use of templates)? — Michael Back, Mar 29 '21 at 10:50
