
Is there an intrinsic or another efficient way to repack the high/low 32-bit components of the 64-bit elements of an AVX register into an SSE register? A solution using AVX2 is OK.

So far I'm using the following code, but profiler says it's slow on Ryzen 1800X:

// Global constant
const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);

// ...

// function code
__m256i x = /* computed here */;
const __m128i high32 = _mm256_castsi256_si128(
  _mm256_permutevar8x32_epi32(x, gHigh32Permute)); // This seems to take 3 cycles
Serge Rogatch
    So you want to extract the odd or even-numbered 32-bit elements? i.e. like AVX512 `_mm256_cvtepi64_epi32` (`vpmovqd`)? I don't think you're going to beat 1 shuffle instruction with 3-cycle latency, because lane-crossing shuffles always have 3c latency on Intel CPUs. Your `vpermd` solution has single-cycle throughput. – Peter Cordes Aug 24 '17 at 17:24
  • If you need it to be faster, you're going to have to make the surrounding code use it less, or not require lane-crossing or something! Or maybe somehow pack two sources into a 256b result with `shufps` (except it's not lane-crossing so it doesn't solve your problem, and there's no `vpackqd` instruction and pack instructions aren't lane-crossing either.) – Peter Cordes Aug 24 '17 at 17:27
  • @PeterCordes, yes, I want to extract odd- or even-numbered 32-bit elements from a 256-bit register into a 128-bit register. Thanks for the reference to AVX512! I don't have it on Ryzen 1800X, but I'm looking forward to migrating to it eventually... These 32-bit elements are the high and low parts of 64-bit doubles, so I don't see a way to change the surrounding code. – Serge Rogatch Aug 24 '17 at 18:08
  • Well do they have to be in a `__m128i`, or can you use an in-lane shuffle to put the low and high halves into the bottom 2 elements of each lane of a `__m256i`? If you're tuning for Ryzen, it probably does make sense to get it down to 128b, though. But maybe `vextractf128` and then use a 2-source shuffle (like `shufps`) will be better on Ryzen, where lane-crossing shuffles are very slow. – Peter Cordes Aug 24 '17 at 18:15

1 Answer


On Intel, your code would be optimal. One 1-uop instruction is the best you will get. (Except you might want to use vpermps to avoid any risk of an int / FP bypass delay, if your input vector was created by a pd instruction rather than a load or something. Using the result of an FP shuffle as an input to integer instructions is usually fine on Intel, but I'm less sure about feeding the result of an FP instruction to an integer shuffle.)
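
For example, a minimal sketch of that vpermps variant, reusing the gHigh32Permute constant from the question:

__m256d x = /* computed here */;

// Same lane-crossing permute, but kept in the FP domain (vpermps instead of vpermd)
__m256 permuted = _mm256_permutevar8x32_ps(_mm256_castpd_ps(x), gHigh32Permute);
__m128i high32 = _mm_castps_si128(_mm256_castps256_ps128(permuted)); // odd (high) 32-bit halves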

Although if tuning for Intel, you might try changing the surrounding code so you can shuffle into the bottom 64 bits of each 128b lane, to avoid using a lane-crossing shuffle. (Then you could just use vshufps ymm, or if tuning for KNL, vpermilps, since 2-input vshufps is slower there.)
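
A sketch of that in-lane idea, assuming two 256b source vectors and a consumer that can work with results laid out per 128b lane rather than contiguously:

__m256d a = /* computed here */, b = /* computed here */;
__m256 af = _mm256_castpd_ps(a), bf = _mm256_castpd_ps(b);

// Within each 128b lane, the low 2 elements come from af and the high 2 from bf:
// odd lane 0 = { hi32(a0), hi32(a1), hi32(b0), hi32(b1) }; lane 1 likewise for a2, a3, b2, b3
__m256 odd  = _mm256_shuffle_ps(af, bf, _MM_SHUFFLE(3,1,3,1)); // high 32-bit halves
__m256 even = _mm256_shuffle_ps(af, bf, _MM_SHUFFLE(2,0,2,0)); // low 32-bit halves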

With AVX512, there's _mm256_cvtepi64_epi32 (vpmovqd) which packs elements across lanes, with truncation.
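
For reference, a sketch of the AVX512VL version (not available on Ryzen 1800X); vpmovqd truncates, so getting the high halves needs a 64-bit right shift first:

__m256i xi = _mm256_castpd_si256(x);                               // reinterpret the doubles as 64-bit integers
__m128i low32  = _mm256_cvtepi64_epi32(xi);                        // vpmovqd: low (even) 32-bit halves
__m128i high32 = _mm256_cvtepi64_epi32(_mm256_srli_epi64(xi, 32)); // shift, then vpmovqd: high (odd) halves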


On Ryzen, lane-crossing shuffles are slow. Agner Fog doesn't have numbers for vpermd, but he lists vpermps (which probably uses the same hardware internally) at 3 uops, 5c latency, one per 4c throughput.

vextractf128 xmm, ymm, 1 is very efficient on Ryzen (1c latency, 0.33c throughput), not surprising since it tracks 256b registers as two 128b halves already. shufps is also efficient (1c latency, 0.5c throughput), and will let you shuffle the two 128b registers into the result you want.

This also saves you 2 registers for the 2 vpermps shuffle masks you don't need anymore.

So I'd suggest:

__m256d x = /* computed here */;

// Tuned for Ryzen.  Sub-optimal on Intel
__m128 hi = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));     // vextractf128: x[2], x[3]
__m128 lo = _mm_castpd_ps(_mm256_castpd256_pd128(x));       // no instruction: x[0], x[1]
__m128 odd  = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3,1,3,1)); // high 32-bit halves of x[0..3]
__m128 even = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2,0,2,0)); // low 32-bit halves of x[0..3]

On Intel, using 3 shuffles instead of 2 gives you 2/3rds of the optimal throughput, with 1c extra latency for the first result.

Peter Cordes
  • I've measured that `const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(_mm256_castpd_si256(x), gHigh32Permute));` is faster than `const __m128i high32 = _mm_castps_si128( _mm256_castps256_ps128(_mm256_permutevar8x32_ps(_mm256_castpd_ps(x), gHigh32Permute) ));` . So perhaps there is also a penalty for `double` to `float` bypass? – Serge Rogatch Aug 26 '17 at 21:22
  • @SergeRogatch: Unlikely for shuffles. More likely, `vpermd` just performs differently from `vpermps`. (Agner didn't list them both so I had to guess). Or that whatever you're consuming the result with does better when it's coming from an integer shuffle? AMD has had float vs. double differences for actual FP math instructions, though, according to Agner. (Almost always irrelevant of course, but it's a clue about the internal implementation, like maybe there's some extra tag bits stored with a vector.) – Peter Cordes Aug 26 '17 at 21:25
  • Shouldn't `hi` and `lo` be swapped in `__m128 odd = _mm_shuffle_ps(hi, lo, _MM_SHUFFLE(3,1,3,1));` ? – Serge Rogatch Aug 26 '17 at 22:17
  • @SergeRogatch: good catch, yeah the low 2 elements of the result come from the first source operand. – Peter Cordes Aug 26 '17 at 22:25
  • Confirmed in debug: `(lo, hi, ...)` is the right order. – Serge Rogatch Aug 26 '17 at 22:54
  • @SergeRogatch: you said something about confusing documentation... See http://felixcloutier.com/x86/SHUFPS.html (or the original Intel vol.2 PDF it was extracted from for instructions where the diagrams get messed up). The "Operation" section has detailed pseudocode for everything, and often there are good diagrams and tables. (e.g. for cmpps, look at cmppd because it's the alphabetically first, so they put the good stuff there.) The online "intrinsics finder" is good, but sometimes has a mistake or leaves out some important detail. And it never has diagrams. – Peter Cordes Aug 26 '17 at 22:58