
Intel provides a C-style intrinsic named _mm256_madd_epi16, which is documented as:

__m256i _mm256_madd_epi16 (__m256i a, __m256i b)

Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in dst.
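
In scalar terms, that is (a rough sketch of the documented behaviour; a16, b16 and dst are hypothetical arrays holding the packed 16-bit inputs and 32-bit outputs):

for (int i = 0; i < 8; ++i)
    dst[i] = (int32_t)a16[2*i] * b16[2*i] + (int32_t)a16[2*i+1] * b16[2*i+1];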

Now I have two __m256i variables, each holding 32 signed 8-bit integers.

I want to implement the same functionality as _mm256_madd_epi16, but with each int32_t element in the result __m256i being the sum of four products of signed char instead of two products of signed int16_t.

I can do that in a scalar loop:

  alignas(32) uint32_t res[8] = {};   // zero-initialize the accumulators
  for (int i = 0; i < 32; ++i)
      res[i / 4] += _mm256_extract_epi8(a, i) * _mm256_extract_epi8(b, i);
  return _mm256_load_si256((__m256i*)res);

Note that the multiply result is sign-extended to int before adding, and that the _mm256_extract_epi8 helper function (see footnote 1) returns signed __int8. Never mind that the total is uint32_t instead of int32_t; it can't overflow anyway with only four 8x8 => 16-bit products to add.

It looks very ugly, and it doesn't run efficiently unless the compiler does some magic to do it with SIMD instead of compiling it as written to scalar extraction.


Footnote 1: _mm256_extract_epi8 is not an intrinsic. vpextrb only works for the low lane of a 256-bit vector, and this helper function may allow an index that isn't a compile-time constant.

Amor Fati
  • Note that `int` is not a fixed-size integer type. If you want a 32-bit unsigned type use `uint32_t` explicitly. – Some programmer dude Jul 17 '18 at 13:13
  • @Someprogrammerdude: All compilers that implement Intel's intrinsics have 32-bit `int`, so it's not really wrong to make assumptions about `int` in code that uses intrinsics. [`_mm_cvtsi128_si32` (`movd`) and many other intrinsics actually return `int`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4090,1810&text=int%2520_mm_), and others take an `int` arg instead of `int32_t`. By specifying intrinsics this way, Intel basically requires an ABI where `int` is (at least?) 32-bit for them to be supported. All that said, I do typically use `int32_t` if that's what I mean. – Peter Cordes Jul 17 '18 at 13:59
  • @AmorFati: Can you use `pmaddubsw` as a building block for your use-case? It's a vertical-multiply / horizontal-add like `pmaddwd`, but with one input considered unsigned. So it might not be usable for your use-case. – Peter Cordes Jul 17 '18 at 14:02
  • Your sample implementation doesn't make sense. Did you mean `+=` instead of `=`? Because you currently have 3 dead stores before only keeping the result from the 4th byte of each group. Also, you have a strict-aliasing violation using a `uint32_t*` to access an object of a different type. Only `char*` and `__m256i*`/`__m128*` / other `__m... *` pointers can safely alias other objects. – Peter Cordes Jul 17 '18 at 14:05
  • Also note that `extract` needs a compile-time-constant index, so this will only compile if the compiler fully unrolls the loop. – Peter Cordes Jul 17 '18 at 14:08
  • @PeterCordes Great explanation, thanks a lot! The code does compile with clang even if I remove the -O3 flag. – Amor Fati Jul 17 '18 at 14:18
  • I don't see the point of that update with `_mm256_add_epi32`. Are you saying `__m256i sum = _mm256_madd_epi8(foo, bar);` gives a different result from adding to `setzero()`? If that's the case, you did something wrong or your compiler is broken. Or are you just showing the `sum = add(sum, madd_epi8(foo,bar))` that you'd use inside a loop? – Peter Cordes Jul 20 '18 at 02:56
  • You should probably ask a different question about debugging, showing a full [mcve] where you get different results one way vs. the other. Don't clutter up this question with debugging, leave this one about optimizing for one pair of vectors. – Peter Cordes Jul 20 '18 at 03:01

1 Answer


If one of your inputs is known to always be non-negative, you can use pmaddubsw, the 8->16-bit equivalent of pmaddwd. It does signed saturation to 16 bits if the sum overflows, which is possible, so you might need to avoid it if that's a problem for your case.

But otherwise, you can pmaddubsw and then manually sign-extend the 16-bit elements to 32 and add them. Or use pmaddwd against _mm256_set1_epi16(1) to hsum pairs of elements with correct handling of the signs.
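
The manual sign-extend-and-add variant might look something like this (a sketch, assuming a is the input known to be non-negative; the variable names are mine):

__m256i s16 = _mm256_maddubs_epi16(a, b);      // u8 x i8 -> i16, horizontal pairs added (saturating)
__m256i lo  = _mm256_srai_epi32(_mm256_slli_epi32(s16, 16), 16);  // sign-extend even 16-bit elements
__m256i hi  = _mm256_srai_epi32(s16, 16);                         // sign-extend odd 16-bit elements
__m256i quadsum = _mm256_add_epi32(lo, hi);    // four byte-products summed per int32_t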


The obvious solution would be to unpack your input bytes to 16-bit elements with zero or sign-extension. Then you can use pmaddwd twice, and add the results.

If your inputs are coming from memory, loading them with vpmovsxbw might make sense. e.g.

__m256i a = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i*)&arr1[i]));
__m256i b = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i*)&arr2[i]));

But now you have the 4 bytes that you want spread out across two dwords, so you'd have to shuffle the result of one _mm256_madd_epi16(a,b). You could maybe use vphaddd to shuffle and add two 256-bit vectors of products into one 256-bit vector of results you want, but that's a lot of shuffling.

So instead, I think we want to generate two 256-bit vectors from each 256-bit input vector: one with the high byte in each word sign-extended to 16 bits, and the other with the low byte sign-extended. We can do that with 3 shifts for each input:

 __m256i a = _mm256_loadu_si256((const __m256i*)&arr1[i]);
 __m256i b = _mm256_loadu_si256((const __m256i*)&arr2[i]);

 __m256i a_high = _mm256_srai_epi16(a, 8);     // arithmetic right shift sign extends
     // some compilers may only know the less-descriptive _mm256_slli_si256 name for vpslldq
 __m256i a_low =  _mm256_bslli_epi128(a, 1);   // left 1 byte = low to high in each 16-bit element
         a_low =  _mm256_srai_epi16(a_low, 8); // arithmetic right shift sign extends

    // then same for b_low / b_high

 __m256i prod_hi = _mm256_madd_epi16(a_high, b_high);
 __m256i prod_lo = _mm256_madd_epi16(a_low, b_low);

 __m256i quadsum = _mm256_add_epi32(prod_lo, prod_hi);
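
Putting it together, with the b_low / b_high half filled in, a complete helper could look like this (a sketch; the name madd_epi8_avx2 is made up for this example, not an Intel intrinsic):

 static inline __m256i madd_epi8_avx2(__m256i a, __m256i b)
 {
     // sign-extend the high (odd) byte of each 16-bit element
     __m256i a_high = _mm256_srai_epi16(a, 8);
     __m256i b_high = _mm256_srai_epi16(b, 8);
     // move the low (even) byte to the top of each word, then sign-extend it back down
     __m256i a_low  = _mm256_srai_epi16(_mm256_bslli_epi128(a, 1), 8);
     __m256i b_low  = _mm256_srai_epi16(_mm256_bslli_epi128(b, 1), 8);

     __m256i prod_hi = _mm256_madd_epi16(a_high, b_high);
     __m256i prod_lo = _mm256_madd_epi16(a_low,  b_low);
     return _mm256_add_epi32(prod_lo, prod_hi);
 }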

As an alternative to vpslldq by 1 byte, vpsllw by 8 bits (__m256i a_low = _mm256_slli_epi16(a, 8);) is the more "obvious" way to shift low to high within each word, and may be better if the surrounding code bottlenecks on shuffles. But normally it's worse, because this code heavily bottlenecks on shift + vector-integer multiply.

On KNL, you could use AVX512 vprold z,z,i (Agner Fog doesn't show a timing for AVX512 vpslld z,z,i) because it doesn't matter what you shift or shuffle into the low byte of each word; this is just setup for an arithmetic right shift.

Execution port bottlenecks:

Haswell runs vector shifts and vector-integer multiply only on port 0, so this badly bottlenecks on that. (Skylake is better: p0/p1). http://agner.org/optimize/.

We can use a shuffle (port 5) instead of the left shift as setup for an arithmetic right shift. This improves throughput and even reduces latency by reducing resource conflicts.

But we can avoid the shuffle control vector by using vpslldq to do a vector byte shift. It's still an in-lane shuffle (shifting in zeros at the end of each lane), so it still has single-cycle latency. (My first idea was vpshufb with a control vector like 14,14, 12,12, 10,10, ..., then vpalignr, then I remembered that simple old pslldq has an AVX2 version. There are two names for the same instruction. I like _mm256_bslli_epi128 because the b for byte-shift distinguishes it as a shuffle, unlike the within-element bit-shifts. I didn't check which compiler supports what name for the 128-bit or 256-bit versions of the intrinsic.)

This also helps on AMD Ryzen. Vector shifts only run on one execution unit (P2), but shuffles can run on P1 or P2.

I haven't looked at AMD Ryzen execution port conflicts, but I'm pretty sure this won't be worse on any CPU (except KNL Xeon Phi, where AVX2 ops on elements smaller than a dword are all super slow). Shifts and in-lane shuffles are the same number of uops and same latency.

If any elements are known non-negative, sign-extend = zero-extend

Zero-extending is cheaper than manually sign-extending, and avoids port bottlenecks. a_low and/or b_low can be created with _mm256_and_si256(a, _mm256_set1_epi16(0x00ff)).

a_high and/or b_high can be created with a shuffle instead of shift. (pshufb zeros the element when the shuffle-control vector has its high bit set).

 const __m256i pshufb_emulate_srl8 = _mm256_set_epi8(
               0x80,15, 0x80,13, 0x80,11, ...,
               0x80,15, 0x80,13, 0x80,11, ...);

 __m256i a_high = _mm256_shuffle_epi8(a, pshufb_emulate_srl8);  // zero-extend

Shuffle throughput is also limited to 1 per clock on mainstream Intel, so you could bottleneck on shuffles if you go overboard. But at least it's not the same port as the multiply. If only the high bytes are known non-negative, replacing vpsra/lw with vpshufb could help. An unaligned load that puts those high bytes into low-byte positions could be even more helpful, setting up for vpand to create a_low and/or b_low.
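
For example, if both a and b are known non-negative, the whole extension step collapses to an AND plus a shuffle (a sketch reusing the pshufb_emulate_srl8 constant from above):

 __m256i a_low  = _mm256_and_si256(a, _mm256_set1_epi16(0x00ff));  // zero-extend low bytes
 __m256i b_low  = _mm256_and_si256(b, _mm256_set1_epi16(0x00ff));
 __m256i a_high = _mm256_shuffle_epi8(a, pshufb_emulate_srl8);     // zero-extend high bytes
 __m256i b_high = _mm256_shuffle_epi8(b, pshufb_emulate_srl8);
 __m256i quadsum = _mm256_add_epi32(_mm256_madd_epi16(a_low,  b_low),
                                    _mm256_madd_epi16(a_high, b_high));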


pmaddubsw: I think this is usable if at least one input is non-negative (and thus can be treated as unsigned)

It treats one input as signed and the other as unsigned, and does i8 x u8 => i16, then adds horizontal pairs to make 16-bit integers (with signed saturation, because the sums can overflow; this might also rule it out for your use-case).

But possibly just use it and then add horizontal pairs with pmaddwd against a constant 1:

__m256i sum16 = _mm256_maddubs_epi16(a, b);
__m256i sum32 = _mm256_madd_epi16(sum16, _mm256_set1_epi16(1));

(pmaddwd for a horizontal 16=>32-bit sum is maybe higher latency than shift / and / add, but does treat everything as signed. And it's only a single uop so it's good for throughput.)

Peter Cordes
  • I have a lot to learn, but I get your point. Cool solution! Thanks again! – Amor Fati Jul 17 '18 at 15:01
  • @AmorFati: I thought of some optimizations to reduce bottlenecks on port 0 on Intel, or port 2 on Ryzen. If you weren't bottlenecking on memory, or something in the surrounding code, then my original version was heavily bottlenecked on shift and multiply uops, especially on Haswell. (Not so bad on Skylake.) – Peter Cordes Jul 18 '18 at 00:39
  • Thanks for your detailed answer, it really helps me a lot. The references like Agner's blog and the other links are also helpful. – Amor Fati Jul 18 '18 at 03:00
  • I have another problem. I want to implement a special `_mm256_packs_epi32` which takes 4 `__m256i` as input, and then converts every int32_t in the input to a char. I found some intrinsics like `_mm256_mask_permutex2var_epi8` that may help, but I can only use AVX2. Any tips? Thanks again. – Amor Fati Jul 18 '18 at 03:13
  • 1
  • @AmorFati: With AVX512 you'd want `_mm256_cvtepi32_epi8` to truncate, or [`_mm512_cvtsepi32_epi8` to pack with signed saturation](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4090,2293,1764&text=vpmovsdb). Which do you want, truncation or saturation? If you were looking at shuffles to grab the low byte, that would only give you truncation, and that's easy: `_mm256_shuffle_epi8` to get the data you want into 4 bytes at the bottom of each 128-bit lane, then `vpunpckldq` with another vector. Or with 4 different byte-shuffle masks, `vpblendd` is cheaper than unpack. – Peter Cordes Jul 18 '18 at 03:23
  • 1
  • That will interleave your data, so use one lane-crossing `vpermd` at the end to rearrange 4-byte chunks across lanes. If you actually want signed saturation, then use `vpackssdw` on 2 pairs of inputs, then `vpacksswb`, then `vpermq` to rearrange after the in-lane packs. See [SSE intrinsics: Convert 32-bit floats to UNSIGNED 8-bit integers](https://stackoverflow.com/a/34076778) for more about packing, but that's only for 128-bit vectors and with unsigned. Still, understanding how that works will explain how to use `vpackssdw` and so on. – Peter Cordes Jul 18 '18 at 03:26
  • Hey Peter, I just found that _mm256_srli_epi16 is shifting in zeros, is it a typo? – Amor Fati Jul 20 '18 at 02:28
  • @AmorFati: Oops, yes, I meant `_mm256_srai_epi16` (arithmetic, not logical). Thanks for reporting that bug so I could fix the answer. – Peter Cordes Jul 20 '18 at 02:33
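
For reference, a minimal sketch of the 4x int32 -> int8 saturating pack discussed in the comments above (variable names are mine; the final cross-lane fix-up uses vpermd here, which is one way to do the rearrangement described):

 __m256i pack4_epi32_to_epi8(__m256i v0, __m256i v1, __m256i v2, __m256i v3)
 {
     __m256i w01 = _mm256_packs_epi32(v0, v1);   // vpackssdw: 32 -> 16 with signed saturation, in-lane
     __m256i w23 = _mm256_packs_epi32(v2, v3);
     __m256i b   = _mm256_packs_epi16(w01, w23); // vpacksswb: 16 -> 8 with signed saturation, in-lane
     // dwords are now ordered v0lo,v1lo,v2lo,v3lo, v0hi,v1hi,v2hi,v3hi; put them back in element order
     return _mm256_permutevar8x32_epi32(b, _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7));  // vpermd
 }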