
Is there a way to get the sum of the values stored in a __m256d variable? I have this code.

acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec));
//acc in this point contains {2.0, 8.0, 18.0, 32.0}
acc = _mm256_hadd_pd(acc, acc);
result[i] = ((double*)&acc)[0] + ((double*)&acc)[2];

This code works, but I want to replace the pointer-casting extraction with SSE/AVX instructions.

Peter
  • My answer on the linked duplicate doesn't actually have an AVX `_pd` version, but it does show how to extract the high 128 for a `__m256` vector. And it explains the general idea for horizontal reductions of narrowing by half every step. – Peter Cordes Apr 20 '18 at 12:54
  • BTW, consider producing 4 results for `result[i + 0..3]` in parallel, if convenient, to avoid needing a horizontal sum for every element. – Peter Cordes Apr 20 '18 at 12:56
  • Have re-opened as I'd just prepared a solution for doubles that I wanted to post as an answer. – Paul R Apr 20 '18 at 12:58
  • Usual technique: write scalar code, compile, read the generated asm (in this case, gcc generates vhaddpd+vperm2f128+vaddpd). – Marc Glisse Apr 20 '18 at 13:09
  • @MarcGlisse: What about with `-mtune=znver1`? Does that get it to use `vextractf128` instead of `vpermf128`? Seems like a weird choice in general, even if tuning for an Intel CPU, let alone `tune=generic`. – Peter Cordes Apr 20 '18 at 13:18
  • In my test the data is in memory at the beginning, instead of in a register. Tuning for znver1 then either doesn't vectorize, or vectorizes for 128 bits (which actually looks nicer than the 256 bit code). I didn't immediately find a version that would produce vextractf128. – Marc Glisse Apr 20 '18 at 14:58

2 Answers


It appears that you're doing a horizontal sum for every element of an output array. (Perhaps as part of a matmul?) This is usually sub-optimal; try to vectorize over the 2nd-from-inner loop so you can produce result[i + 0..3] in a vector and not need a horizontal sum at all.

For horizontal reductions in general, see Fastest way to do horizontal SSE vector sum (or other reduction): extract the high half and add to the low half. Repeat until you're down to 1 element.

If you're using this inside an inner loop, you definitely don't want to be using hadd(same,same). That costs 2 shuffle uops instead of 1, unless your compiler saves you from yourself. (And gcc/clang don't.) hadd is good for code-size but pretty much nothing else when you only have 1 vector. It can be useful and efficient with two different inputs.


For AVX, this means the only 256-bit operation we need is an extract, which is fast on AMD and Intel. Then the rest is all 128-bit:

#include <immintrin.h>

inline
double hsum_double_avx(__m256d v) {
    __m128d vlow  = _mm256_castpd256_pd128(v);
    __m128d vhigh = _mm256_extractf128_pd(v, 1); // high 128
            vlow  = _mm_add_pd(vlow, vhigh);     // reduce down to 128

    __m128d high64 = _mm_unpackhi_pd(vlow, vlow);
    return  _mm_cvtsd_f64(_mm_add_sd(vlow, high64));  // reduce to scalar
}

If you wanted the result broadcast to every element of a __m256d, you'd use vshufpd and vperm2f128 to swap high/low halves (if tuning for Intel). And use 256-bit FP add the whole time. If you cared about early Ryzen at all, you might reduce to 128, use _mm_shuffle_pd to swap, then vinsertf128 to get a 256-bit vector. Or with AVX2, vbroadcastsd on the final result of this. But that would be slower on Intel than staying 256-bit the whole time while still avoiding vhaddpd.

Compiled with gcc7.3 -O3 -march=haswell on the Godbolt compiler explorer:

    vmovapd         xmm1, xmm0               # silly compiler, vextract to xmm1 instead
    vextractf128    xmm0, ymm0, 0x1
    vaddpd          xmm0, xmm1, xmm0
    vunpckhpd       xmm1, xmm0, xmm0         # no wasted code bytes on an immediate for vpermilpd or vshufpd or anything
    vaddsd          xmm0, xmm0, xmm1         # scalar means we never raise FP exceptions for results we don't use
    vzeroupper
    ret

After inlining (which you definitely want), vzeroupper sinks to the bottom of the whole function, and hopefully the vmovapd optimizes away, with vextractf128 into a different register instead of destroying xmm0, which holds the _mm256_castpd256_pd128 result.


On first-gen Ryzen (Zen 1 / 1+), according to Agner Fog's instruction tables, vextractf128 is 1 uop with 1c latency, and 0.33c throughput.

@PaulR's version is unfortunately terrible on AMD before Zen 2; it's like something you might find in an Intel library or compiler output as a "cripple AMD" function. (I don't think Paul did that on purpose, I'm just pointing out how ignoring AMD CPUs can lead to code that runs slower on them.)

On Zen 1, vperm2f128 is 8 uops, 3c latency, and one per 3c throughput. vhaddpd ymm is 8 uops (vs. the 6 you might expect), 7c latency, one per 3c throughput. Agner says it's a "mixed domain" instruction. And 256-bit ops always take at least 2 uops.

     # Paul's version                      # Ryzen      # Skylake
    vhaddpd       ymm0, ymm0, ymm0         # 8 uops     # 3 uops
    vperm2f128    ymm1, ymm0, ymm0, 49     # 8 uops     # 1 uop
    vaddpd        ymm0, ymm0, ymm1         # 2 uops     # 1 uop
                           # total uops:   # 18         # 5

vs.

     # my version with vmovapd optimized out: extract to a different reg
    vextractf128    xmm1, ymm0, 0x1        # 1 uop      # 1 uop
    vaddpd          xmm0, xmm1, xmm0       # 1 uop      # 1 uop
    vunpckhpd       xmm1, xmm0, xmm0       # 1 uop      # 1 uop
    vaddsd          xmm0, xmm0, xmm1       # 1 uop      # 1 uop
                           # total uops:   # 4          # 4

Total uop throughput is often the bottleneck in code with a mix of loads, stores, and ALU, so I expect the 4-uop version is likely to be at least a little better on Intel, as well as much better on AMD. It should also make slightly less heat, and thus allow slightly higher turbo / use less battery power. (But hopefully this hsum is a small enough part of your total loop that this is negligible!)

The latency is not worse, either, so there's really no reason to use an inefficient hadd / vperm2f128 version.


Zen 2 and later have 256-bit wide vector registers and execution units (including shuffle). They don't have to split lane-crossing shuffles into many uops, but conversely vextractf128 is no longer about as cheap as vmovdqa xmm. Zen 2 is a lot closer to Intel's cost model for 256-bit vectors.

Peter Cordes

You can do it like this:

acc = _mm256_hadd_pd(acc, acc);    // horizontal add top lane and bottom lane
acc = _mm256_add_pd(acc, _mm256_permute2f128_pd(acc, acc, 0x31));  // add lanes
result[i] = _mm256_cvtsd_f64(acc); // extract double

Note: if this is in a "hot" (i.e. performance-critical) part of your code (especially if running on an AMD CPU) then you might instead want to look at Peter Cordes's answer regarding more efficient implementations.

Paul R
  • On Ryzen this is *terrible*, BTW. permutef128 is *much* slower than extract, and there's no reason to do any 256-bit vector operations for this because you want a single scalar result and don't need the result broadcast to every element. On Intel the only sub-optimal part is `hadd`. – Peter Cordes Apr 20 '18 at 13:13
  • This is exactly what I looked for! Thanks! – Peter Apr 20 '18 at 13:14
  • Ah - don't know anything about AMD CPUs - what do you suggest as an alternative ? – Paul R Apr 20 '18 at 13:14
  • What I said in comments and in [Fastest way to do horizontal float vector sum on x86](//stackoverflow.com/q/6996764): extract down to 128, then do a 128-bit hsum the SSE2 way: with a shuffle + add. Not worse for Intel, and much better for AMD. `vextractf128` is nearly free on AMD, according to Agner Fog's numbers: it's not even a shuffle because vectors are already split in half internally. – Peter Cordes Apr 20 '18 at 13:15
  • Your version on Ryzen: 18 uops. My version: 4. It's a surprisingly big difference if you haven't thought about CPUs that split 256b ops to 128. Mine saves a uop on Intel as well from avoiding `hadd`. (AMD has worse `hadd` than I'd remembered; apparently it's a "mixed domain" instruction: integer / fp. Like they don't even care much about making it fast or something, or at least the 256-bit version which is probably more rarely useful because of its in-lane behaviour. This case doesn't count as useful because you can do the same thing faster with a separate single shuffle.) – Peter Cordes Apr 20 '18 at 14:09
  • Thanks Peter - useful to know - we don't currently do anything with AMD CPUs but I have to keep an eye on them as the technology evolves - they do seem to be catching up with Intel again gradually. – Paul R Apr 20 '18 at 14:21
  • Yeah, for your own use you don't need to care about AMD if you don't use them. But for SO answers, it's definitely a good idea to write code that's efficient on AMD as well, especially when it's equal on Intel. I think it's healthier for the industry if Intel continues to have some competition, and having random pieces of software that copy code from SO answers not gimp themselves on AMD CPUs helps with that. And the people that copy your code will probably be happier if their code runs well on more CPUs :P Fortunately, reducing down to 128 first is good and easy to remember. – Peter Cordes Apr 21 '18 at 02:47