18

I have a __m256d vector packed with four 64-bit floating-point values.
I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value;

My attempts all ended up using a lot of shuffling of the vector elements, making the code not very elegant nor efficient. Also, I found it impossible to stay only in the AVX domain. At some point I had to use SSE 128-bit instructions to extract the final 64-bit value. However, I would like to be proved wrong on this last statement.

So the ideal solution will:
1) only use only AVX instructions.
2) minimize the number of instructions. (I am hoping for no more than 3-4 instructions)

Having said that, any elegant/efficient solution will be accepted, even if it doesn't adhere to the above guidelines.

Thanks for any help.

-Luigi

Luigi Castelli
  • 634
  • 1
  • 6
  • 13
  • 3
    That's a tough one... Are you doing this with only 1 vector? Or do you have many vectors for which you need to find the max? You can (fairly) efficient do 4 of these in parallel with a 4 x 4 vector transpose... – Mysticial Mar 20 '12 at 22:28
  • @Mysticial: Well... I am dealing with many vectors. However the simplicity of the processing doesn't justify two 4x4 transpose operations for every iteration. So I am processing everything "horizontally" without transposition. I am getting a great speed-up that way, close to 4x, because I am avoiding the overhead of transposing. Everything is in a tight loop manually unrolled 4 times. However, when the loop is over I am left with one last AVX vector. I have to find the greatest of its four elements in order to store the result back into my double-precision scalar value. Hence my question... – Luigi Castelli Mar 20 '12 at 22:56
  • If it isn't in the "tight loop", is it even performance critical? – Mysticial Mar 20 '12 at 22:59
  • This time, not really... :) but I know I will run into a situation where it will be performance critical. That's why I formulated the question the way I did... – Luigi Castelli Mar 20 '12 at 23:01
  • 1
    Ah :) In that case, the best way to do this would probably be highly specific to how it's used. In other words, it's not vectorizable at this level, but can you push it to a higher level... – Mysticial Mar 20 '12 at 23:09
  • What do you mean by "pushing it to a higher level" ? – Luigi Castelli Mar 20 '12 at 23:13
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/9115/discussion-between-luigi-castelli-and-mysticial) – Luigi Castelli Mar 20 '12 at 23:18
  • 2
    Note that you can stay in the AVX domain while using 128 bit instructions. There are actually 3 kind of instructions: AVX256, AVX128 and legacy SSE128. A switch between the first two and the latter is to be avoided, its costly on Intel (not so on AMD), but the first two can be mixed almost freely (you may have to insert `vzeroupper` sometimes) – Gunther Piez Mar 21 '12 at 09:09

3 Answers3

22

I don't think you can do much better than 4 instructions: 2 shuffles and 2 comparisons.

__m256d x = ...; // input

__m128d y = _mm256_extractf128_pd(x, 1); // extract x[2], and x[3]
__m128d m1 = _mm_max_pd(x, y); // m1[0] = max(x[0], x[2]), m1[1] = max(x[1], x[3])
__m128d m2 = _mm_permute_pd(m1, 1); // set m2[0] = m1[1], m2[1] = m1[0]
__m128d m = _mm_max_pd(m1, m2); // both m[0] and m[1] contain the horizontal max(x[0], x[1], x[2], x[3])

Trivial modification to only work with 256-bit vectors:

__m256d x = ...; // input

__m256d y = _mm256_permute2f128_pd(x, x, 1); // permute 128-bit values
__m256d m1 = _mm256_max_pd(x, y); // m1[0] = max(x[0], x[2]), m1[1] = max(x[1], x[3]), etc.
__m256d m2 = _mm256_permute_pd(m1, 5); // set m2[0] = m1[1], m2[1] = m1[0], etc.
__m256d m = _mm256_max_pd(m1, m2); // all m[0] ... m[3] contain the horizontal max(x[0], x[1], x[2], x[3])

(untested)

Norbert P.
  • 2,737
  • 1
  • 17
  • 21
  • Yes, agreed... Good solution. Thanks. – Luigi Castelli Mar 21 '12 at 08:41
  • The all-256 version is good on Intel CPUs if you need the result broadcasted, but it's much slower on Ryzen. See [Get sum of values stored in \_\_m256d with SSE/AVX](https://stackoverflow.com/a/49943540). (And BTW, `_mm_unpackhi_pd` is 2 bytes shorter than `_mm_permute_pd`, so use that if you only want a scalar result. No immediate needed, and can use a 2-byte VEX prefix.) – Peter Cordes May 05 '18 at 13:37
12

The general way of doing this for a vector v1 = [A, B, C, D] is

  1. Permute v1 to v2 = [C, D, A, B] (swap 0th and 2nd elements, and 1st and 3rd ones)
  2. Take the max; i.e. v3 = max(v1,v2). You now have [max(A,C), max(B,D), max(A,C), max(B,D)]
  3. Permute v3 to v4, swapping the 0th and 1st elements, and the 2nd and 3rd ones.
  4. Take the max again, i.e. v5 = max(v3,v4). Now v5 contains the horizontal max in all of its components.

Specifically for AVX, the permutations can be done with _mm256_permute_pd and the maximums can be done with _mm256_max_pd. I don't have the exact permute masks handy but they should be pretty straightforward to figure out.

Hope that helps.

celion
  • 3,704
  • 24
  • 17
  • 1
    I especially like your solution, because so far it is the only one that uses AVX instructions exclusively, without ever leaving the 256-bit domain. Thanks. – Luigi Castelli Mar 21 '12 at 08:12
  • 1
    sorry, I talked too soon... You can't do that with AVX. Most AVX operations do not cross the 128-bit boundary. So in this case you cannot swap the 0th and 2nd elements and the 1st and 3rd. The AVX permute operation only allows you to swap the 0th and 1st elements or the 2nd and 3rd. – Luigi Castelli Mar 21 '12 at 08:23
  • @LuigiCastelli: my solution can be written so as to never leave the 256-bit domain, if you want. Replace `_mm256_extractf128_pd` by `_mm256_permute2f128_pd(x, x, 1)`, `__m128d` by `__m256d`, and `_mm_...` by `_mm256_...`, `_mm_permute_pd(m1, 1)` by `_mm256_permute_pd(m1, 5)`. – Norbert P. Mar 21 '12 at 08:43
3
//Use the code to find the horizontal maximum
__m256 v1 = initial_vector;//example v1=[1 2 3 4 5 6 7 8]
__m256 v2 = _mm256_permute_ps(v1,(int)147);//147 is control code for rotate left by upper 4 elements and lower 4 elements separately v2=[2 3 4 1 6 7 8 5]
__m256 v3 = _mm256_max_ps(v1,v2);//v3=[2 3 4 4 6 7 8 8]
__m256 v4 = _mm256_permute_ps(v3,(int)147);//v4=[3 4 4 2 7 8 8 6]
__m256 v5 = _mm256_max_ps(v3,v4);//v5=[3 4 4 4 7 8 8 8]
__m256 v6 = _mm256_permute_ps(v5,(int)147);//v6=[4 4 4 3 8 8 8 7]
__m256 v7 = _mm256_max_ps(v5,v6);//contains max of upper four elements and lower 4 elements. v7=[4 4 4 4 8 8 8 8]

//to get max of this horizontal array. Note that the highest end of either upper or lower can contain the maximum
float ALIGN max_array[8];
float horizontal_max;
_mm256_store_ps(max_array, v7);
if(max_array[3] > max_array[7])
{
    horizontal_max = max_array[3];
}
else
{
    horizontal_max = max_array[7];
}
joyx
  • 77
  • 4
  • 1
    It will take one extra step for float vectors, but storing to an array and doing a scalar comparison isn't one of the steps. You still want to start with an `extractf128` / 128bit `maxps`. Doing in-lane stuff first isn't any better on Intel CPUs, and definitely worse on AMD CPUs where 256b AVX ops are twice as expensive as 128b AVX ops. Either way, a 256b store and then two loads -> a scalar compare is just silly, and slower than an `extractf128`. – Peter Cordes Jan 21 '16 at 03:41