On x86, I have a __m128 variable holding four uint32_t values. I want to use SIMD to quickly count how many of those values are zero (or non-zero). Both SSE and AVX are acceptable. How can I do this?

dy66
  • 1
    Which platform? X86? Which extensions allowed? SSE version? AVX? – Sebastian Feb 08 '22 at 08:28
  • 1
    `fast_horizontal_sum(_mm_cmpeq_epi32(var, _mm_setzero_si128()))` https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction – Aki Suihkonen Feb 08 '22 at 08:37
  • 4
    Do you want to perform this operation once or multiple times? I.e. do you have more than one __m128 and want to count the overall sum? Then there could be additional optimizations. – Sebastian Feb 08 '22 at 08:43
  • 2
    If you have many vectors, you can use `totals -= cmp(v, 0)` like in [How to count character occurrences using SIMD](https://stackoverflow.com/q/54541129), but without the complication of narrow elements that overflow quickly. You'll probably want to use `__m128i` with `_mm_cmpeq_epi32` for `uint32_t` data; `_mm_cmpeq_ps(__m128, __m128)` would find != 0.0 for bit-patterns that represent NaN. – Peter Cordes Feb 08 '22 at 10:01
  • 3
    If it is indeed just one register, something like `zeros = _mm_popcnt_u32(_mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(var, _mm_setzero_si128()))))` should work (or `nonzeros = 4 - zeros`). – chtz Feb 08 '22 at 10:45
  • 1
    In fact, in my application I will perform this operation on two different __m128 variables and compare the two results immediately. – dy66 Feb 08 '22 at 12:51
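
The suggestions in the comments above can be put together roughly as follows. This is only a minimal sketch, not a benchmarked answer. The function names (`count_zero_lanes`, `count_nonzero_lanes`, `count_zeros_in_array`) are hypothetical, and the data is assumed to be held as `__m128i` rather than `__m128`, as Peter Cordes recommends. To stay within plain SSE2, the sketch sums the four mask bits by hand instead of calling `_mm_popcnt_u32` as in chtz's comment; with POPCNT available, `_mm_popcnt_u32(mask)` would be the direct replacement.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Count the 32-bit lanes of v that are zero (single-register case). */
static int count_zero_lanes(__m128i v)
{
    /* Lanes equal to zero become all-ones, other lanes become zero. */
    __m128i is_zero = _mm_cmpeq_epi32(v, _mm_setzero_si128());
    /* Collect the top bit of each 32-bit lane into a 4-bit mask. */
    int mask = _mm_movemask_ps(_mm_castsi128_ps(is_zero));
    /* Sum the 4 mask bits (assumption: used here in place of _mm_popcnt_u32). */
    return (mask & 1) + ((mask >> 1) & 1) + ((mask >> 2) & 1) + ((mask >> 3) & 1);
}

/* Non-zero lanes are simply the remaining ones. */
static int count_nonzero_lanes(__m128i v)
{
    return 4 - count_zero_lanes(v);
}

/* Many-vector case: totals -= cmp(v, 0), with one horizontal sum at the end.
   data must hold 4 * n_vectors elements. */
static uint32_t count_zeros_in_array(const uint32_t *data, size_t n_vectors)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i totals = _mm_setzero_si128();
    for (size_t i = 0; i < n_vectors; ++i) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + 4 * i));
        /* A matching lane is -1, so subtracting it adds 1 to that lane's count. */
        totals = _mm_sub_epi32(totals, _mm_cmpeq_epi32(v, zero));
    }
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, totals);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

For the two-vector scenario described in the last comment, comparing `count_zero_lanes(a) == count_zero_lanes(b)` would be one way to use the single-register helper.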

0 Answers