
Here's an easy example. Say I'm only interested in the upper-case letters 'A' to 'Z', and I want to stop after hitting the first invalid character. An example input string is "ABC\nDummy Text". I can easily make a mask by doing a loadu, subtracting 'A', then simulating a cmple with _mm256_max_epu8 and _mm256_cmpeq_epi8. That gives me the following:

A  B  C  \n D  u  m  m  y     T  e  x  t
FF FF FF 00 FF 00 00 00 00 00 FF 00 00 00

How do I clear all the FF's after the first 00?
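
For reference, the mask-building step described above might look roughly like the sketch below. It assumes the 128-bit SSE2 forms of the intrinsics (`_mm_max_epu8`, `_mm_cmpeq_epi8`) rather than the `_mm256_` ones so that it builds without extra compiler flags; the names `upper_mask` and `upper_mask_bytes` are made up for illustration and are not from the question.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Returns 0xFF in each byte lane holding 'A'..'Z', and 0x00 elsewhere. */
static __m128i upper_mask(__m128i chars)
{
    const __m128i limit = _mm_set1_epi8('Z' - 'A');            /* 25 */
    /* 'A'..'Z' map to 0..25; every other byte wraps to an unsigned value > 25. */
    __m128i off = _mm_sub_epi8(chars, _mm_set1_epi8('A'));
    /* Unsigned off <= 25 is equivalent to max(off, 25) == 25. */
    return _mm_cmpeq_epi8(_mm_max_epu8(off, limit), limit);
}

/* Hypothetical wrapper: reads 16 bytes at src, writes the 16 mask bytes to out. */
static void upper_mask_bytes(const uint8_t *src, uint8_t *out)
{
    assert(src && out);
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)out, upper_mask(v));
}
```

Fed the 14 bytes of "ABC\nDummy Text" (zero-padded to 16), this should reproduce the FF/00 row shown above; the padding bytes come out as 00 as well.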

  • Near duplicate: [Is there a way to mask one end of a \_\_m128i register based on mask length that is not known at compile time?](https://stackoverflow.com/q/65186226) but the answers there seem inefficient vs. a table lookup with a sliding window into `..., -1, -1, 0, 0, ...` bytes like in [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933) (but for bytes instead of dwords) – Peter Cordes Sep 05 '21 at 02:46
  • Also related: [How to most efficiently store a part of \_\_m128i/\_\_m256i, while ignoring some number of elements from the beginning/end](https://stackoverflow.com/q/62183557) but that's about variable-width stores, not just zeroing high elements. – Peter Cordes Sep 05 '21 at 02:48
  • @PeterCordes You're suggesting there's no built-in way to do this!?! WTF!?!? I imagine this is something extremely common :| Maybe I can do something like +1 and a trailing zero count, then figure it out from there. No idea how well that'll work. – Eric Stotch Sep 05 '21 at 02:51
  • Huh? Are you talking about a 128-bit integer +1 operation? SIMD can't do that. If you mean `_tzcnt_u32( 1 + _mm_movemask_epi8(vcmp) )` to find the position of the first zero, then yeah you can do that. I was picturing `_tzcnt_u32( ~_mm_movemask_epi8(vcmp) )` which will give the same result, and then use that as a byte offset to load an AND mask. IDK what you were picturing doing with your tzcnt result. – Peter Cordes Sep 05 '21 at 02:59
  • What exactly do you want the vector result for? Perhaps there's something easier you can do. – Peter Cordes Sep 05 '21 at 03:05
  • @PeterCordes I was thinking about that and looked up the inverse of movemask. I saw your long answer https://stackoverflow.com/questions/65186226/is-there-a-way-to-mask-one-end-of-a-m128i-register-based-on-mask-length-that-i What I did was `a = _mm256_add_epi64(in, _mm256_set1_epi64x(1));` `_mm256_cmpeq_epi8(a, _mm256_setzero_si256());`. It certainly gave me my mask, but only within each 64-bit region. Now I'm wondering whether there's an easy way to clear bits or whether I want to do the movemask inverse. – Eric Stotch Sep 05 '21 at 03:06
  • Also, `cmple(x,y) = ~cmpgt(x,y)`, so you could maybe do that by inverting instead of max / eq. That might save an inversion step somewhere in implementing this. Or not since you're using `max_epu8` unsigned, so you're actually implementing `cmpbe` (below or equal), not signed less-or-equal. – Peter Cordes Sep 05 '21 at 03:08
  • I basically want to zero out the rest of the string to do my other operations. My learning example is a simple SIMD parseint that only supports base 10. So I guess `atol` but in SIMD – Eric Stotch Sep 05 '21 at 03:09
  • Thinking more about it: I'm probably going to convert each letter into 16-bit pairs, then 32-bit pairs. Things 64 bits away probably won't affect my calculations. I think just being able to clear the upper bytes on a 64-bit basis could work. I just need to write it to confirm, but I highly suspect it will work. – Eric Stotch Sep 05 '21 at 03:13
  • "(below or equal), not signed less-or-equal" – Eric Stotch Sep 05 '21 at 03:19
  • See [How to implement atoi using SIMD?](https://stackoverflow.com/a/35132718) for a complete solution. It handles the zeroing along with reversing into LSD-first order, with a lookup table for `pshufb` masks. – Peter Cordes Sep 05 '21 at 03:21
  • BE is the x86 asm condition name, like for `jb` / `jbe` https://www.felixcloutier.com/x86/jcc. Other ISAs have other names for signed vs. unsigned conditions. AVX-512 `vpcmpub` intrinsics (https://www.felixcloutier.com/x86/vpcmpb:vpcmpub) apparently do use `_mm512_cmp[eq|ge|gt|le|lt|neq]_epu8_mask`, not be / ae, so apparently the convention for intrinsics (as far as one exists) doesn't match asm. – Peter Cordes Sep 05 '21 at 03:25
  • @PeterCordes I'm so tempted to look at that implementation. Mine is almost done so I'll hold off and see how different they are. Thanks for the link – Eric Stotch Sep 05 '21 at 03:35
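
The approach suggested in the comments (movemask, count trailing ones via tzcnt of the inverted mask, then load an AND mask from a sliding window over `-1, ..., -1, 0, ..., 0` bytes) could be sketched roughly as follows. This is only an illustration under some assumptions: it uses 128-bit SSE2 vectors, it substitutes the GCC/Clang builtin `__builtin_ctz` for `_tzcnt_u32` so that it doesn't need BMI1, and the names `window`, `first_zero_lane` and `keep_leading_run` are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* 16 bytes of 0xFF followed by 16 bytes of 0x00. An unaligned load at
   offset (16 - n) yields n leading 0xFF bytes, then 16 - n zero bytes. */
static const uint8_t window[32] = {
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
    /* remaining 16 bytes are implicitly zero */
};

/* Index of the first 0x00 lane in a compare result (16 if there is none). */
static unsigned first_zero_lane(__m128i vcmp)
{
    /* One bit per byte lane; bits 16..31 of the result are zero. */
    unsigned bits = (unsigned)_mm_movemask_epi8(vcmp);
    /* ~bits always has bits 16..31 set, so ctz is well defined here. */
    return (unsigned)__builtin_ctz(~bits);
}

/* Clears every 0xFF lane that comes after the first 0x00 lane. */
static __m128i keep_leading_run(__m128i vcmp)
{
    unsigned n = first_zero_lane(vcmp);
    assert(n <= 16);
    __m128i keep = _mm_loadu_si128((const __m128i *)(window + 16 - n));
    return _mm_and_si128(vcmp, keep);
}
```

Applied to the example mask from the question, only the first three FF bytes would survive. A 256-bit version would presumably use `_mm256_movemask_epi8` and a 64-byte window in the same way, but that is untested here.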
