The following should be done in x86 assembly language (up to SSE4). Let's say I have a 128-bit XMM register:

xmm0: [d, c, b, a]

And I want another register that looks like this:

xmm1: [-,-, c, d]

For xmm1, the high 64 bits are not important. The ultimate goal is to have

[-,-,b+c,a-d]

I can then do this with addsub. Are there any useful instructions up to SSE4 to do this efficiently? How can I do this at all? My idea is to use some shift operations, but I couldn't find anything useful in the Intel Developer Manual.

  • https://stackoverflow.com/q/56407741/11683? – GSerg Jan 11 '21 at 09:51
  • What intrinsics have you tried so far? You know what I mean by intrinsics... or you don't? – Алексей Неудачин Jan 11 '21 at 09:58
  • My answer on your previous question, [Expand the lower two 32-bit floats of an xmm register to the whole xmm register](https://stackoverflow.com/a/65642003), shows how to get a compiler to come up with a shuffle for you, so you don't have to ask a separate question for every specific shuffle. It also points out the existence of `shufps` and `pshufd`, which can do arbitrary dword shuffles. – Peter Cordes Jan 11 '21 at 10:06
  • `_mm_shuffle_epi32` , `_mm_cvtepu32_epi64` , `_mm_shuffle_epi32` – Алексей Неудачин Jan 11 '21 at 10:21
  • @PeterCordes It's not really a duplicate. I do not see any code at all in the link you posted, and this one has nothing to do with zero-extension. – Алексей Неудачин Jan 11 '21 at 19:21
  • https://github.com/alexeyneu/BlockZero/blob/master/newone/x.asm – Алексей Неудачин Jan 11 '21 at 19:29
  • @АлексейНеудачин: The code I was talking about in my linked answer is `_mm_set_ps(v[1], v[1], v[0], v[0])` and the Godbolt link. Change those numbers to anything you want for an arbitrary shuffle and see what the compiler does. (`shufps` with some constant, if no special-case shorter instruction exists.) – Peter Cordes Jan 12 '21 at 02:38
  • @PeterCordes But it is C, while the question has the assembly tag. – Алексей Неудачин Jan 12 '21 at 09:47
  • @АлексейНеудачин: yes, it's a method for using a C compiler to answer this question and any like it. C compilers emit asm which you can look at, and this is a way to give them input that shows what element you want where. – Peter Cordes Jan 12 '21 at 09:55
  • @PeterCordes It's an `imm8`, so it should be done in the NASM preprocessor. Open it, really. – Алексей Неудачин Jan 12 '21 at 09:58
  • @АлексейНеудачин: You're assuming that the answer should be a NASM macro like `_MM_SHUFFLE` that you use with `shufps` or `pshufd`? But some shuffles can be done with other special-case instructions like `unpckhps`, `movhlps`, or whatever. This is why using a C compiler is nice. If you want to write a general-case answer with a NASM macro, it would fit almost as well on the other question. Or if you want to write a self-answered Q&A about how to do any shuffle, that could work, too. This question isn't a good fit for that because it's about one specific shuffle. – Peter Cordes Jan 12 '21 at 10:04
  • @PeterCordes So you just admitted that what you posted is nothing like an answer. Someone can come up with `00010011b` only if they know the `imm8` format there. – Алексей Неудачин Jan 12 '21 at 10:53
  • @АлексейНеудачин: I was never suggesting that the linked answer directly explains how to write args for `pshufd`. The linked answer tells you how to ask a computer instead of other humans for the answer to this. Writing the immediate in hex or binary is just a matter of translating the number the compiler spits out. (Or looking up how pshufd works in the manual...) – Peter Cordes Jan 12 '21 at 11:02
  • @АлексейНеудачин That makes no sense. Most of the asm in my answers *is* better (or not worse) than MSVC's asm output. When compilers already do a good job, like for trivial / boring questions such as this, we can just look at their output. But regardless, there is a duplicate: [How to reverse an \_\_m128 type variable?](https://stackoverflow.com/q/20051746) even takes the trouble to explain the details of how `_MM_SHUFFLE` encodes the constant for shufps, making it easy to port to asm by hand even without using a compiler. And it's a shuffle that produces the desired low 2 elements. – Peter Cordes Jan 12 '21 at 16:01
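
The comments above circle around how the `imm8` of `pshufd` / `shufps` selects elements. As a rough illustration (not something posted in the thread), here is a scalar C model of that encoding and of the add/subtract step. It assumes the elements are 32-bit floats and that "addsub" in the question means `addsubps` (an SSE3 instruction). The function names `model_pshufd`, `model_addsubps`, and `swap_and_addsub` are hypothetical and exist only for this sketch. Under those assumptions, one possible asm sequence would be `pshufd xmm1, xmm0, 0x1B` followed by `addsubps xmm0, xmm1`. The constant `0x1B` equals `_MM_SHUFFLE(0, 1, 2, 3)`, which reverses all four dwords and therefore also puts `d` and `c` into the low two positions.

```c
/* Scalar model only: arrays stand in for XMM registers.
   Index 0 is the lowest element, so xmm0 = [d, c, b, a] is stored as {a, b, c, d}. */

/* Model of PSHUFD-style selection: each 2-bit field of imm8,
   starting from the low bits, picks the source element for one destination lane. */
void model_pshufd(float dst[4], const float src[4], unsigned imm8)
{
    float tmp[4];   /* temporary so that dst may alias src */
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm8 >> (2 * i)) & 3u];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}

/* Model of ADDSUBPS dst, src: even lanes subtract, odd lanes add. */
void model_addsubps(float dst[4], const float src[4])
{
    dst[0] -= src[0];
    dst[1] += src[1];
    dst[2] -= src[2];
    dst[3] += src[3];
}

/* Hypothetical helper combining both steps for the question's layout. */
void swap_and_addsub(float xmm0[4])
{
    float xmm1[4];
    model_pshufd(xmm1, xmm0, 0x1B);  /* xmm1 = [a, b, c, d] (full reverse) */
    model_addsubps(xmm0, xmm1);      /* xmm0 = [d+a, c-b, b+c, a-d] */
}
```

Only the low two lanes matter for the question, so any `imm8` whose low four bits are `1011b` would work equally well in this model. The full reverse is just a convenient choice.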
