How to decide which x86-64 (SSE) instruction is more effective?

Question

I want to optimize my x86-64 program. How do I decide which instructions are "the best ones"? How does one measure whether one piece of assembly code is faster than another?

Example 1:

xmm0 [ 1 | 2 | 3 | 4 ]
xmm1 [ 0 | x | 0 | 0 ]

I want to move x into the place of 2. I can do

pslldq xmm1, 4                  # like in the picture
shufps xmm0, xmm0, 0x39
movss  xmm0, xmm1
shufps xmm0, xmm0, 0x93

or

blendps  xmm0, xmm1, 0x4

or

insertps xmm0, xmm1, 0x50

or something else. Which one is the fastest / easiest?

Example 2:

I want to have 1.0 in xmm0 without reading from memory. I can do

pcmpeqw xmm0,   xmm0
pslld   xmm0,   25
psrld   xmm0,   2

or

mov   eax, 0x3f800000
movd  xmm0, eax
# shift to the right position ...

or something else. Which one is the fastest / easiest?
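
Before comparing speed, the shift constants in the first variant can be checked with plain integer arithmetic. The sketch below is ordinary C, not SSE code, and the helper names `ones_shift_trick` and `bits_to_float` are made up for illustration. It models a single 32-bit lane of the `pcmpeqw` / `pslld` / `psrld` sequence and reinterprets the result as an IEEE 754 single-precision value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Models one 32-bit lane of:
 *   pcmpeqw xmm0, xmm0 ; pslld xmm0, 25 ; psrld xmm0, 2
 * Hypothetical helper, not part of the original question. */
uint32_t ones_shift_trick(void)
{
    uint32_t lane = 0xFFFFFFFFu; /* comparing a register with itself sets every bit */
    lane <<= 25;                 /* pslld 25: only the top 7 bits remain set */
    lane >>= 2;                  /* psrld 2: logical shift, zeros enter from the top */
    return lane;
}

/* Reinterprets a 32-bit pattern as a float; memcpy avoids aliasing problems. */
float bits_to_float(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The value returned by `ones_shift_trick()` is the same constant that the second variant loads through `eax`, so the two variants differ only in cost, not in result.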

Refer to the Intel Optimisation Manuals or Agner Fog's instruction tables. These give latencies and port usage for each of these instructions, allowing you to compare them. — fuz, Jan 26 '21 at 11:44
@fuz I have seen them, but I do not know how to work with them. My first impression was that the data shown in the tables is processor-specific. — jupiter_jazz, Jan 26 '21 at 11:46
Why can't you time them yourself, running an unrolled loop a few million times with clock() or the rdtsc instruction? — Arthur Kalliokoski, Jan 26 '21 at 11:55
Yes, "what is faster" is processor-specific, no way around that. — Marc Glisse, Jan 26 '21 at 11:59
@ArthurKalliokoski good idea, never thought about that! Is that assembly or C? Right now I make calls in C to extern functions written in .S files. — jupiter_jazz, Jan 26 '21 at 12:02
@Kirill Yes, it's processor-specific, but the processors are very similar. Additionally, instructions performing a similar operation usually have similar performance characteristics. If one of them is slower on a certain processor, all of them are slower there. — fuz, Jan 26 '21 at 12:04
Writing code in C with intrinsics, instead of directly as assembler, is a way to delegate this job to the compiler (which doesn't always do a perfect job, but usually ok). At least you can compare the output with the asm you wrote. — Marc Glisse, Jan 26 '21 at 12:14
@MarcGlisse in this current work I am not allowed to use intrinsics, so the layout has to stay :) I think I will take Arthur's advice and wrap the call in a clock() measurement, timing fragments of code explicitly. — jupiter_jazz, Jan 26 '21 at 12:18
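
The timing approach discussed in these comments could look roughly like the following C sketch. It assumes the routine under test is exposed as a function taking no arguments (for example an extern function implemented in a .S file); the names `kernel_fn`, `empty_kernel` and `ns_per_call` are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Assumed signature of the routine under test; adapt to the real prototype. */
typedef void (*kernel_fn)(void);

/* Does nothing. Timing it gives an estimate of the loop and call overhead. */
void empty_kernel(void)
{
}

/* Returns the average CPU time per call in nanoseconds, measured with clock().
 * Call overhead is included, so only compare numbers obtained the same way. */
double ns_per_call(kernel_fn fn, long iterations)
{
    assert(fn != NULL && iterations > 0);

    for (long i = 0; i < 1000; ++i) /* warm-up so the timed loop starts with warm caches */
        fn();

    clock_t start = clock();
    for (long i = 0; i < iterations; ++i)
        fn();
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iterations;
}
```

For a routine that is only a handful of instructions long, the call itself is likely to dominate the measurement. In that case it may be more informative to put the repeat loop inside the assembly routine and time a single call to it.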
@Kirill Even if you are not allowed to use intrinsics, it can be helpful to prototype the code with intrinsics and then derive your final version from the compiler's output. — fuz, Jan 26 '21 at 16:48
https://uops.info/ - `blendps` is single-uop, 1 cycle latency, and definitely the best choice if you can use SSE4. Fortunately we have some duplicates for the general case of this, such as [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) which links to uops.info and Agner Fog's guides. — Peter Cordes, Jan 26 '21 at 19:13
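
The suggestion in the comments to prototype with intrinsics and then read the compiler's output might look like the sketch below. It rests on assumptions that go beyond the question: x is taken to sit in the lowest element of `b`, and the target is element 1 of `a`, counting from the least significant element. The function names are invented for this example. The SSE4.1 variant is compiled only when the compiler is told that SSE4.1 is available (for instance with `-msse4.1` on GCC or Clang).

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE: _mm_shuffle_ps, _mm_move_ss */
#ifdef __SSE4_1__
#include <smmintrin.h> /* SSE4.1: _mm_insert_ps */
#endif

/* Replaces element 1 of a with element 0 of b using SSE only,
 * following the shufps / movss / shufps idea from the question. */
__m128 replace_lane1_sse(__m128 a, __m128 b)
{
    a = _mm_shuffle_ps(a, a, 0x39);    /* rotate: old element 1 moves to element 0 */
    a = _mm_move_ss(a, b);             /* overwrite element 0 with b[0] */
    return _mm_shuffle_ps(a, a, 0x93); /* rotate back to the original order */
}

#ifdef __SSE4_1__
/* Same result with one insertps: source element 0, destination element 1,
 * no elements zeroed (immediate 0x10). */
__m128 replace_lane1_sse41(__m128 a, __m128 b)
{
    return _mm_insert_ps(a, b, 0x10);
}
#endif
```

Compiling this with optimization and `-S` shows which instructions the compiler picks for each variant, which can then be compared with the hand-written assembly.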
