As @harold says, storing to memory is already covered by MMX movd, or pshufw+movd to extract just the high float.
The one thing you can't do is turn a 3dNow! float into an x87 80-bit float without a store/reload.
What might have been useful is a version of EMMS that expands a 32-bit float into an 80-bit x87 long double in st0, along with setting the FPU back into x87 mode instead of MMX mode¹. Or maybe even do that for multiple mm registers into multiple x87 registers?
i.e. it would be a shortcut for movd dword [esp], mm0 / emms / fld dword [esp] to set up for further scalar FP after a SIMD reduction.
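For concreteness, a minimal NASM-syntax sketch of that round trip (assuming a 3dNow! horizontal add with pfacc just finished a reduction, and that it's fine to reserve 4 bytes of stack):

    pfacc   mm0, mm0            ; both halves of mm0 = low float + high float
    sub     esp, 4              ; scratch space for the store/reload
    movd    dword [esp], mm0    ; store the 32-bit float result
    femms                       ; leave MMX/3dNow! mode (3dNow!'s fast emms)
    fld     dword [esp]         ; reload it as an 80-bit value in st0
    add     esp, 4
    ; ... further scalar x87 FP on st0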
Remember that these are IEEE754 floats; you normally don't want them in integer registers unless you're picking apart their bit-fields (e.g. for an exp or log implementation), but you can do that with MMX shift/mask instructions.
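As a hedged sketch of that kind of bit-field work (exp_mask is just an illustrative name for a constant, not anything standard), extracting the biased exponent of each packed float never has to leave the MMX registers:

    section .data
    align 8
    exp_mask:   dd 0xFF, 0xFF   ; per-lane mask for the 8 exponent bits

    section .text
    movq    mm1, mm0            ; copy the two packed floats
    psrld   mm1, 23             ; shift each 32-bit lane's exponent down to bit 0
    pand    mm1, [exp_mask]     ; mm1 = biased exponent of each lane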
But movd and fld are cheap, so they didn't bother making a special instruction just to save the reload latency. Also, it might have been slow to implement as a single instruction. Even though x86 is not a RISC ISA, having one really complex instruction is often slower than multiple simpler instructions (especially before decoding to multiple uops was fully a thing). e.g. Intel's sysenter and AMD's syscall instructions, which replace int 0x80 for system calls, require additional instructions before/after to save more state, but are still faster overall.
3dNow!'s femms leaves the MMX/3dNow! register contents undefined, only marking the x87 tag word as empty, without preserving the MMX register contents as x87 register contents the way emms does. See http://refspecs.linuxbase.org/AMD-3Dnow.pdf for an official AMD manual. IDK if AMD's microarchitectures just dropped the register-renaming info or what, but probably making store / femms / x87-load the fast way saves a lot of transistors.
Or perhaps even FEMMS is still somewhat slow, so they didn't want to encourage coders to leave and re-enter MMX/3dNow! mode very often.
Fun fact: 3dNow! PREFETCHW (prefetch with write intent) is still used, and has its own CPUID feature bit.
See my answer on What is the effect of second argument in _builtin_prefetch()?
Intel CPUs soon added support for decoding it as a NOP (so software like 64-bit Windows can use it without checking), but Broadwell and later actually prefetch with an RFO to get the cache line in MESI Exclusive state, rather than Shared, so it can flip to Modified without additional off-core traffic.
The CPUID feature bit indicates that it really will prefetch.
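If you're writing asm by hand rather than using the GCC builtin, a rough sketch of the feature check (I believe the flag is CPUID leaf 0x80000001, ECX bit 8, the 3DNowPrefetch / PRFCHW bit; production code should also verify that the extended leaf exists first):

    mov     eax, 0x80000001
    cpuid
    test    ecx, 1 << 8         ; PRFCHW / 3DNowPrefetch feature bit
    jz      no_prefetchw

    prefetchw [edi]             ; we intend to write this line, so ask for it
                                ; in Exclusive rather than Shared state
    no_prefetchw: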
Footnote 1:
Remember that the MMX registers alias the x87 registers, so no new OS support was needed to save/restore architectural state on context switches. It wasn't until SSE that we got new architectural state. So it wasn't until CPUs with both SSE2 and 3dNow! that converting a 3dNow! float to an SSE2 double could make sense without switching back to x87 mode. And you could do that with movq2dq xmm0, mm0 + cvtps2pd xmm0, xmm0.
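A quick sketch of that path (still MMX mode the whole time; you'd only need emms / femms before later x87 code, not for the SSE2 part):

    movq2dq  xmm0, mm0          ; copy the 64-bit MMX register into an XMM register
    cvtps2pd xmm0, xmm0         ; widen the two packed floats to two doubles
    ; xmm0 now holds both values as doubles, no store/reload needed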
They could have had a float->double conversion within an mm register, but the fld / fst hardware was only designed for float or double -> 80-bit and 80-bit -> float or double. And the use-case for that is limited; if you're using 3dNow!, just stick to float.