
The `_mm_load_ps()` SSE intrinsic is defined as an aligned load, faulting if the address is not 16-byte aligned. However, Visual Studio appears to generate an unaligned read instead.

Since not all compilers behave the same, this hides alignment bugs. It would be nice to be able to turn the actual aligned operations on, even though the performance penalty they used to carry no longer seems to exist.

In other words, writing code:

__m128 p1 = _mm_load_ps(data);

currently produces:

movups      xmm0,xmmword ptr [eax]

expected result:

movaps      xmm0,xmmword ptr [eax]

(I was asked by Microsoft to ask here.)
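In the meantime, one workaround is to verify alignment in software. Below is a minimal sketch of a debug wrapper (the name `load_ps_checked` is hypothetical) that asserts the 16-byte alignment `_mm_load_ps()` requires, so a misaligned load fails loudly even when the compiler emits `movups`:

#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical debug helper: assert the 16-byte alignment that
   _mm_load_ps() is documented to require, since the movups that
   MSVC emits will not fault on a misaligned address. */
static inline __m128 load_ps_checked(const float *p)
{
    assert(((uintptr_t)p & 15) == 0 && "misaligned _mm_load_ps");
    return _mm_load_ps(p);
}

Usage is a drop-in replacement: __m128 p1 = load_ps_checked(data);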

  • Correct, Intel since Nehalem has no penalty for `movdqu` loads / stores when the address doesn't cross a cache-line boundary (which includes the aligned case). AMD since Bulldozer has no penalty when it does cross a 32- or maybe 16-byte boundary. K10 has a penalty for `movups` stores but not loads even on aligned addresses, Core 2 has a penalty for both. I don't use MSVC so IDK if it has any tune option that takes those old CPUs into consideration. (gcc / clang use `movaps` whenever there's a compile-time alignment guarantee, so using one of those would be another option.) – Peter Cordes May 15 '20 at 13:47
  • https://stackoverflow.com/questions/42697118/visual-studio-2017-mm-load-ps-often-compiled-to-movups – Hans Passant May 15 '20 at 14:23
  • The question isn't whether there's a penalty or not, but whether this can be used as a debugging tool. – Jari Komppa May 16 '20 at 14:05
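As Peter Cordes's comment above notes, GCC and Clang emit `movaps` whenever there's a compile-time alignment guarantee, so cross-compiling with one of them is another way to surface misaligned loads. A minimal sketch to check this (the function name `load_one` is illustrative; inspect the assembly with a disassembler or on godbolt.org):

#include <xmmintrin.h>

/* Built with e.g. gcc -O2 or clang -O2 for x86-64, this load is
   emitted as movaps, which faults at run time on a misaligned
   address; MSVC emits movups for the same source. */
__m128 load_one(const float *data)
{
    return _mm_load_ps(data);
}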

0 Answers