From Agner Fog's "Optimizing software in C++":

There is a problem when mixing code compiled with and without AVX support on some Intel processors. There is a performance penalty when going from AVX code to non-AVX code because of a change in the YMM register state. This penalty should be avoided by calling the intrinsic function _mm256_zeroupper() before any transition from AVX code to non-AVX code. This can be necessary in the following cases:

• If part of a program is compiled with AVX support and another part of the program is compiled without AVX support then call _mm256_zeroupper() before leaving the AVX part.

• If a function is compiled in multiple versions with and without AVX using CPU dispatching then call _mm256_zeroupper() before leaving the AVX part.

• If a piece of code compiled with AVX support calls a function in a library other than the library that comes with the compiler, and the library has no AVX support, then call _mm256_zeroupper() before calling the library function.

I'm wondering which Intel processors these are. Specifically, are there any processors made in the last five years that have this penalty? I'd like to know whether it is too late to fix missing _mm256_zeroupper() calls or not.

asked by Alex Guteniev

2 Answers

TL:DR: Don't use the _mm256_zeroupper() intrinsic manually, compilers understand SSE/AVX transition stuff and emit vzeroupper where needed for you. (Including when auto-vectorizing or expanding memcpy/memset/whatever with YMM regs.)
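
For example, take a minimal sketch like this (function name and shape are mine). Compiled with e.g. gcc -O2 -mavx2 or MSVC /O2 /arch:AVX2, the compiler itself appends a vzeroupper before ret, because the YMM uppers are dirty and the return type is scalar; you can verify this on Godbolt:

```c++
#include <immintrin.h>
#include <stddef.h>

// Horizontal sum using 256-bit vectors. No _mm256_zeroupper() anywhere in
// the source: the compiler is responsible for the vzeroupper before ret.
float sum256(const float *p, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));      // fold 4 lanes -> 2
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));  // fold 2 lanes -> 1
    return _mm_cvtss_f32(s);                     // vzeroupper emitted here
}
```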


"Some Intel processors" being all except Xeon Phi.

Xeon Phi (KNL / KNM) don't have a state optimized for running legacy SSE instructions because they're purely designed to run AVX-512. Legacy SSE instructions probably always have false dependencies merging into the destination.

On mainstream CPUs with AVX or later, there are two different mechanisms: saving dirty uppers (SnB through Haswell, and Ice Lake) or false dependencies (Skylake). See Why is this SSE code 6 times slower without VZEROUPPER on Skylake? for the two different styles of SSE/AVX penalty.
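
To make the scenario concrete, here's a hypothetical two-TU sketch (file and function names are mine) of the kind of mix where those penalties would bite if compilers didn't insert vzeroupper:

```c++
// ---- avx_part.cpp : compiled with -mavx2 (VEX encodings) ----
#include <immintrin.h>
void scale8(float *p, float f) {
    __m256 v = _mm256_mul_ps(_mm256_loadu_ps(p), _mm256_set1_ps(f));
    _mm256_storeu_ps(p, v);   // dirties the YMM uppers; the compiler
}                             // emits vzeroupper before ret on its own

// ---- sse_part.cpp : compiled WITHOUT AVX (legacy-SSE encodings) ----
void scale8(float *p, float f);  // the AVX function above

void sse_caller(float *p) {
    scale8(p, 2.0f);   // if scale8 had returned with dirty uppers:
    p[0] += p[1];      // SnB..HSW / ICL: state-transition penalty here;
}                      // SKL: false dependency on the old register value
```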

Related Q&As about the effects of asm vzeroupper (in the compiler-generated machine code):


Intrinsics in C or C++ source

You should pretty much never use _mm256_zeroupper() in C/C++ source code. Things have settled on having the compiler insert a vzeroupper instruction automatically where it might be needed, which is pretty much the only sensible way for compilers to be able to optimize functions containing intrinsics and still reliably avoid transition penalties. (Especially when considering inlining.) All the major compilers can auto-vectorize and/or inline memcpy/memset/array init with YMM registers, so they need to keep track of dirty uppers and insert vzeroupper themselves after that anyway.
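
Even source with no intrinsics at all can dirty the uppers. This sketch (type and function names mine), built with gcc/clang at -O2 -mavx2, is typically inlined as two 32-byte YMM copies followed by a compiler-inserted vzeroupper:

```c++
#include <cstring>

struct Packet { char bytes[64]; };

// No intrinsics in sight, but with AVX enabled compilers commonly expand
// this memcpy with YMM loads/stores, then emit vzeroupper before ret.
void copy_packet(Packet *dst, const Packet *src) {
    std::memcpy(dst, src, sizeof(Packet));
}
```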

The convention is to have the CPU in clean-uppers state when calling or returning, except when calling functions that take __m256 / __m256i/d args by value (in registers or at all), or when returning such a value. The target function (callee or caller, respectively) inherently must be AVX-aware and expecting a dirty-upper state, because a full YMM register is in use as part of the calling convention.

x86-64 System V passes vectors in vector regs. Windows vectorcall does, too, but the original Windows x64 convention (now named "fastcall" to distinguish it from "vectorcall") passes vectors by value in memory via hidden pointer: the caller makes a copy and passes its address. (This optimizes for variadic functions by making every arg always fit in an 8-byte slot.) IDK how compilers handle non-vectorcall Windows calls: whether they assume the callee probably looks at its args, or at least is still responsible for using a vzeroupper at some point even if it doesn't. Probably yes, but if you're writing your own code-gen back-end, or hand-written asm, have a look at what some compilers you care about actually do if this case is relevant for you.
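
As a point of reference, opting into register passing on Windows looks like this (a sketch; the function name is mine):

```c++
#include <immintrin.h>

// MSVC / clang-cl: __vectorcall passes the __m256 args in YMM registers.
// Under the default Windows x64 convention they would instead be copied
// to memory and passed via hidden pointer.
__m256 __vectorcall scale(__m256 v, __m256 f) {
    return _mm256_mul_ps(v, f);
}
```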

Some compilers optimize by also omitting vzeroupper before returning from a function that took vector args, because clearly the caller is AVX-aware. And crucially, apparently compilers shouldn't expect that calling a function like void foo(__m256i) will leave the CPU in a clean-upper state, so the caller does still need a vzeroupper after such a call, before calling printf or whatever.
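
Putting those conventions together in one sketch (function names mine):

```c++
#include <immintrin.h>
#include <cstdio>

// Takes and returns __m256i, so both sides are inherently AVX-aware;
// some compilers therefore omit the vzeroupper before this ret.
__m256i add_vecs(__m256i a, __m256i b) {
    return _mm256_add_epi32(a, b);
}

void use_and_print(int *out) {
    __m256i v = add_vecs(_mm256_set1_epi32(1), _mm256_set1_epi32(2));
    _mm256_storeu_si256((__m256i *)out, v);
    // add_vecs may have returned with dirty uppers, so the compiler must
    // emit vzeroupper here, before the non-AVX-aware printf call.
    std::printf("%d\n", out[0]);
}
```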


Compilers have options to control vzeroupper usage

For example, GCC -mno-vzeroupper / clang -mllvm -x86-use-vzeroupper=0. (The default is -mvzeroupper, giving the behaviour described above: insert it where it might be needed.)
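
For instance (assumed file name), the difference is easy to see in the asm output:

```c++
// foo.cpp
#include <immintrin.h>
void double_them(float *p) {
    __m256 v = _mm256_loadu_ps(p);              // writes a full YMM register
    _mm256_storeu_ps(p, _mm256_add_ps(v, v));   // uppers are now dirty
}
// g++ -O2 -mavx2 -S foo.cpp                  -> ends with vzeroupper; ret
// g++ -O2 -mavx2 -mno-vzeroupper -S foo.cpp  -> ends with just ret
```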

-mno-vzeroupper is implied by -march=knl (Knights Landing) because vzeroupper is not needed and very slow on Xeon Phi CPUs (thus should actively be avoided).

Or you might possibly want it if you build libc (and any other libraries you use) with -mavx -mno-vzeroupper. glibc has some hand-written asm for functions like strlen, but most of those have AVX2 versions. So as long as you're not on an AVX1-only CPU, legacy-SSE versions of string functions might not get used at all.

For MSVC, you should definitely prefer using /arch:AVX when compiling code that uses AVX intrinsics. I think some versions of MSVC could generate code that caused transition penalties if you mixed __m128 and __m256 intrinsics without /arch:AVX. But beware that this option will make even 128-bit intrinsics like _mm_add_ps use the AVX encoding (vaddps) instead of legacy SSE (addps), and will let the compiler auto-vectorize with AVX. There is an undocumented switch /d2vzeroupper to enable automatic vzeroupper generation (the default); /d2vzeroupper- disables it. See What is the /d2vzeroupper MSVC compiler optimization flag doing?
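
A hypothetical MSVC invocation of the same idea (reusing the foo.cpp sketch above):

```c++
// cl /O2 /arch:AVX2 foo.cpp                  -> vzeroupper inserted (default)
// cl /O2 /arch:AVX2 /d2vzeroupper- foo.cpp   -> automatic insertion disabled
```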

answered by Peter Cordes
  • Wrote to Agner; he replied that he will mention that the compiler may add _mm256_zeroupper automatically in the next manual update – Alex Guteniev Aug 11 '21 at 09:56
  • @AlexGuteniev: Hopefully he'll actually say that `vzeroupper` (the asm instruction) is added automatically by the compiler. `_mm256_zeroupper` is an intrinsic, and compilers don't work by transforming the source, they work by emitting asm. It makes little to no sense to say that `_mm256_zeroupper()` is *added* automatically, just that compilers understand SSE-AVX transition effects well enough that it's not needed. – Peter Cordes Aug 11 '21 at 10:34
  • 2
    There is a good reason to disable automatic generation of `vzeroupper` - when you call your own AVX-vectorized functions between different translation units. If a function doesn't take or return vectors the compiler has to assume it expects legacy SSE state and generate `vzeroupper`. In this case one should disable automatic `vzeroupper` and insert the intrinsic manually for the given TUs where it matters. You can leave it enabled for other TUs. – Andrey Semashev Aug 13 '21 at 22:23
  • 1
    @PeterCordes, it is updated. You'd be disappointed: _The compiler may or may not insert _mm256_zeroupper() automatically. The assembly output from the compiler will tell what it does_ – Alex Guteniev Sep 01 '21 at 14:19

The AVX -> SSE penalty without zeroing the uppers still applies to current processors. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual, June 2021.

However, a missing _mm256_zeroupper() in C/C++ code is not necessarily a problem: the compiler may insert vzeroupper by itself. All major compilers do: https://godbolt.org/z/veToerhvG

Experiments show that automatic vzeroupper insertion works in VS 2015, but does not work in VS 2012.

answered by Alex Guteniev