Why are PUSHF and POPF so slow?

Question

The experiment is on 32-bit x86 Linux.

I am doing some static binary instrumentation work, and I am trying to insert the instructions below at the beginning of every basic block.

BB23 : push %eax

movl index,%eax
movl $0x80823d0,buf(,%eax,0x4)
add $0x1,%eax
cmp $0x400000,%eax
jle BB_23_stub
movl $0x0,%eax
BB_23_stub:movl %eax,index

pop %eax

Note that I need to use the cmp instruction, and to guarantee that the flags are restored to their original values, I use pushf and popf to save and restore the flags on the stack.

Then it becomes this:

 BB_23 :    push %eax
       pushf               
       movl index,%eax
       movl $0x17,buf(,%eax,0x4)
       add $0x1,%eax
       cmp $0x400000,%eax
       jle BB_23_stub
       movl $0x0,%eax
BB_23_stub:movl %eax,index
       popf             
       pop %eax

I tested the performance with and without pushf and popf, using gzip and bzip as benchmarks. To my surprise, the performance penalty can be as much as 3 times higher when pushf and popf are used!

However, without pushf and popf, the compression results of gzip and bzip are incorrect.

So here is my question:

Why are pushf and popf so slow? Am I using them correctly?

I cannot afford too much performance penalty introduced by pushf and popf. Is there any way I can avoid the high overhead and also keep the correct semantics? (protecting the value in flags, basically..)

Am I clear enough? Could anyone give me some help?

Oh, and replacing inc %eax with lea eax, [eax+1] (sorry, Intel Syntax, I don't really like AT&T syntax and don't know how to translate it right now) will avoid changing the flags like inc does. Now if i could just figure out how to do the and without changing flags and you could get rid of those pesky pushf and popf instructions ... — Guntram Blohm, Jul 14 '15 at 18:54
@GuntramBlohm. Brilliant!! I really really appreciate your kind help! It really saves my ass.. — lllllllllllll, Jul 14 '15 at 18:55
You seem to be starting with index 0, incrementing up to 0x400000 and wrapping around there. If you can afford to do it the other way round, you could misuse the loop instruction which doesn't change flags. Initialize your index to 0x400000, use ecx instead of eax, and to decrement and re-init on zero, use loop forward, mov $0x400000, %ecx, forward: movl %ecx, index. Consider the loop a decrement and jump if not zero. — Guntram Blohm, Jul 14 '15 at 19:16

score 9 · Accepted Answer · answered Jul 14 '15 at 19:27

Clever (some would say incomprehensible) misuse of x86 features could do this for you. The loop instruction will decrement the ecx register, jump if it's nonzero, and not modify flags. You can use this as a jump forward instruction as well, like this:

BB23:      push %ecx
           movl index, %ecx
           movl $0x17, buf-4(,%ecx,4)
           loop BB23_stub
           movl $0x400000, %ecx
BB23_stub: movl %ecx, index
           pop %ecx

Note that ecx runs from 0x400000 to 1 here, not from 0 to 0x3fffff, so I had to subtract 4 from the address of buf, and you need to read the buffer top to bottom when analyzing it. Don't forget to initialize index to 0x400000 somewhere at the start of your code. You'll have to test how much the branch in loop costs compared to how much removing pushf/popf gains.
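To make the index arithmetic easier to check, here is a rough C model of what the stub above does to index and buf. It is only a sketch: BUF_LEN is shrunk to a hypothetical 8 entries (the answer uses 0x400000), and the names record and ring_index are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BUF_LEN 8u                      /* hypothetical; the answer uses 0x400000 */

static uint32_t buf[BUF_LEN];
static uint32_t ring_index = BUF_LEN;   /* must start at BUF_LEN, not at 0 */

/* Models one execution of the instrumentation stub. */
static void record(uint32_t tag)
{
    uint32_t ecx = ring_index;          /* movl index, %ecx */
    assert(ecx >= 1 && ecx <= BUF_LEN); /* ecx runs from BUF_LEN down to 1 */
    buf[ecx - 1] = tag;                 /* movl $tag, buf-4(,%ecx,4) */
    if (--ecx == 0)                     /* loop: decrement, fall through on zero */
        ecx = BUF_LEN;                  /* movl $BUF_LEN, %ecx */
    ring_index = ecx;                   /* movl %ecx, index */
}
```

The first call writes the last slot and the BUF_LEN-th call writes slot 0, which is why the buffer has to be read top to bottom.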

Guntram Blohm

loop is really slow (7 uops, throughput of one per 5 cycles), but saving/restoring the flags is even slower. (pushf = 3 uops, popf = 9). (Intel SnB/Haswell). – Peter Cordes Jul 15 '15 at 03:11
Yes, it sounds like loop is your best option. Just keep in mind, that one loop is taking more execution resources, and more space in the uop cache, than all 6 other instructions combined. Hopefully it's still light-weight enough for your purposes. – Peter Cordes Jul 15 '15 at 14:06
Possibly faster: push %eax / lahf / use flags / sahf / pop %eax. load/store AH from/to flags are single-uop, single-cycle-latency instructions on current Intel and AMD CPUs. If you're instrumenting something that doesn't touch the MMX / x87 registers, you could use them for storing index, and maybe also for masking it after wraparound. Oh, I see there was already an answer with this. – Peter Cordes Jul 22 '15 at 06:46

Ian Cook · Answer 2 · 2015-07-18T20:07:40.527

If you look at lib/Target/X86/X86InstrInfo.cpp in the LLVM source code you can see that they prefer the LAHF and SAHF instructions to PUSHF and POPF for speed reasons. These instructions don't deal with the overflow flag OF, so it must be handled separately.

alt_pushf:        seto %al                  # save OF to AL (1 if set, else 0)
                  lahf                      # save other flags to AH
                  push %eax                 # push

alt_popf:         pop %eax                  # pop
                  addb $127, %al            # restore OF: 1 + 127 overflows, 0 + 127 does not
                  sahf                      # restore other flags

I don't know if this will be any faster than @GuntramBlohm's clever LOOP option so it might be worth benchmarking.

(Note that should you want to use this in future for 64-bit code you will need to check for the presence of the LAHF and SAHF instructions.)
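The addb $127 step is the least obvious part, so here is a small C model of why it restores OF. This is a sketch under the assumption that AL holds exactly what seto wrote (0 or 1); the helper names add8_sets_of and restored_of are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Signed overflow of an 8-bit add, i.e. the value OF takes after
   `addb b, a` on x86. */
static bool add8_sets_of(uint8_t a, uint8_t b)
{
    uint8_t sum = (uint8_t)(a + b);
    /* Overflow iff both operands have the same sign bit and the
       sign bit of the sum differs from it. */
    return ((~(a ^ b)) & (a ^ sum) & 0x80u) != 0;
}

/* saved_al is what `seto %al` produced: 1 if OF was set, else 0. */
static bool restored_of(uint8_t saved_al)
{
    assert(saved_al <= 1);
    return add8_sets_of(saved_al, 127);   /* addb $127, %al */
}
```

With a saved 1, the sum is 128, which does not fit in a signed byte, so OF comes back set; with a saved 0, the sum is 127 and OF comes back clear. The other flags that addb changes are then overwritten by sahf.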

score 2 · Answer 3 · answered Jul 22 '15 at 08:30

Posting a 2nd answer for a different method, combining cmov to avoid a skip-1-instruction branch with @Ian Cook's nice lahf/sahf.

       push   %ecx
       movl   index, %ecx

       push   %eax
       seto   %al            # save OF to AL
       lahf                  # save other flags to AH

       movl   $0x17,  buf(,%ecx,0x4)
       sub    $1, %ecx           # sub rather than dec: dec leaves CF unchanged, so cmovc needs sub
       cmovc  buflen, %ecx       # load buflen constant from memory on wraparound

       addb $127, %al            # restore OF
       sahf                      # restore other flags
       pop %eax

       movl   %ecx,index
       pop %ecx

This is 14 insns, all single-uop single cycle latency (on Intel). So it's probably still slower than the LOOP version, except for not affecting the branch predictor if this code is duplicated all over the place.

With Intel ADX (add-with-carry using CF or OF, to allow two dep chains in parallel), you can avoid clobbering the overflow flag. But it doesn't take an immediate arg, so you need a constant (-4) in memory. You need to detect wrapping around zero, and avoid cmp. This instruction set extension was first supported in Broadwell (barely available for desktops, and not even all currently-for-sale laptops have it.)

Anyway, using clc / adcx minus_one, %ecx in place of the decrement would save one instruction net (it adds a clc but removes the seto and addb $127 that save/restore the overflow flag), which isn't much. 13 uops is still more than my other answer, which uses an MMX reg for sub/mask to avoid touching flags.

Another possibility is using lea, and zeroing the high bits with a non-flag-affecting left and right shift (BMI2 (Haswell) instruction set's SHLX / SHRX). This avoids touching flags entirely:

       push   %ecx
       movl   index, %ecx

       movl   $0x17,  buf(,%ecx,0x4)
       lea    -1(%ecx), %ecx
       push   %eax
       movl   $bit_count, %eax   # 32 - significant bits in buflen
       shlx   %eax, %ecx, %ecx   # shift count has to be in a reg
       shrx   %eax, %ecx, %ecx
       pop    %eax

       movl   %ecx,index
       pop %ecx

Well crap, no-flag shifts are only available as (Intel syntax) shrx r32a, r/m32, r32b, where the r/m32 operand is the value to be shifted, not the shift count. And an immediate shift count isn't available either, so I still needed to push/pop eax to get a 2nd register.

So this is 11 uops on Intel, all single-cycle latency. It still doesn't beat the mmx version.
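For anyone checking the shift trick, here is a C model of the lea / shlx / shrx sequence. It is a sketch only: the function name is made up, and bit_count = 10 in the usage below assumes the 0x400000-entry buffer from the question (22 significant bits, so 32 - 22 = 10).

```c
#include <assert.h>
#include <stdint.h>

/* Models: lea -1(%ecx),%ecx ; shlx %eax,%ecx,%ecx ; shrx %eax,%ecx,%ecx
   where %eax holds bit_count = 32 - (significant bits in buflen). */
static uint32_t dec_and_wrap_shift(uint32_t index, unsigned bit_count)
{
    assert(bit_count > 0 && bit_count < 32);
    uint32_t ecx = index - 1u;   /* lea: 0 wraps to 0xffffffff, no flags touched */
    ecx <<= bit_count;           /* shlx: push the unwanted high bits out */
    ecx >>= bit_count;           /* shrx: logical shift back, high bits now zero */
    return ecx;
}
```

The shift pair gives the same result as masking with 0x3fffff, so an index of 0 wraps to 0x3fffff and every other index simply decrements.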

score 1 · Answer 4 · edited May 23 '17 at 12:37

What if you have index count downward, and unconditionally mask it to handle wraparound, instead of using a conditional? Hmm, AND sets all the flags, including OF (which isn't saved/restored with lahf/sahf). You could use an MMX register, but PAND doesn't have an immediate form, so you'd need to have the constant in memory.

BB23:      push %ecx
           # movq %mm0, -8(%esp)   # not safe if a signal handler fires while data is below the stack.
            #  x86 has no red-zone.  But we can't sub $16, %esp  without clobbering flags
           movq   %mm0, save_mm0
           movd   index, %mm0
           psubd  one, %mm0      #  mmx has no dec-by-one
           pand   my_mask, %mm0   # (0x400000-1).  0-max -> untouched.  all-1s after wraparound -> max
           movd   %mm0, %ecx
           movl   $0x17, buf(,%ecx,4)
           # movq   -8(%esp), %mm0
           movq   save_mm0, %mm0
           movl   %ecx, index
           pop    %ecx

On Intel, this is 10 uops, so it's potentially faster than the version using LOOP. It drops to 8 if the code you're instrumenting doesn't use MMX (or doesn't use SSE, if you use an XMM register instead), because then you can skip saving / restoring a vector reg. Jumps interrupt the flow of uops from the decoders or uop cache, so being branchless is another point in its favour.

It needs another 8 bytes of constants. If they're in the same cache-line as index, that's not a big deal. It does take significantly more instruction bytes. On the upside, it's branchless, so inserting it all over the place won't pollute the branch predictor with a lot of taken branches. (Arranging branches so the non-taken case is the common one would be better. The save/restore flags version could use a cmov from a zeroed memory location, instead of a branch.)

On SnB and newer, the scaled-offset version of store might not micro-fuse. If the immediate data doesn't count as a 3rd input dependency, then it still can. Otherwise, scale everything up by 4, including the constant for psubd, then the store is movl $0x17, buf(%ecx).

My first version was going to save %mm0 on the stack, but there's no push for MMX regs. That would have made it 11 uops, counting the stack-engine synchronization uop inserted before the movq %mm0, -8(%esp), since it follows a stack instruction (push).
