
The experiment is on 32-bit x86 Linux.

I am doing some static binary instrumentation work: I am trying to insert the instructions below at the beginning of every basic block.

BB_23:      push  %eax
            movl  index, %eax
            movl  $0x80823d0, buf(,%eax,0x4)
            add   $0x1, %eax
            cmp   $0x400000, %eax
            jle   BB_23_stub
            movl  $0x0, %eax
BB_23_stub: movl  %eax, index
            pop   %eax

Note that I need the cmp instruction, which modifies the flags. To guarantee that the flags are restored to their original values, I use pushf and popf to save them on the stack and restore them afterwards.

Then it becomes this:

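BB_23:      push  %eax                   # save the scratch register
            pushf                        # save EFLAGS (add and cmp below modify them)
            movl  index, %eax            # load the current buffer index
            movl  $0x17, buf(,%eax,0x4)  # write this block's value into buf[index]
            add   $0x1, %eax             # advance the index
            cmp   $0x400000, %eax
            jle   BB_23_stub             # no wrap while index <= 0x400000
            movl  $0x0, %eax             # wrap around to the start of the buffer
BB_23_stub: movl  %eax, index            # store the updated index
            popf                         # restore EFLAGS
            pop   %eax                   # restore the scratch register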

I tested the performance with and without pushf and popf, using gzip and bzip as benchmarks. To my surprise, the performance penalty can increase by as much as 3 times once the pushf and popf instructions are added!

However, without pushf and popf, the compression results of gzip and bzip are incorrect.

So here is my question:

Why are pushf and popf so slow? Am I using them correctly?

I cannot afford the performance penalty introduced by pushf and popf. Is there any way to avoid the high overhead while keeping the correct semantics (that is, preserving the value of the flags)?

Am I clear enough? Could anyone give me some help?

lllllllllllll
  • Just an idea without any claim to correctness: 1) pushf might wreak havoc on the instruction pipeline, since it needs all flags valid, while most other instructions don't care about the flags. In the same way, the pipeline might get delayed by popf if an instruction that needs a flag follows. 2) I'd replace your add - cmp - jle - mov combo with inc %eax; and $0x3fffff, %eax, which should speed up the code a bit since it avoids a branch. This won't help you with flags, however; I don't see a way to do this without touching flags. – Guntram Blohm Jul 14 '15 at 18:44
  • Oh, and replacing inc %eax with lea eax, [eax+1] (sorry, Intel syntax; I don't really like AT&T syntax and don't know how to translate it right now) will avoid changing the flags like inc does. Now if I could just figure out how to do the and without changing flags, you could get rid of those pesky pushf and popf instructions ... – Guntram Blohm Jul 14 '15 at 18:54
  • @GuntramBlohm. Brilliant!! I really really appreciate your kind help! It really saves my ass.. – lllllllllllll Jul 14 '15 at 18:55
  • You seem to be starting with index 0, incrementing up to 0x400000 and wrapping around there. If you can afford to do it the other way round, you could misuse the loop instruction, which doesn't change flags. Initialize your index to 0x400000, use ecx instead of eax, and to decrement and re-init on zero, use: loop forward; mov $0x400000, %ecx; forward: movl %ecx, index. Think of loop as "decrement and jump if not zero". – Guntram Blohm Jul 14 '15 at 19:16
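
A minimal sketch of the flag-free variants hinted at in the comments above, not a tested drop-in replacement. It assumes the index and buf symbols from the question, uses %ecx instead of %eax as the scratch register, and shrinks the ring to exactly 0x400000 slots (the original stub uses indices 0 through 0x400000 inclusive). The labels BB_23_lea and BB_23_loop are made up for illustration. Every instruction used here (push, pop, mov, lea, jecxz, loop) leaves EFLAGS untouched, so neither pushf nor popf is needed.

```asm
# Variant 1: count up, wrap with lea + jecxz.
# Assumption: index holds values 0 .. 0x3fffff.
BB_23_lea:       push  %ecx
                 movl  index, %ecx
                 movl  $0x17, buf(,%ecx,4)     # buf[index] = 0x17
                 lea   -0x3fffff(%ecx), %ecx   # ecx = index - 0x3fffff (0 only at the last slot)
                 jecxz BB_23_lea_stub          # last slot: the new index is 0, already in ecx
                 lea   0x400000(%ecx), %ecx    # otherwise ecx = index + 1
BB_23_lea_stub:  movl  %ecx, index
                 pop   %ecx

# Variant 2: count down with loop, as described in the last comment.
# Assumption: index is initialised to 0x400000 and holds values 1 .. 0x400000;
# slot index-1 is written, so slots 0 .. 0x3fffff are used.
BB_23_loop:      push  %ecx
                 movl  index, %ecx
                 movl  $0x17, buf-4(,%ecx,4)   # buf[index - 1] = 0x17
                 loop  BB_23_loop_stub         # ecx -= 1; jump if ecx != 0
                 movl  $0x400000, %ecx         # reached 0: start over
BB_23_loop_stub: movl  %ecx, index
                 pop   %ecx
```

Both jecxz and loop take only an 8-bit displacement, which is enough for a jump this short. loop is known to be slow on many Intel CPUs, so it seems worth benchmarking both variants against the pushf/popf version before committing to one.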