Assembly Code - GCC optimized vs not

Question

I want to learn more about how GCC optimizes C programs. I disassembled a random function both optimized and unoptimized, and I want to look at some of the differences. Off the top of my head, the optimized assembly has fewer jumps and seems to use registers mostly, while the unoptimized version uses memory more often. What other differences between the two are worth noting?

C code

uint countPairsUpTo(int index, int* intArray, int first, int second)
{
  uint i;
  uint sum = 0;

  for (i = 0; i < index; i++)
    if ((first == intArray[i]) && (second == intArray[i+2]))
      sum++;

  return sum ;
}

Unoptimized

0x080485b1 <countPairsUpTo+0>:  push   %ebp
0x080485b2 <countPairsUpTo+1>:  mov    %esp,%ebp
0x080485b4 <countPairsUpTo+3>:  sub    $0x10,%esp
0x080485b7 <countPairsUpTo+6>:  call   0x8048418 <mcount@plt>
0x080485bc <countPairsUpTo+11>: movl   $0x0,-0x4(%ebp)
0x080485c3 <countPairsUpTo+18>: movl   $0x0,-0x8(%ebp)
0x080485ca <countPairsUpTo+25>: jmp    0x80485fa <countPairsUpTo+73>
0x080485cc <countPairsUpTo+27>: mov    -0x8(%ebp),%eax
0x080485cf <countPairsUpTo+30>: shl    $0x2,%eax
0x080485d2 <countPairsUpTo+33>: add    0xc(%ebp),%eax
0x080485d5 <countPairsUpTo+36>: mov    (%eax),%eax
0x080485d7 <countPairsUpTo+38>: cmp    0x10(%ebp),%eax
0x080485da <countPairsUpTo+41>: jne    0x80485f6 <countPairsUpTo+69>
0x080485dc <countPairsUpTo+43>: mov    0xc(%ebp),%edx
0x080485df <countPairsUpTo+46>: add    $0x8,%edx
0x080485e2 <countPairsUpTo+49>: mov    -0x8(%ebp),%eax
0x080485e5 <countPairsUpTo+52>: shl    $0x2,%eax
0x080485e8 <countPairsUpTo+55>: lea    (%edx,%eax,1),%eax
0x080485eb <countPairsUpTo+58>: mov    (%eax),%eax
0x080485ed <countPairsUpTo+60>: cmp    0x14(%ebp),%eax
0x080485f0 <countPairsUpTo+63>: jne    0x80485f6 <countPairsUpTo+69>
0x080485f2 <countPairsUpTo+65>: addl   $0x1,-0x4(%ebp)
0x080485f6 <countPairsUpTo+69>: addl   $0x1,-0x8(%ebp)
0x080485fa <countPairsUpTo+73>: mov    0x8(%ebp),%eax
0x080485fd <countPairsUpTo+76>: cmp    -0x8(%ebp),%eax
0x08048600 <countPairsUpTo+79>: ja     0x80485cc <countPairsUpTo+27>
0x08048602 <countPairsUpTo+81>: mov    -0x4(%ebp),%eax
0x08048605 <countPairsUpTo+84>: leave  
0x08048606 <countPairsUpTo+85>: ret

Optimized

0x08048570 <countPairsUpTo+0>:  push   %ebp
0x08048571 <countPairsUpTo+1>:  mov    %esp,%ebp
0x08048573 <countPairsUpTo+3>:  push   %edi
0x08048574 <countPairsUpTo+4>:  push   %esi
0x08048575 <countPairsUpTo+5>:  push   %ebx
0x08048576 <countPairsUpTo+6>:  call   0x8048418 <mcount@plt>
0x0804857b <countPairsUpTo+11>: mov    0xc(%ebp),%ebx
0x0804857e <countPairsUpTo+14>: mov    0x10(%ebp),%esi
0x08048581 <countPairsUpTo+17>: mov    0x8(%ebp),%ecx
0x08048584 <countPairsUpTo+20>: mov    $0x0,%edi
0x08048589 <countPairsUpTo+25>: test   %ecx,%ecx
0x0804858b <countPairsUpTo+27>: je     0x80485b2 <countPairsUpTo+66>
0x0804858d <countPairsUpTo+29>: mov    $0x0,%edi
0x08048592 <countPairsUpTo+34>: mov    $0x0,%edx
0x08048597 <countPairsUpTo+39>: cmp    %esi,(%ebx,%edx,4)
0x0804859a <countPairsUpTo+42>: jne    0x80485ab <countPairsUpTo+59>
0x0804859c <countPairsUpTo+44>: mov    0x14(%ebp),%eax
0x0804859f <countPairsUpTo+47>: cmp    %eax,0x8(%ebx,%edx,4)
0x080485a3 <countPairsUpTo+51>: sete   %al
0x080485a6 <countPairsUpTo+54>: movzbl %al,%eax
0x080485a9 <countPairsUpTo+57>: add    %eax,%edi
0x080485ab <countPairsUpTo+59>: add    $0x1,%edx
0x080485ae <countPairsUpTo+62>: cmp    %ecx,%edx
0x080485b0 <countPairsUpTo+64>: jne    0x8048597 <countPairsUpTo+39>
0x080485b2 <countPairsUpTo+66>: mov    %edi,%eax
0x080485b4 <countPairsUpTo+68>: pop    %ebx
0x080485b5 <countPairsUpTo+69>: pop    %esi
0x080485b6 <countPairsUpTo+70>: pop    %edi
0x080485b7 <countPairsUpTo+71>: pop    %ebp
0x080485b8 <countPairsUpTo+72>: ret

"Optimized" can mean different things, depending on your goal - optimize for speed, optimize for size, optimize overall... And there are lots of optimization options in GCC. Have you read the documentation about those options? — DCoder, Apr 03 '14 at 04:28
I compiled the optimized program with GCC with O1 optimization. Does that help? — Terry Schmidt, Apr 03 '14 at 04:32
Like everyone mentioned, it could mean many different things, and what gets optimized out or 'in' would depend on how the compiler chooses to interpret the logic of the program. Here is a blog post of one particular feature of optimization called Dead Code Elimination or Code Motion ( http://bangreverse.me/blog/?p=27 ) — gandolf, Jun 10 '14 at 19:12

yaspr · Answer 1 · 2014-06-11T07:00:39.117

Well, there are numerous optimization techniques performed by GCC. They range from dead code elimination to loop unrolling, function inlining and many others. Before explaining the optimization techniques, I'll start with what compilers do before optimizing.

Optimizations are usually performed on the IR (intermediate representation) of the provided code. Most compilers translate the input code (C, C++, Fortran, ...) into IR. This translation is performed by the *front-end* which feeds the IR to the *middle-end* that will apply optimization passes and feed it up to the *back-end* which will then generate machine code.

In GCC the IR is called GIMPLE and is presented briefly in this link. What GCC also does is convert its GIMPLE representation of the given code into SSA form (Static Single Assignment), which is described in this 1989 publication.

If you follow this link you'll find all of the SSA optimization passes that GCC implements with a brief description. And I think you should also check the RTL passes. Here you'll find two directories full of detailed descriptions of what GCC exactly does. These are the most reliable references you can find since they are from the GCC people.

What you have to know is that some optimizations are performed multiple times and in a different order. For example, dead code elimination, which consists of eliminating islands (blocks that can no longer be reached) from the CFG (Control Flow Graph) of the program, can be performed again after other optimizations that create new islands.

Here's an example. Suppose you have code that looks like this:

   int x = 1;

   //Some code that doesn't alter the value of x

   if (x == 2)
     { BLOCK1; }
   else
     { BLOCK2; }

Now, suppose that the compiler applies optimizations in this order:

Dead code elimination
Constant propagation

The first pass will, of course, find no islands in the CFG. The second one (constant propagation), on the other hand, will figure out that the value of x doesn't change and that it is different from 2 all the way from its declaration and initialization to its use in the if condition. This means that the condition is always false, so the whole if statement can be replaced by BLOCK2, which leaves BLOCK1 as an island. Thus a dead code elimination pass is necessary after constant propagation.
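The example above can be sketched as two C functions that should behave identically. This is only an illustration: `block1` and `block2` are hypothetical stand-ins for BLOCK1 and BLOCK2, and the second function is hand-written to show what is left once the value of x is known, not actual GCC output. In GCC, the analysis that proves x == 1 at the if is constant propagation (the `-ftree-ccp` flag).

```c
#include <assert.h>

/* Hypothetical stand-ins for BLOCK1 and BLOCK2. */
static int block1(void) { return 10; }
static int block2(void) { return 20; }

/* As written: x is initialised to 1 and never modified before the test. */
int before_optimization(void)
{
    int x = 1;

    if (x == 2)
        return block1();
    else
        return block2();
}

/* Hand-written result: the condition folds to false, so the block1()
   branch becomes unreachable and is removed as dead code. */
int after_optimization(void)
{
    return block2();
}

/* Quick self-check that both versions agree. */
void check_dead_code_example(void)
{
    assert(before_optimization() == after_optimization());
}
```

Running the elimination pass only before the value of x is known would leave the `block1()` call in place, which is why the order of the passes matters.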

Another optimization is loop unrolling. It consists of duplicating the loop body and increasing the stride in order to fill the CPU pipeline and have better cache locality. For example, this reduction loop can be unrolled 4 times in order to gain some cycles and improve cache access:

   //Original (not unrolled) version
   for (int i = 0; i < N; i++)
       r += t[i];

   //Unrolled version handling any value of N
   for (int i = 0; i < (N & ~3); i += 4)
     {
       r += t[i];
       r += t[i + 1];
       r += t[i + 2];
       r += t[i + 3];
     }

   //Handling the rest of the array elements
   for (int i = (N & ~3); i < N; i++)
       r += t[i];

If we suppose that t is an array of floats (sizeof(float) = 4 bytes), unrolling 4 times implies accessing 16 bytes per iteration. If this code is run on an Intel Sandy Bridge, it will perform well, but not great (if prefetching isn't activated). Why? Because the cache line size on the Sandy Bridge microarchitecture is 64 bytes, so the loop accesses only 1/4 of a cache line per iteration. Unrolling 16 times, so that one iteration covers a full cache line (16 x 4 bytes = 64 bytes), should result in much better performance.

The difficulty with this optimization technique is finding the right unroll factor for the right loop. Choosing a value too large or too small can result in performance drops. Also, the underlying architecture plays a very important role (the cache sizes of a Pentium and a Haswell differ, and so do their reorder buffers and pipelines). What some compilers do is perform static or dynamic analysis on a loop with different unroll factors and choose the one with the best profile.
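To make the unrolling pattern above testable, here is a minimal self-contained sketch. It assumes an array of int rather than float so that the two versions give bit-identical results (reordering floating-point additions can change rounding); the function names are made up for this illustration, and the unroll factor of 4 is just the one used in the example.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward reduction: one element per iteration. */
long sum_rolled(const int *t, size_t n)
{
    long r = 0;
    for (size_t i = 0; i < n; i++)
        r += t[i];
    return r;
}

/* Same reduction unrolled by 4, plus a remainder loop
   for the last n % 4 elements. */
long sum_unrolled4(const int *t, size_t n)
{
    long r = 0;
    size_t main_end = n & ~(size_t)3;   /* largest multiple of 4 <= n */
    size_t i;

    for (i = 0; i < main_end; i += 4) {
        r += t[i];
        r += t[i + 1];
        r += t[i + 2];
        r += t[i + 3];
    }
    for (; i < n; i++)                  /* leftover elements */
        r += t[i];
    return r;
}

/* Self-check with a length that is not a multiple of 4. */
void check_unrolling(void)
{
    int t[11];
    for (int i = 0; i < 11; i++)
        t[i] = i + 1;                   /* 1 + 2 + ... + 11 = 66 */
    assert(sum_rolled(t, 11) == 66);
    assert(sum_unrolled4(t, 11) == 66);
}
```

Timing the two functions on large arrays with different unroll factors is one way to see how strongly the best factor depends on the machine.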

These optimizations are usually performed on the IR, but they can also be performed on assembly code. There are other optimizations that are tightly related to assembly, for example instruction reordering (scheduling). This optimization consists of changing the order of instructions in order to get better pipelining or more instruction-level parallelism. Sometimes, if a dependency exists between instructions, and if reordering induces great gains in performance, the dependency will be broken and the instructions reordered. The code below shows how reordering instructions can result in a much better construct:

   //Ordered
   mov eax , [@1]
   add eax , ecx
   mov [@0], eax    
   mul ecx , ebx
   sub edx , eax

   //Reordered 
   mov eax , [@1]
   add eax , ecx
   sub edx , eax
   mul ecx , ebx
   mov [@0], eax

It is obvious that the ordered and reordered code perform the same operations, but the reordered version will consume fewer cycles than the ordered one because the add, sub, and mul instructions, being of the same class, will not block one another in the pipeline. You can also notice that the memory operations were put at the extremities of the code. This pattern usually results in better performance than code with mixed-up instructions, especially when such code is inside a loop or a basic block that is executed often.

There are plenty of interesting optimizations: auto-vectorization, function inlining, memory alignment, ... but unfortunately this page isn't large enough for me to cover them all. You can check GCC's source code and manuals for more information. The references I pointed out above are quite helpful. If you manage to digest most of what they have to offer, I'm sure you'll find what you are looking for.

score 5 · Answer 2 · edited Jun 10 '14 at 21:08

According to the GCC manual, -O1, which you mentioned in the comments, turns on the following flags:

-fauto-inc-dec -fcprop-registers -fdce -fdefer-pop 
-fdelayed-branch -fdse -fguess-branch-probability 
-fif-conversion2 -fif-conversion -finline-small-functions 
-fipa-pure-const -fipa-reference -fmerge-constants 
-fsplit-wide-types -ftree-builtin-call-dce -ftree-ccp 
-ftree-ch -ftree-copyrename -ftree-dce -ftree-dominator-opts 
-ftree-dse -ftree-fre -ftree-sra -ftree-ter -funit-at-a-time

You can read more about these flags in the GCC man page. I would also recommend (if you haven't done so yet) reading Aho's Dragon Book.
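One of these flags can be related to the listings in the question. In the optimized code, the second comparison is no longer followed by a jump: `sete %al`, `movzbl %al,%eax` and `add %eax,%edi` add the 0/1 result of the comparison directly to the sum. This is plausibly the work of the if-conversion passes (`-fif-conversion`, `-fif-conversion2`), though which pass produced it is an assumption here. A rough C-level sketch of the effect, with a hypothetical `_branchless` variant written by hand:

```c
#include <assert.h>

/* Branching form, following the question's source
   (uint is assumed to be unsigned int). */
unsigned int countPairsUpTo(int index, int *intArray, int first, int second)
{
    unsigned int i;
    unsigned int sum = 0;

    for (i = 0; i < (unsigned int)index; i++)
        if ((first == intArray[i]) && (second == intArray[i + 2]))
            sum++;

    return sum;
}

/* Hand-written equivalent of what the optimized listing does: the
   second test no longer branches, its 0/1 result is added to sum. */
unsigned int countPairsUpTo_branchless(int index, int *intArray,
                                       int first, int second)
{
    unsigned int i;
    unsigned int sum = 0;

    for (i = 0; i < (unsigned int)index; i++)
        if (first == intArray[i])
            sum += (second == intArray[i + 2]);   /* adds 0 or 1 */

    return sum;
}

/* Self-check: both forms must count the same pairs. Note that the
   function reads intArray[i + 2], so the array needs index + 2 items. */
void check_if_conversion(void)
{
    int a[8] = {1, 0, 2, 1, 5, 2, 9, 9};
    assert(countPairsUpTo(6, a, 1, 2) == 2);
    assert(countPairsUpTo_branchless(6, a, 1, 2) == 2);
}
```

Compiling both functions with -O1 and comparing the disassembly is a quick way to check whether your GCC version produces the same code for them.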
