
I ran a small benchmark on my M1 MacBook and got a strange result: the Intel (x86_64) binary runs faster than the native arm64 binary. What is wrong with my experiment?

```
$ arch
arm64
$ make
arch -x86_64 cc fib.c -o x86.out
arch -arm64 cc fib.c -o arm64.out
file *.out
arm64.out: Mach-O 64-bit executable arm64
x86.out:   Mach-O 64-bit executable x86_64
time ./x86.out
Fibonacci 45 is 1134903170
        8.32 real         7.54 user         0.01 sys
time ./arm64.out
Fibonacci 45 is 1134903170
       10.21 real         9.77 user         0.01 sys
```

$ cat fib.c

```c
#include <stdio.h>

int fibonacci(int n);

int main(void) {
  int n = 45;
  // Compute and print the nth Fibonacci number
  printf("Fibonacci %i is %i\n", n, fibonacci(n));
  return 0;
}

int fibonacci(int n) {
  if (n == 0 || n == 1) {
    return n;
  } else {
    return fibonacci(n-1) + fibonacci(n-2);
  }
}
```

Per the comments, I have enabled -O3, but the Intel binary still runs slightly faster.

```
$ make
arch -x86_64 cc fib.c -o x86.out -O3
arch -arm64 cc fib.c -o arm64.out -O3
file *.out
arm64.out: Mach-O 64-bit executable arm64
x86.out:   Mach-O 64-bit executable x86_64
time ./x86.out
Fibonacci 45 is 1134903170
        3.46 real         3.33 user         0.00 sys
time ./arm64.out
Fibonacci 45 is 1134903170
        3.57 real         3.45 user         0.00 sys
```
anonaka

  • Looks like you don't have any optimization turned on. – Retired Ninja Jul 07 '21 at 03:00
  • x86-64 -> ARM64 binary translation is probably doing optimization, but `arch -arm64 cc` is making [anti-optimized / debug-mode](https://stackoverflow.com/questions/53366394/why-does-clang-produce-inefficient-asm-with-o0-for-this-simple-floating-point) native machine code that runs directly, without any later steps to fix that. – Peter Cordes Jul 07 '21 at 03:11
  • [edit] your question with your new info on `-O3` builds. That time difference barely looks statistically significant, just 3.46 real vs. 3.57 real. Is it repeatable even with warm-up runs to get the CPU clock speed and caches ramped up before you run? ([Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)) e.g. run `for i in {1..10}; do time ./a.out; done` and look at the fastest couple of runs for each binary. If there's a real effect, it's not big enough for the binary->binary optimizer to have replaced the dumb double-recursion... – Peter Cordes Jul 07 '21 at 03:16
  • This raises the question of how to see what ARM64 machine code the CPU is *actually* running when you execute the x86_64 binary. i.e. the results of dynamic translation. AFAIK MacOS uses a mostly(?) ahead-of-time translation, spending significant CPU time doing an optimizing translation once and caching it for reuse. – Peter Cordes Jul 07 '21 at 04:20
  • Also, is that still just the same single set of `-O3` runs you did before, without any of the warm-up / suggestions I mentioned? Running them in the opposite order can be useful. If your CPU is thermally limited, clock speed might drop off a tiny bit some time during testing, making the first one run faster in time if not cycles. So maybe monitor CPU frequency while this runs, or do a long warm-up run so your CPU reaches a steady-state. – Peter Cordes Jul 07 '21 at 04:29

0 Answers