3

I have a simple tagged union of values. The values can either be int64_ts or doubles. I am performing addition on the these unions with the caveat that if both arguments represent int64_t values then the result should also have an int64_t value.

Here is the code:

#include<stdint.h>
union Value {
  int64_t a;
  double b;
};

enum Type { DOUBLE, LONG };

// Value + type.
struct TaggedValue {
  Type type;
  Value value;
};

void add(const TaggedValue& arg1, const TaggedValue& arg2, TaggedValue* out) {
  const Type type1 = arg1.type;
  const Type type2 = arg2.type;
  // If both args are longs then write a long to the output.
  if (type1 == LONG && type2 == LONG) {
    out->value.a = arg1.value.a + arg2.value.a;
    out->type = LONG;
  } else {
    // Convert argument to a double and add it.
    double op1 = type1 == LONG ? (double)arg1.value.a : arg1.value.b; // Why isn't CMOV used?
    double op2 = type2 == LONG ? (double)arg2.value.a : arg2.value.b; // Why isn't CMOV used? 
    out->value.b = op1 + op2;
    out->type = DOUBLE;
  }
}

The output of gcc at -O2 is here: http://goo.gl/uTve18 Attached here in case the link doesn't work.

add(TaggedValue const&, TaggedValue const&, TaggedValue*):
    cmp DWORD PTR [rdi], 1
    sete    al
    cmp DWORD PTR [rsi], 1
    sete    cl
    je  .L17
    test    al, al
    jne .L18
.L4:
    test    cl, cl
    movsd   xmm1, QWORD PTR [rdi+8]
    jne .L19
.L6:
    movsd   xmm0, QWORD PTR [rsi+8]
    mov DWORD PTR [rdx], 0
    addsd   xmm0, xmm1
    movsd   QWORD PTR [rdx+8], xmm0
    ret
.L17:
    test    al, al
    je  .L4
    mov rax, QWORD PTR [rdi+8]
    add rax, QWORD PTR [rsi+8]
    mov DWORD PTR [rdx], 1
    mov QWORD PTR [rdx+8], rax
    ret
.L18:
    cvtsi2sd    xmm1, QWORD PTR [rdi+8]
    jmp .L6
.L19:
    cvtsi2sd    xmm0, QWORD PTR [rsi+8]
    addsd   xmm0, xmm1
    mov DWORD PTR [rdx], 0
    movsd   QWORD PTR [rdx+8], xmm0
    ret

It produced code with a lot of branches. I know that the input data is pretty random i.e it has a random combination of int64_ts and doubles. I'd like to have at least the conversion to a double done with an equivalent of a CMOV instruction. Is there any way I can coax gcc to produce that code? I'd ideally like to run some benchmark on real data to see how the code with a lot of branches does vs one with fewer branches but more expensive CMOV instructions. It might turn out that the code generated by default by GCC works better but I'd like to confirm that. I could inline the assembly myself but I'd prefer not to.

The interactive compiler link is a good way to check the assembly. Any suggestions?

Rajiv
  • 2,547
  • 2
  • 19
  • 31
  • 1
    The CMOV instruction could use more cycles on your platform? It could have something to do with pipelining? Or something else completely. Just because there exists instructions in the instruction set that seem to be more convenient doesn't actually mean they are better in all cases. – Some programmer dude May 19 '15 at 18:18
  • 1
    @JoachimPileborg no doubt that CMOV could be more expensive. I wanted to know if there is a way to coax GCC to output a version with CMOV instructions just so that I could compare. – Rajiv May 19 '15 at 18:20
  • 6
    @Rajiv: How do you propose to use `CMOV` for this? Before demanding that the compiler be able to optimize in this specific way, why don't you check that it is even feasible? (Hint: It's not. You would need to unconditionally convert to `double`, then `CMOV` the double-type from the union. x86 has only integer `CMOV`, and `FCMOV` in the obsolete x87 FPU, not in `SSE/AVX`. Transferring the values to integer registers, `CMOV`ing and moving back to `xmm`-registers would be insane) – EOF May 19 '15 at 23:53
  • @EOF thanks that makes sense. – Rajiv May 20 '15 at 00:51

0 Answers0