
I'm learning assembler for MicroPython (ARM Thumb2 instruction set for PyBoard).

Is there a quicker way to check the sign (positive/negative) of an FPU register (s0) than this?

@micropython.asm_thumb
def float_array_abs(r0, r1):
    label(LOOP)
    vldr(s0, [r0, 0])
    vmov(r2, s0)         # 1
    cmp(r2, 0)           # 2
    itt(mi)              # 3
    vneg(s0, s0)
    vstr(s0, [r0, 0])
    add(r0, 4)
    sub(r1, 1)
    bgt(LOOP)

This works, but it doesn't seem like the 'right' solution (I'm not sure the sign of r2 always matches the sign of s0), and I suspect it must be possible in fewer than two instructions.

UPDATE 1:

Based on the comments (thanks) I have improved the speed of the code further:

@micropython.asm_thumb
def float_array_abs1(r0, r1):
    label(LOOP)
    ldr(r2, [r0, 0])
    cmp(r2, 0)         # this works for some reason
    bge(SKIP)
    vmov(s0, r2)
    vneg(s0, s0)
    vstr(s0, [r0, 0])  # this can be skipped if not negative
    label(SKIP)
    add(r0, 4)
    sub(r1, 1)
    bgt(LOOP)

But it still leaves the question, is this a robust way of determining sign of an FP value?

For reference here are the byte representations of four float values on my system:

-1.0 0xbf800000
-0.0 0x80000000
 0.0 0x00000000
 1.0 0x3f800000
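
A minimal sketch in ordinary Python (run on a PC, not MicroPython inline assembler) that reproduces the table above and shows why an integer `cmp(r2, 0)` sees the float's sign. It assumes the values are IEEE 754 single-precision; the helper names `float_bits` and `as_signed32` are made up for illustration.

```python
import struct

def float_bits(x):
    """Return the 32-bit single-precision pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def as_signed32(bits):
    """Reinterpret an unsigned 32-bit pattern as a two's-complement integer."""
    return bits - (1 << 32) if bits & 0x80000000 else bits

for value in (-1.0, -0.0, 0.0, 1.0):
    bits = float_bits(value)
    # Bit 31 is the float's sign bit and also the integer's sign bit,
    # so a signed "less than zero" test on the raw word reports the float's sign.
    print("%4s 0x%08x sign_bit=%d int_is_negative=%s"
          % (value, bits, bits >> 31, as_signed32(bits) < 0))
```

Note that -0.0 has the sign bit set, so the integer test reports it as negative even though -0.0 < 0 is false in a floating-point comparison.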

I guess if this is hardware dependent then I shouldn't be relying on this to determine the sign...

I think this might be the 'proper' way to do it (i.e. proper FPU comparison):

@micropython.asm_thumb
def float_array_abs2(r0, r1):
    mov(r2, 0)
    vmov(s1, r2)
    label(LOOP)
    vldr(s0, [r0, 0])
    vcmp(s0, s1)
    vmrs(APSR_nzcv, FPSCR)
    itt(mi)
    vneg(s0, s0)
    vstr(s0, [r0, 0])
    add(r0, 4)
    sub(r1, 1)
    bgt(LOOP)

But I timed this and it is 11% slower than the code above (float_array_abs1). So it would be nice to use the earlier code if it is a reliable solution.

UPDATE 2:

@Ped7g proposed clearing the sign bit by ANDing each value with 0x7FFFFFFF (see comments).

I tested this and it does work. Here is the code:

@micropython.asm_thumb
def float_array_abs3(r0, r1):
    movwt(r3, 0x7FFFFFFF)
    label(LOOP)
    ldr(r2, [r0, 0])
    and_(r2, r3)
    str(r2, [r0, 0])
    add(r0, 4)
    sub(r1, 1)
    bgt(LOOP)

CORRECTION: It is faster than float_array_abs1 above. This appears to be the best solution but is it robust?
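
As a cross-check, here is a rough reference model of the same loop in ordinary Python (not the inline assembler). It is only a sketch: it assumes `array('f')` stores 4-byte IEEE 754 singles and that the native `'I'` format is also 4 bytes, which holds on common platforms; `float_array_abs_model` is a hypothetical name.

```python
from array import array

SIGN_MASK = 0x7FFFFFFF  # every bit set except bit 31, the sign bit

def float_array_abs_model(values):
    """Clear the sign bit of each 32-bit float in place, like the asm loop."""
    # Reinterpret the float buffer as unsigned 32-bit words (no copy).
    words = memoryview(values).cast("B").cast("I")
    for i in range(len(words)):
        words[i] &= SIGN_MASK
    return values

data = array("f", [-1.0, -0.0, 0.0, 1.0, -2.5])
float_array_abs_model(data)
print(list(data))
```

On the PyBoard the assembler version would be called with the array and its length, for example `float_array_abs3(data, len(data))`.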

Bill
  • If the floating point value is stored internally as IEEE-754 encoded value (or very similar way), the top bit is signum, i.e. you can test the top bit and when it is non-zero, the value is negative (or it may be one of NaN/Inf/.. special float values, then the test of top bit is insufficient to say what exactly is there). I'm not sure if you can test top bit in `s0` directly, or if there's simple way to copy `s0` value into integer register and do the test there, I already forgot most of the ARM assembly, so this is just suggestion for your consideration (maybe leading nowhere). – Ped7g Apr 01 '18 at 21:05
  • But if I understood your code a bit, I think you are loading the float value from memory into `s0`, so instead you can skip all the floating point instruction, and load it just as binary value for the top bit test, if all the simplifications of such test are OK with your purpose. – Ped7g Apr 01 '18 at 21:06
  • That explains why the existing code works - the top bit is also the sign bit for an int. – Bill Apr 01 '18 at 22:26
  • well... that's not 1:1 replacement, because with floating point value you can have "negative" zero, which when tested by top bit only will be reported as negative, and if you attach test like `if (x < 0)` to that, it will fail with `-0` being evaluated as `< 0`, so make sure such simplification does fit your purpose (usually when performance is needed, such shortcuts fits nicely, but as you have in tags something like python, maybe you may need more robust solution, preferring accuracy over performance)... (for example for "abs" being done by compare+neg this is perfect, as it will fix -0 too) – Ped7g Apr 01 '18 at 22:34
  • And not being completely sure about IEEE-754, but for "array abs" you may actually just do `and array[i],0x7FFFFFFF` for every 32 bit float (or even for the top bytes, skipping the other 3, if the byte access is faster on your platform), avoiding the branching (if that ARM has slower branching than the read-modify-write of every element). Anyway, not sure what you are after, because this doesn't look like ARM thumb assembly, but some kind of high level language trying to disguise as assembly, i.e. lose-lose situation, rather don't waste too much time with it. – Ped7g Apr 01 '18 at 22:36
  • @Ped7g your proposal to `and array[i],0x7FFFFFFF` seems to work. Is that the answer then? Is it a robust solution? (accuracy on all systems is probably more important here). If so, feel free to post this as the answer. – Bill Apr 02 '18 at 03:34
  • that depends, I can't recall from head what bit patterns are used for NaN and +-Inf and other undefined states in IEEE-754, nor did you confirm your FPU does use full IEEE-754 (for example x87 slightly deviates from the standard IIRC, or it was in early versions and now it is not, I really don't care about FPU details that much, I'm just app programmer, where I have to account for FP inaccuracies all the time anyway). What should happen in the corner cases? If you are free to define it in junk-in-junk-out way, then and-ing the top bit is proper solution as far as I can tell. – Ped7g Apr 02 '18 at 07:55
  • Please take into consideration that you are diverting from your original question. The title was about a sign test, but the answer is about "array abs", which may have caused certain more knowledgeable people to skip through this in the wrong way, not noticing my advice, even if they knew some common pitfalls. I'm an old-school asm coder, used to doing everything in integer math anyway, and now just using FP in common ways, so I have a shamefully poor idea about those special values and error handling. Maybe even delete this, and repost a proper question (about validity/pitfalls), and specify your FPU details. – Ped7g Apr 02 '18 at 07:59
  • And the proposed code with `and` can be in most common situations sped up by 1 instruction by having offset going from negative value (-n*4) to zero (adjusting it by fixed offset +n*4 in ldr/str base ptr). – Ped7g Apr 02 '18 at 08:02

1 Answer


Masking the sign bit to 0 with an AND is safe and optimal for IEEE 754 binary floating-point formats like float and double.

It will convert -Inf to +Inf as desired. It will convert -NaN into +NaN, but it's still a NaN.

NaN is indicated by all-ones exponent and non-zero significand. Inf is all-ones exponent with zero significand. (https://en.wikipedia.org/wiki/Single-precision_floating-point_format)

Most code doesn't care about the payload or sign of a NaN, just that it is NaN, so clearing the sign bit is fine.
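
To make those encodings concrete, here is a small sketch in ordinary Python (assuming IEEE 754 single-precision; the helper names `to_bits`, `from_bits`, `classify`, and `masked_abs` are invented for this example). It checks the exponent and significand fields and applies the AND mask to the special values.

```python
import math
import struct

def to_bits(x):
    """32-bit single-precision pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def from_bits(bits):
    """Interpret an unsigned 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def classify(bits):
    """Return 'inf', 'nan', or 'finite' from the exponent/significand fields."""
    exponent = (bits >> 23) & 0xFF      # 8 exponent bits
    significand = bits & 0x7FFFFF       # 23 significand bits
    if exponent == 0xFF:
        return "inf" if significand == 0 else "nan"
    return "finite"

def masked_abs(x):
    """Absolute value by clearing only the sign bit (bit 31)."""
    return from_bits(to_bits(x) & 0x7FFFFFFF)

print(masked_abs(float("-inf")))             # -Inf becomes +Inf
print(masked_abs(-0.0))                      # -0.0 becomes +0.0
print(math.isnan(masked_abs(float("nan"))))  # a NaN stays a NaN
```

The mask leaves the exponent and significand untouched, which is why Inf stays Inf and NaN stays NaN.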


ARM can do this with integer SIMD NEON instructions for 4 single-precision floats at a time. I don't know whether VFP (the non-NEON hardware FPU) supports an AND instruction.

Related: "Fastest way to compute absolute value using SSE". AND is the best way on x86 as well.


BTW, doing this in a separate loop is probably a waste of memory bandwidth. Doing the absolute value on the fly in loops that read the array is probably best, unless you read this array many times after writing it once. At least if you can do the AND in an FP register. Loading into an integer register for AND and then moving from integer to FP for math instructions would be bad.

Usually you want more computational intensity in your loops (do more ALU work for each load from memory).

Peter Cordes