Fastest way to count number of 1s in a register, ARM assembly

Question

So I had an interview question before regarding bit manipulation. The company is a well known GPU company. I had very little background in assembly language (weird despite being a phd student in computer architecture) and as this narrative would indicate, I botched it. The question was a simple:

"Write a fast code that will count the number of 1's in a 32-bit register."

Now I am in the process of studying arm assembly. So naturally I revisited this problem again and have come up with this code just by studying the ISA.

For you arm experts out there, is this correct? Is there a faster way to do this? Being a beginner, I naturally think this is incomplete. The AND instruction in "xx" feels redundant but there is no other way to shift a register in ARM isa...

R1 will contain the number of bits at the end while R2 is the register with bits we want to count. r6 is just a dummy register. Comments are enclosed in ()

    MOV   R1, #0                (initialize R1 and R6 to zero)
    MOV   R6, #0        
xx: AND   R6, R6, R2, LSR #1    (Right shift by 1, right most bit is in carry flag)
    ADDCS R1, #1                (Add #1 to R1 if carry  flag is set)
    CMP R2, #0                  (update the status flags if R2 == 0 or not)
    BEQ xx                      (branch back to xx until R2==0)

What is fastest also depends on the number of bits you expect to be set. With very few 1's, [Kernighan's bit counter](http://stackoverflow.com/questions/12383002/please-explain-the-logic-behind-kernighans-bit-counting-algorithm) will win as it runs one round of the loop for each bit set. — Bo Persson, Apr 01 '13 at 11:38
Your algorithm takes `4*32+2` instructions for all `1`. Dave Seal's algorithm takes six instructions plus a memory access; it will probably be as fast for all types of input unless memory is extremely slow. I doubt few people would get Dave Seal's solution. The `reverse subtract` is just weird to me. — artless noise, Apr 01 '13 at 13:39

score 12 · Answer 1 · edited Oct 08 '17 at 19:32

If this code is fast or not depends on the processor. For sure it will be not very fast on Cortex-A8 but may run very fast on Cortex-A9 and newer CPU.

It is however a very short solution.

Expects input in r0, and returns output in r0

  vmov.32 d0[0], r0
  vcnt.8  d0, d0
  vmov.32 r0, d0[0]

  add r0, r0, r0, lsr #16
  add r0, r0, r0, lsr #8
  and r0, r0, #31

The main work is done in the vcnt.8 instruction which counts the bits of each byte in a NEON register and stores the bitcount back into the bytes of D0.

There is no vcnt.32 form, only .8, so you need to horizontally add the 4 bytes together, which is what the rest of the code is doing.

`vmov.32 r0, d0[0]` causes a VERY big sync delay between NEON and Core — user3124812, Aug 12 '16 at 01:18

auselen · Answer 2 · 2015-01-14T11:22:08.087

Best references for bit hacks are

Hacker's Delight (print)
Bit Twiddling Hacks (online)

Bit Twiddling Hacks page says

The best method for counting bits in a 32-bit
integer v is the following:

v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count

Then I would suggest you to use gcc and objdump (or this great online gcc tool) to see how this high level snippet would look like as arm instructions.

00000000 <popcount>:
 0: 1043        asrs    r3, r0, #1
 2: f003 3355   and.w   r3, r3, #1431655765 ; 0x55555555
 6: 1ac0        subs    r0, r0, r3
 8: 1083        asrs    r3, r0, #2
 a: f000 3033   and.w   r0, r0, #858993459  ; 0x33333333
 e: f003 3333   and.w   r3, r3, #858993459  ; 0x33333333
12: 18c0        adds    r0, r0, r3
14: eb00 1010   add.w   r0, r0, r0, lsr #4
18: f000 300f   and.w   r0, r0, #252645135  ; 0xf0f0f0f
1c: eb00 2000   add.w   r0, r0, r0, lsl #8
20: eb00 4000   add.w   r0, r0, r0, lsl #16
24: 1600        asrs    r0, r0, #24
26: 4770        bx  lr

So it looks like this gives you result in 12 instructions, which roughly can translate to same amount of cycles.

Comparing integer twiddling above to look up table approach used by libgcc, look up table should be even slower considering extra memory accesses.

00000028 <__popcountSI2>:
28: b410        push    {r4}
2a: 2200        movs    r2, #0
2c: 4c06        ldr r4, [pc, #24]   ; (48 <__popcountSI2+0x20>)
2e: 4613        mov r3, r2
30: fa40 f103   asr.w   r1, r0, r3
34: 3308        adds    r3, #8
36: 2b20        cmp r3, #32
38: b2c9        uxtb    r1, r1
3a: 5c61        ldrb    r1, [r4, r1]
3c: 440a        add r2, r1
3e: d1f7        bne.n   30 <__popcountSI2+0x8>
40: 4610        mov r0, r2
42: bc10        pop {r4}
44: 4770        bx  lr
46: bf00        nop
48: 00000000    andeq   r0, r0, r0
<.. snipped ..>

How can you place a 32-bit constant into a 32-bit instruction? — phuclv, Jan 14 '15 at 09:29
like this `and.w r0, r0, #252645135 ; 0xf0f0f0f` the constant's size is 32 bits — phuclv, Jan 14 '15 at 10:46
@LưuVĩnhPhúc well, that's compilers output so I assume it is very possible. Idea is if constant has low entropy you can stuff that inside an instruction. There should be an efficient way to load say 0x80000000 into a register, otherwise it would be a poor ISA. — auselen, Jan 14 '15 at 11:12
The `and.w` instruction isn't a vanilla AND, you can tell from the `f` prefix in the conditional field of the opcode. In this case, it is given an immediate of a single byte, `0x0f`, which it applies to all 4 bytes of the register input, effectively `0x0f0f0f0f` — davenpcj, Jul 01 '16 at 20:59
@phuclv see Operand2 in ARM. Flexible second operand supports complex packing of form 0xXYXYXYXY, for instance, where X and Y are 4-bit nibbles encoded in the instruction. Also supports variable shift on the constant value. — Thomas O, Apr 14 '20 at 19:48

score 7 · Answer 3 · edited May 23 '17 at 11:46

7

Since this is tagged ARM, the clz instruction is most helpful. The problem is also described as a population count. gcc has __builtin_popcount() for this. As does the ARM tools. There is this link (don't feel bad about your solution, some one made a web page with nearly the same) and also there is Dave Seal's version with six instruction for non-clz ARMs. The clz is advantageous and can be used to produce a faster algorithm, depending on the input.

As well as auselen's good reading suggestion,^{Hacker's Delight} this bit twiddling blog maybe helpful which is talking about such things in a graphic context. At least I found it useful to understand some of Qt's blitting code. However, it has some usefulness in coding a population count routine.

The carry add unit is useful in a divide and conquer sense, making the problem O(ln n). clz is more useful if the data has runs of ones or zeros.

The Hacker's Delight entry has more background on Dave Seal's ARM code.

edited May 23 '17 at 11:46

Community

1
1

answered Apr 01 '13 at 13:17

artless noise

19,723
5
59
95

1

I think __builtin_popcount is compiled from look up table implementation, it should be good enough but nothing specially crafted for arm. – auselen Apr 02 '13 at 08:38
@auselen: check what ARM architecture level you're targeting. `clz` is available only since ARMv5. – Igor Skochinsky Apr 02 '13 at 08:45
@IgorSkochinsky I don't think there is a `clz` implementation provided by `gcc` or `llvm`. `__builtin_popcount()` links to `__popcountsi2`. – auselen Apr 02 '13 at 08:55
Well, popcount does more than `clz` so it needs a helper routine. `__builtin_clz` or `__builtin_ctz` emit `clz` directly for supported architectures. – Igor Skochinsky Apr 02 '13 at 09:09
@IgorSkochinsky may be I should clarify by saying I don't think there is any popcount implementation using `clz` - at least I couldn't see any. – auselen Apr 02 '13 at 09:24
@auselen You are correct. I looked at `libgcc` too, only PowerPC is currently better, but at least you only pay for the table once. I already have 7 edits. I think actually the `mul` will give a better speed increase. `clz` is *linear*. If you can preserve bit count with a hash, you can use a table to get the mapping 2^32 maps to 2^5; or maybe Dave Seal's is already the best *pure ARM* , which wouldn't be a surprise. – artless noise Apr 02 '13 at 13:28
1

For instance, a *population count* on binary image/picture data may benefit from `clz`, especially if it has a lot of *low frequency* content. I think the question was asked in the context of graphics. – artless noise Apr 02 '13 at 13:56

score 5 · Accepted Answer · answered Apr 01 '13 at 02:01

5

You could use a precomputed look up table and reduce the number of iterations to 2 or 4.

You could also use the logarithmic approach.

For more info see this Wikipedia article.

answered Apr 01 '13 at 02:01

Alexey Frunze

59,618
10
77
173

1

Looks like Hamming-weight is a pretty well-known problem. Thanks! – dean Apr 01 '13 at 02:14
2

+1 for the hamming weight article. Note that, specifically for ARM, it mentions that Neon SIMD instructions provide `vcnt`, http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHCFDBJ.html to perform exactly this operation. There are also, for SIMD-type instruction sets. methods to speed it up using shuffling instructions as table lookups (every SSE2 register can be used as a 16-byte lookup table via `pshufb`, or in case of ARM Neon, `vtbl`). See http://stackoverflow.com/questions/3693981/bit-popcount-for-large-buffer-assembly-preferred – FrankH. Apr 01 '13 at 08:07

score 2 · Answer 5 · answered Apr 26 '16 at 21:04

2

    LDR r0, = 0x000000FF;
    MOV r1, #0;
    MOV r3, #0; this will always be zero
    MOV r2,r0;
rep MOVS r2, r2, LSR #1;
    ADC r1,r1, r3;  this adds r1 with zero plus the carry bit
    CMP r2, #0;
    BNE rep

This will do it, r3 is just a dummy register with 0 to make ADC work properly.

answered Apr 26 '16 at 21:04

Avsharian

31
1

Shifting 1 bit at a time is not fast. – Peter Cordes Sep 22 '17 at 10:55
1

@Avsharian: "CMP" is redundant here (do you see why?). The same with r3: #0 would be enough "to make ADC work properly" (as to the initial "LDR", I'm just wondering...). Keep an eye if you're writing in assembly, at least on StackOverflow :) – avs333 Dec 01 '17 at 06:24

score 2 · Answer 6 · answered Oct 08 '17 at 17:50

2

long count_bits_long (long);

    vmov.32 d0[0], r0       // R0 --> SIMD

    vcnt.8  d0, d0          // count bits in bytes
    vpaddl.u8 d0, d0        // add adjacent pairs of bytes and put into 16b words
    vpaddl.u16 d0, d0       // add adjacent pairs of 16b words and put into 32b word

    vmov.32 r0, d0[0]       // SIMD --> R0

    mov pc, lr              // return

answered Oct 08 '17 at 17:50

rick-rick-rick

175
1
5

This is @Nils's answer, but NEON horizontal sum instead of scalar. Which CPUs is this faster on? – Peter Cordes Oct 08 '17 at 19:36

Fastest way to count number of 1s in a register, ARM assembly

6 Answers6

Linked