Implementing naive register allocation for x86 machines

Question

I'm writing a toy compiler targeting on x86-64 machines. But I confront several problems when implementing register allocation with linear intermediate representation. I use NASM and GCC to generating executable file.

In intermediate representation of the origin code, we assume there are infinite registers(memory is regarded as a kind of virtual register). I divided them into two types:

Single registers: This kind of virtual register represent a typical variable.
Offset registers: This kind of virtual register is used to represent the access to an element of an array or a member of a struct. For example, when translating instructions such like a.x = 3 and b[i] = 3 into intermediate representation, we may represent it as mov [$a + 0], 3 and mov [$b + $i * 4], 3(4 is the size of an integer), where $a, $b and $i are single registers and [$a + 0] and [$b + $i * 4] are offset registers.

However, in x86-64 machine, memory access can only be represented as [$reg1 + $reg2 * imm1 + imm2], where $reg1 and $reg2 are physical registers and imm1 and imm2 are immediate number. I cannot understand how register allocation algorithms deal with the case that the algorithm marks $va and $vb as spilled node with instruction mov [$va + $vb * 4], 3. In other words, $va and $vb must be a physical register rather than memory access, but if a virtual register is marked as spilled node, it will be regarded as a memory access to the stack frame.

For example, I get the following origin C-like code:

int main() {
    int [] a = new int[10];
    int i = 3;
    a[i] = 4;
}

And I translate it into following intermediate representation:

call malloc, 10
mov $a, $retval       ; $retval is the return value of malloc
mov $i, 3
mov [$a + $i * 4], 4

However, after I have allocated registers, I find the final code becomes:

push rbp
mov rbp, rsp
sub rsp, 16
mov rdi, 10
call malloc
mov qword [rbp-8], rax
mov qword [rbp-16], 3
mov [qword [rbp-8] + qword [rbp-16] * 4], 3  <- Compliation error occurs here

I wonder if there is a nice implementation to solve this problem. Thanks in advance.

UPD: Basile-Starynkevitch gave me a paper showing how GCC solve this problem. And I'm looking for some (probably) simpler methods.

https://stackoverflow.com/q/44212741/841108 got the same question — Basile Starynkevitch, May 27 '17 at 04:08
Did you consider compiling your language to some low level C code and leave the burden of register allocation to the C compiler? — Basile Starynkevitch, May 27 '17 at 04:22
The last line is easily solvable. qword [rbp-8] is already in rax. for qword [rbp-16] * 4 you need a temp. register, like rbx.
=> mov qword [rbp-16] * 4, rbx mov [rax + rbx], 3 — rurban, May 27 '17 at 15:42
@rurban I mean in register allocation algorithms, all registers are used up. In other words, I cannot find such temp register unless I specifically pick up some registers to save the subscription qword [rbp-16] * 4. And you know, In CISC-based machine, the number of registers is less than RISC, which means I cannot pick such temporary registers. — Kipsora Lawrence, May 29 '17 at 15:09
@KipsoraLawrence: Either you have a free physical register, or if not you need to use a stack slot. So if you're out, you need another [rbp-24] instead of rbx. Just the relative addressing will get tricky. If not possible you need to spill to get a free physical register. On 64bit you usually have a free register, just on x86 it will get tricky. — rurban, May 29 '17 at 15:34

Basile Starynkevitch · Answer 1 · 2017-05-27T06:18:24.023

You probably won't get a detailed answer, because your approach is flawed. An entire book is needed to explain that.

So read first the (latest edition of) the Dragon Book. It has several chapters related to your concerns.

Read also the wikipage on register allocation and on A-normal form and SSA form. GCC has many papers and slides on register allocation, e.g. on Graph Coloring RA and on LRA

Your compiler should be made of several passes. For example GCC has more than three hundred passes.

Each pass should produce a valid internal representation and often decorate the internal representation with annotations, or transform it to some other one.

You need to ensure that your internal representations are consistent with your register allocation passes. I guess that you need to add extra passes and more internal representations (including more for spills).

Notice that register allocation is a very difficult subject. For GCC, you can find some people working on it full time for dozen of years.

I would believe that your Offset registers is a flawed concept (and so is your Single registers).

Registers are locations and should hold values (not source variables). For example, t[s.f + k] = t[s.f + k] + 2; (where s is some structure with a field f and k is some integer variable) should be compiled to put s.f + k in a register

Thanks a lot. The paper is what I need. @Basile-Starynkevitch — Kipsora Lawrence, May 27 '17 at 06:09

score 1 · Accepted Answer · answered Jun 01 '17 at 00:27

The primary problem with your approach is that by lowering your code to the level of assembly instructions before you've allocated values to registers, you lose the ability to vary the assembly you need to produce on the basis of where a value needs to come from. If, however, you defined a more abstract representation that could then be translated down to assembly language after register allocation, that would make matters easier.

Consider a toy expression-calculation language, and the expression a + 2*b + c[a+d]. This might be reduced to an intermediate representation that looks something like this:

v1 <= int32[a]       ; meaning: load the 32-bit int value at address "a"
v2 <= int32[b]
v3 <= 2*v2  (~v2)    ; (~nnn) means we no longer need to track this value. we could
                     ; determine this automatically, but for simplicity it's
                     ; easier to have it defined statically
v4 <= v1 + v3 (~v1,v3)
v5 <= int32[a]
v6 <= int32[d]
v7 <= v5 + v6 (~v5,v6)
v8 <= int32[c + 4*v7] (~v7)
v9 <= v4 + v8 (~v4,v8)

Using a very simple (and somewhat naive) register allocator that can only allocate two registers, EAX and EBX (a restriction put in place so that we can show what happens when you run out of registers), you may start by annotating the locations of each live value:

v1 <= int32[a]              [EAX=v1, EBX=-]
v2 <= int32[b]              [EAX=v1, EBX=v2]
v3 <= 2*v2  (~v2)           [EAX=v1, EBX=v3]
v4 <= v1 + v3 (~v1,v3)      [EAX=v4, EBX=-]
v5 <= int32[a]              [EAX=v4, EBX=v5]
v6 <= int32[d]

... OK, so we don't have a register available here. So we need to temporarily store a value. Because this register allocator is hopelessly naive, it doesn't realise that there are better ways than storing a value that came directly from memory into a temporary memory location. It just picks the value that will not be needed for the longest and adds an extra instruction that stores that into temporary storage:

t1 <= v4                     [EAX=-, EBX=v5, t1=v4] 
v6 <= int32[d]               [EAX=v6, EBX=v5, t1=v4]
v7 <= v5 + v6 (~v5,v6)       [EAX=v7, EBX=-, t1=v4]
v8 <= int32[c + 4*v7] (~v7)  [EAX=v8, EBX=-, t1=v4]
v9 <= v4 + v8 (~v4,v8)

So now it needs to use a value from temporary storage, so it inserts another instruction to pull it back in:

v4 <= t1                     [EAX=v8, EBX=v4, t1=v4]
v9 <= v4 + v8 (~v4,v8)       [EAX=v9]

So now the registers are allocated, we translate the intermediate code into assembly language:

; local variables: a = [ebp+4], b=[ebp+8], c=[ebp+12], d=[ebp+40]
; temporary storage available at [ebp-8]
;v1 <= int32[a]              [EAX=v1, EBX=-]
mov eax, dword [ebp+4]
;v2 <= int32[b]              [EAX=v1, EBX=v2]
mov ebx, dword [ebp+8]
;v3 <= 2*v2  (~v2)           [EAX=v1, EBX=v3]
shl ebx, 1
;v4 <= v1 + v3 (~v1,v3)      [EAX=v4, EBX=-]
add eax, ebx
;v5 <= int32[a]              [EAX=v4, EBX=v5]
mov ebx, dword [ebp + 4]
;t1 <= v4                     [EAX=-, EBX=v5, t1=v4] 
mov dword [ebp-8], eax    
;v6 <= int32[d]               [EAX=v6, EBX=v5, t1=v4]
mov eax, dword [ebp + 40]
;v7 <= v5 + v6 (~v5,v6)       [EAX=v7, EBX=-, t1=v4]
add eax, ebx
;v8 <= int32[c + 4*v7] (~v7)  [EAX=v8, EBX=-, t1=v4]
mov eax, dword [ebp + eax*4 + 12] 
;v4 <= t1                     [EAX=v8, EBX=v4, t1=v4]
mov ebx, dword [ebp-8]
;v9 <= v4 + v8 (~v4,v8)       [EAX=v9]
add eax, ebx

The one remaining question is what would happen if the operation for calculating v8 were instead:

v8 <= int32[c + 33*v7 + 4] (~v7)  [EAX=v8, EBX=-, t1=v4]

Then, we'd need to generate a series of instructions, but that wouldn't be a problem. We know that EAX is available, because it's being clobbered by the result of this computation, so we can use it as temporary storage without needing to go through allocation:

imul eax, 33
mov eax, dword [ebp + eax + 16]

Implementing naive register allocation for x86 machines

2 Answers2