I would like to translate simple x86_64 machine code into LLVM IR so that it can be analyzed later. For my particular use case, I need to work directly with instructions and opcodes, and I don't have access to the binary itself.
As I understand it, I should be able to convert x86 instructions using tools such as rellume and remill. With their help I am able to produce LLVM IR; however, I am not sure whether the results I am getting are correct.
First I need to create machine code for a very simple application (this is just for testing purposes):
- Compile the source code [1]:
  gcc simple.c -o simple.o
- Disassemble using objdump [2]:
  objdump -d simple.o
- At this point, I get separate functions add and main.
Then, I pass the function that I want to translate into LLVM IR to remill as bytes:
- Translate the add function into LLVM IR using remill (bytes = add function as bytes).
- The result should be the LLVM IR of the add function.
docker run --rm \
-it remill \
--arch amd64 \
--ir_out /dev/stdout \
--bytes f30f1efa554889e54883ec10be03000000bf01000000e8cdffffff8945fc8b45fcc9c3
My questions:
- Is my current workflow to translate x86 instructions into LLVM IR correct? Am I missing something? (I am aware of tools such as McSema; however, for my use case I need to be able to transform opcodes.)
- How can I verify the produced LLVM IR?
  - After producing LLVM IR of an even simpler example [3], I tried to run it with lli unsuccessfully.
- [1] Source code
int add(int a, int b)
{
    return a + b;
}

int main()
{
    int c = add(1, 3);
    return c;
}
- [2] Dump of objdump
...
0000000000001129 <add>:
1129: f3 0f 1e fa endbr64
112d: 55 push %rbp
112e: 48 89 e5 mov %rsp,%rbp
1131: 89 7d fc mov %edi,-0x4(%rbp)
1134: 89 75 f8 mov %esi,-0x8(%rbp)
1137: 8b 55 fc mov -0x4(%rbp),%edx
113a: 8b 45 f8 mov -0x8(%rbp),%eax
113d: 01 d0 add %edx,%eax
113f: 5d pop %rbp
1140: c3 ret
0000000000001141 <main>:
1141: f3 0f 1e fa endbr64
1145: 55 push %rbp
1146: 48 89 e5 mov %rsp,%rbp
1149: 48 83 ec 10 sub $0x10,%rsp
114d: be 03 00 00 00 mov $0x3,%esi
1152: bf 01 00 00 00 mov $0x1,%edi
1157: e8 cd ff ff ff call 1129 <add>
115c: 89 45 fc mov %eax,-0x4(%rbp)
115f: 8b 45 fc mov -0x4(%rbp),%eax
1162: c9 leave
1163: c3 ret
...
- [3] Even simpler example
int main()
{
    int val = 2;
    return val;
}