
I have ~20,000 .asm files produced by IDA Pro (Hex-Rays).

They were all created from known malware, and all from 32-bit Windows Portable Executables.

I do not have the original executables, just the disassembled output (.asm) files.

  • What I am trying to obtain is a list of all the mnemonics (e.g. add, xor, jmp) that IDA could write into an .asm file.

    With this list I will attempt a machine-learning malware-classification task, using grep (or similar) to compile statistics.

Inspecting the files visually, I have handcrafted a list of 30 or so (jmp, push, mov, call, lea, etc.) with help from this site, which lists common instructions: http://www.strchr.com/x86_machine_code_statistics.

Are there any clues in the headers of these files that could help define the possible mnemonics? Are they consistent across platforms, or specific to some attribute of the original file?

I searched IDA Pro's documentation, and it seems all the functionality for this is available during disassembly, but I am stuck with only the .asm files to parse.

Similar questions, which did not help:

Parsing IDA Pro .asm files

IDA Pro List of Functions with Instruction

Sample .asm header:

       ;
       ; +-------------------------------------------------------------------------+
       ; |   This file has been generated by The Interactive Disassembler (IDA)    |
       ; |       Copyright (c) 2013 Hex-Rays, <support@hex-rays.com>       |
       ; |          License info:                              |
       ; |                Microsoft                |
       ; +-------------------------------------------------------------------------+
       ;

       ; ---------------------------------------------------------------------------
       ; Format      : Portable executable for 80386 (PE)
       ; Imagebase   : 400000
       ; Section 1. (virtual address 00001000)
       ; Virtual size              : 0002964D ( 169549.)
       ; Section size in file          : 00029800 ( 169984.)
       ; Offset to raw data for section: 00000400
       ; Flags 60000020: Text Executable Readable
       ; Alignment     : default
       ; OS type     :  MS Windows
       ; Application type:  Executable 32bit

               include uni.inc ; see unicode subdir of ida for info on unicode

               .686p
               .mmx
               .model flat

       ; ===========================================================================

Sample from inside a file:

.text:00401080                             ; ---------------------------------------------------------------------------
.text:00401081 CC CC CC CC CC CC CC CC CC CC CC CC CC CC CC            align 10h
.text:00401090 8B 44 24 10                             mov     eax, [esp+10h]
.text:00401094 8B 4C 24 0C                             mov     ecx, [esp+0Ch]
.text:00401098 8B 54 24 08                             mov     edx, [esp+8]
.text:0040109C 56                                  push    esi
.text:0040109D 8B 74 24 08                             mov     esi, [esp+8]
.text:004010A1 50                                  push    eax
.text:004010A2 51                                  push    ecx
.text:004010A3 52                                  push    edx
.text:004010A4 56                                  push    esi
.text:004010A5 E8 18 1E 00 00                              call    _memcpy_s
.text:004010AA 83 C4 10                                add     esp, 10h
.text:004010AD 8B C6                                   mov     eax, esi
.text:004010AF 5E                                  pop     esi
.text:004010B0 C3                                  retn
.text:004010B0                             ; ---------------------------------------------------------------------------

Thanks for any pointers or clues as to the best way to approach this and my apologies if this isn't suitable for this forum.

T. Scharf

1 Answer

As I'm working with the malware samples provided by Kaggle too, I faced the same problem. I solved it by processing the files in two steps, which together extract all the mnemonics used in the complete set.

Note: As I have not finished my work yet, I cannot post the full script. The real implementation uses threading, and the process takes roughly one hour for all 9 families. Additionally, the solution is neither perfect nor fast; it is rather a dirty fix.


1. Step: Roughly clean the IDA listing format of an INPUT.ASM into an OUTPUT.ASM (extract from my script; see the discussion for this step here)

Note: This step ignores data directives such as dd. Additionally, I keep the ==== and ----- lines that delimit subroutines and basic blocks.

grep -E '^\.text:' INPUT.ASM | grep -v align | grep -E '^.{10,15}[0-9A-F]{2} *|=======================|-----------------------------------' | sed 's/\t/           /g' | grep -v ' dq ' | grep -v ' dd ' | grep -v ' db ' | grep -v ' dw ' | cut -c100-200 | sed -e 's/^[ \t]*//' | tr -s '[:blank:]' | cut -d ';' -f1 > OUTPUT.ASM
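For anyone who would rather not depend on fixed column offsets, here is a minimal Python sketch of the same idea. It is not part of my script: the names LINE_RE, DATA_DIRECTIVES and extract_mnemonic are made up for illustration, and the regular expression assumes the line layout shown in the question's sample (section name, 8-digit address, uppercase hex bytes, then the instruction). Real listings may deviate from that.

```python
import re

# Assumed layout, taken from the sample in the question:
#   .text:ADDRESS  HEX BYTES ...  mnemonic operands ; comment
LINE_RE = re.compile(
    r"^\.text:[0-9A-Fa-f]{8}\s+"   # section name and address
    r"(?:[0-9A-F]{2}\s)+\s*"       # raw bytes, two uppercase hex digits each
    r"([a-z][a-z0-9]*)\b"          # first lowercase token after the bytes
)

# Directives that appear in the instruction column but are not instructions.
DATA_DIRECTIVES = {"align", "db", "dw", "dd", "dq", "dt"}


def extract_mnemonic(line, max_len=6):
    """Return the mnemonic on one listing line, or None if there is none.

    max_len mirrors the length filter used in step 2; raise it if longer
    mnemonics should be kept.
    """
    match = LINE_RE.match(line)
    if match is None:
        return None  # comment, delimiter, label or non-.text line
    token = match.group(1)
    if token in DATA_DIRECTIVES or len(token) > max_len:
        return None
    return token


if __name__ == "__main__":
    sample = ".text:004010AA 83 C4 10                                add     esp, 10h"
    print(extract_mnemonic(sample))
```

Lines without opcode bytes (comments, delimiters) simply return None, so the ==== and ----- markers would have to be detected separately if they are needed.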

2. Step: Process the cleaned OUTPUT.ASM in Python (extract from my script)

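#!/usr/bin/python
# Collect the set of distinct mnemonics found in the cleaned listing.
mneLocal = set()
with open('OUTPUT.ASM') as oFile:
    for line in oFile:
        # split() splits on any whitespace, so a trailing \r\n never
        # ends up inside the mnemonic.
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        mne = fields[0]
        # Skip the ==== / ----- delimiters and anything that does not look like a mnemonic.
        if mne[0] not in '-=' and len(mne) <= 6 and not mne[0].isdigit() and mne.islower():
            mneLocal.add(mne)
print(mneLocal)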
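Since I mentioned threading above, this is roughly how the per-file step could be spread over several cleaned listings. It is only a sketch under assumptions, not my actual implementation: mnemonics_in_file, mnemonics_in_files and the workers parameter are illustrative names, and the filter is the same one as in the script above.

```python
from concurrent.futures import ThreadPoolExecutor


def mnemonics_in_file(path, max_len=6):
    """Return the set of candidate mnemonics in one cleaned listing."""
    found = set()
    # errors="replace" keeps odd bytes in a listing from aborting the run.
    with open(path, errors="replace") as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue
            mne = fields[0]
            if (mne[0] not in "-=" and len(mne) <= max_len
                    and not mne[0].isdigit() and mne.islower()):
                found.add(mne)
    return found


def mnemonics_in_files(paths, workers=4):
    """Union of the per-file sets, computed by a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = pool.map(mnemonics_in_file, paths)
        return set().union(*per_file)
```

The parsing itself is CPU-bound, so threads mostly help with file I/O; swapping in ProcessPoolExecutor from the same module (it has the same map interface) may be worth trying.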

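3. Output: Applied to the Ramnit dataset. Entries ending in \r\n are instructions without operands that kept their Windows line ending, so 'cdq\r\n' and 'cdq' are the same mnemonic.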

set(['jns', 'fbstp', 'jnp', 'rol', 'psrlw', 'fld1', 'jnz', 'movd', 'imul', 'lds', 'jnb', 'psrlq', 'cdq', 'psrld', 'pand', 'pfmax', 'ror', 'fxch', 'jno', 'dt', 'fisub', 'movq', 'cmps', 'arpl', 'pi2fd', 'pfmin', 'cld', 'nop', 'pf2id', 'maxss', 'add', 'jcxz', 'adc', 'fadd', 'pf2iw', 'fistp', 'setbe', 'aad', 'maxps', 'fmulp', 'movzx', 'fdivp', 'fdivr', 'femms', 'not', 'repe', 'cmc\r\n', 'svts', 'repne', 'shr', 'pfadd', 'sgdt', 'mulps', 'leave', 'div', 'mulpd', 'shl', 'btc', 'cmp', 'rcpps', 'psubd', 'psubb', 'bts', 'btr', 'loope', 'jle', 'pandn', 'fist', 'out', 'fstcw', 'cbw\r\n', 'xor', 'sub', 'neg', 'rep', 'lddqu', 'jge', 'movs', 'pfrcp', 'fdiv', 'jecxz', 'xchg', 'mul', 'pavgb', 'lea', 'ficom', 'pfsub', 'jz', 'addpd', 'jp', 'subsd', 'js', 'bt', 'fidiv', 'daa\r\n', 'jo', 'clc\r\n', 'lods', 'jg', 'ja', 'jb', 'addps', 'jl', 'cmovz', 'movsd', 'cld\r\n', 'xorpd', 'les', 'cmovl', 'subss', 'movsx', 'xlat', 'cmova', 'cmovb', 'nop\r\n', 'sbb', 'or', 'cmovg', 'shrd', 'fsub', 'por', 'bound', 'pop', 'setnb', 'fmul', 'pabsw', 'subps', 'minsd', 'minss', 'sti\r\n', 'xadd', 'cdq\r\n', 'setnl', 'retf', 'faddp', 'retn', 'rcr', 'rcl', 'pslld', 'call', 'setnz', 'das\r\n', 'aas\r\n', 'setns', 'setnp', 'sldt', 'ptest', 'fcomi', 'divps', 'jmp', 'rcpss', 'ffree', 'lgdt', 'pfacc', 'utes', 'shld', 'fcomp', 'fsave', 'psraw', 'aam', 'subpd', 'fstsw', 'psrad', 'pxor', 'fsubp', 'fsubr', 'fldcw', 'dec', 'fld', 'loop', 'and', 'addsd', 'cmovs', 'fldz', 'psubq', 'sal', 'int', 'lock', 'andpd', 'in', 'fucom', 'ud2\r\n', 'addss', 'fild', 'sar', 'scas', 'psllw', 'andps', 'bswap', 'inc', 'mulss', 'paddd', 'std\r\n', 'paddb', 'psubw', 'stc\r\n', 'idiv', 'psllq', 'paddw', 'cli\r\n', 'mulsd', 'paddq', 'test', 'setp', 'fiadd', 'hnt', 'orpd', 'enter', 'minps', 'bsr', 'mov', 'orps', 'fstp', 'xorps', 'setle', 'bsf', 'fo', 'pfmul', 'movss', 'setb', 'aaa\r\n', 'setl', 'divsd', 'fimul', 'seto', 'fcom', 'hlt\r\n', 'jbe', 'fst', 'divss', 'sets', 'push', 'pavgw', 'setz'])
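With such a set in hand, the statistics the question asks for can be compiled per file. A minimal sketch, assuming the vocabulary has had its line endings stripped first (e.g. {m.strip() for m in mneLocal}) and that the input lines come from a cleaned listing as produced in step 1; mnemonic_counts and feature_vector are hypothetical helper names, not part of my script.

```python
from collections import Counter


def mnemonic_counts(lines, vocabulary):
    """Count how often each vocabulary mnemonic starts a cleaned line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields and fields[0] in vocabulary:
            counts[fields[0]] += 1
    return counts


def feature_vector(lines, vocabulary):
    """Counts in sorted vocabulary order, so vectors from different files line up."""
    counts = mnemonic_counts(lines, vocabulary)
    return [counts[m] for m in sorted(vocabulary)]


if __name__ == "__main__":
    vocab = {"mov", "push", "retn", "xor"}
    listing = ["mov eax, esi", "push esi", "mov ecx, eax", "-----", "retn"]
    print(feature_vector(listing, vocab))  # order: mov, push, retn, xor
```

Using one fixed, sorted vocabulary for every file is what makes the resulting vectors comparable as classifier input.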
knx