IDA Microcode notes

References

Ilfak's presentation at Recon 2018
Microcode in pictures
Hex-Rays Microcode API vs. Obfuscating Compiler
Scripts vds10, vds11, vds12 and vds13 from Hex-Rays SDK

Implementation

Each Microcode instruction is implemented as a microinstruction (minsn_t class) and its operands as microoperands (mop_t class). A microinstruction contains 3 operands: left (l), right (r) and a destination (d). The effect of a microinstruction is defined by its attribute opcode (instance of mcode_t, see enum at the end). An operand can hold different informations depending on its type (number, register, stack variable etc...).

Example of microinstruction: add eax.4, #8.4, ecx.4 (add constant 8 to EAX and place the result into ECX).

Similarly to the CTREE, each instance of minsn_t can be composed of different mop_t and minsn_t (nested) depending on its complexity.

A Microcode basic block is a double linked list of microinstructions -> instance of mblock_t.

A function Microcode object is a double linked list of microblocks -> instance of mbl_array_t (commonly called mba). It keeps also some info about the decompiled code (local variables in mba->vars for example).

A last class mlist_t is used to keep list of locations (heavily used to represent Use-Def lists)

Different maturity levels

There are 8 different maturity levels (from the lowest to the highest):

MMAT_GENERATED: immediately after generation
MMAT_PREOPTIMIZED
MMAT_LOCOPT: after local optimizations
MMAT_CALLS: after analysis of function calls
MMAT_GLBOPT1
MMAT_GLBOPT2
MMAT_GLBOPT3
MMAT_LVARS

MMAT_GENERATED

First one is generated from the assembly listing, the syntax is pretty similar to assembly language. There are mov (operands reversed), push/pop for stack operations, und for unimplemented instructions, call and so long... All in all there are 72 microinstructions (to confirm).

The syntax for the operands is a bit different, a suffix corresponding to the size (in bytes) of the operand is added (ex: eax.4, #0xC5E2B1C.4).

Assembly blocks are split when a call instruction is met -> Microcode basic blocks != assembly basic blocks.

A Use-Def pass is also applied on the assembly, per-instruction and per-block.

Example: here is the first block of a function using some obfuscations techniques.

1. 0 mov    esp.4, ebp.4            ; 10018361 u=esp.4      d=ebp.4
1. 1 mov    #0.1, cf.1              ; 10018371 u=           d=cf.1
1. 2 mov    #0.1, zf.1              ; 10018371 u=           d=zf.1
1. 3 mov    #0.1, of.1              ; 10018371 u=           d=of.1
1. 4 und    sf.1                    ; 10018371 u=           d=sf.1
1. 5 und    pf.1                    ; 10018371 u=           d=pf.1
1. 6 mov    #0.1, cf.1              ; 1001837C u=           d=cf.1
1. 7 mov    #0.1, of.1              ; 1001837C u=           d=of.1
1. 8 und    zf.1                    ; 1001837C u=           d=zf.1
1. 9 und    sf.1                    ; 1001837C u=           d=sf.1
1.10 und    pf.1                    ; 1001837C u=           d=pf.1
1.11 mov    ecx.4, esi.4            ; 1001838E u=ecx.4      d=esi.4
1.12 nop                            ; 10018390 u=
1.13 push   esi.4                   ; 10018390 u=esp.4,esi.4 d=esp.4,sp+C.4
1.14 mov    #0x41B21DBA.4, et1.4    ; 10018391 u=           d=et1.4
1.15 mov    et1.4, esi.4            ; 10018391 u=et1.4      d=esi.4
1.16 mov    #0x15CF5AD2.4, et1.4    ; 10018396 u=           d=et1.4
1.17 mov    et1.4, esi.4            ; 10018396 u=et1.4      d=esi.4
1.18 mov    #0x2E960483.4, et1.4    ; 1001839B u=           d=et1.4
1.19 mov    et1.4, esi.4            ; 1001839B u=et1.4      d=esi.4
1.20 mov    #0x1CBD43F8.4, et1.4    ; 100183A0 u=           d=et1.4
1.21 mov    et1.4, esi.4            ; 100183A0 u=et1.4      d=esi.4
1.22 nop                            ; 100183A5 u=
1.23 pop    esi.4                   ; 100183A5 u=esp.4,sp+C.4 d=esp.4,esi.4
1.24 mov    #0x100183AB.4, ett.4    ; 100183A6 u=           d=ett.4
1.25 nop                            ; 100183A6 u=
1.26 nop                            ; 100183A6 u=
1.27 mov    cs.2, seg.2             ; 100183A6 u=cs.2       d=seg.2
1.28 mov    #0x102287B2.4, eoff.4   ; 100183A6 u=           d=eoff.4
1.29 call   $_rand                  ; 100183A6 u=(rax.8,rcx.8,ebp.4,rdi.8,st0.8,st1.8...)

MMAT_PREOPTIMIZED

Use-Def information is used to:

delete dead code (if the instruction result is not used, the instruction is deleted)
propagate operands and instructions across blocks boundaries
generate assertions for future optimizations -> add new microinstructions to represent assertions (assert in comment)

Example: same block as before.

1. 0 mov    esp.4, ebp.4            ; 10018361 u=esp.4      d=ebp.4
1. 1 mov    #0.1, cf.1              ; 1001837C u=           d=cf.1
1. 2 mov    #0.1, of.1              ; 1001837C u=           d=of.1
1. 3 und    zf.1                    ; 1001837C u=           d=zf.1
1. 4 und    sf.1                    ; 1001837C u=           d=sf.1
1. 5 und    pf.1                    ; 1001837C u=           d=pf.1
1. 6 mov    ecx.4, esi.4            ; 1001838E u=ecx.4      d=esi.4
1. 7 call   $_rand                  ; 100183A6 u=(rax.8,rcx.8,ebp.4,rdi.8,st0.8,st1.8...)

2nd example: second block

2. 0 ldx    ds.2, (esi.4+#0x24.4), eax.4 ; 100184B3 u=esi.4,ds.2,(GLBLOW,LVARS,ARGS,GLBHIGH) d=eax.4
2. 1 add    eax.4, #0xF.4, eax.4    ; 100184B6 u=eax.4      d=eax.4
2. 2 mul    #3.4, eax.4, ecx.4      ; 100184B9 u=eax.4      d=ecx.4
2. 3 mov    #0x2AAAAAAB.4, eax.4    ; 100184BC u=           d=eax.4
2. 4 mul    #2.4, ecx.4, ecx.4      ; 100184C1 u=ecx.4      d=ecx.4
2. 5 mul    xds.8(ecx.4), xds.8(eax.4), rt2.8 ; 100184C3 u=eax.4,ecx.4 d=rt2.8
2. 6 mov    rt2^4.4, edx.4          ; 100184C3 u=rt2^4.4    d=edx.4
2. 7 ldx    ds.2, (esi.4+#0x24.4), eax.4 ; 100184C5 u=esi.4,ds.2,(GLBLOW,LVARS,ARGS,GLBHIGH) d=eax.4
2. 8 mov    edx.4, ecx.4            ; 100184C8 u=edx.4      d=ecx.4
2. 9 shr    ecx.4, #0x1F.1, ecx.4   ; 100184CA u=ecx.4      d=ecx.4
2.10 add    edx.4, ecx.4, ecx.4     ; 100184CD u=edx.4,ecx.4 d=ecx.4
2.11 sub    ecx.4, eax.4, ecx.4     ; 100184CF u=eax.4,ecx.4 d=ecx.4
2.12 setb   ecx.4, #0xF.4, cf.1     ; 100184D1 u=ecx.4      d=cf.1
2.13 seto   ecx.4, #0xF.4, of.1     ; 100184D1 u=ecx.4      d=of.1
2.14 setz   ecx.4, #0xF.4, zf.1     ; 100184D1 u=ecx.4      d=zf.1
2.15 setp   ecx.4, #0xF.4, pf.1     ; 100184D1 u=ecx.4      d=pf.1
2.16 sets   (ecx.4-#0xF.4), sf.1    ; 100184D1 u=ecx.4      d=sf.1
2.17 jcnd   ((sf.1 ^ of.1) | zf.1), $loc_100187A8 ; 100184D4 u=zf.1,sf.1,of.1

MMAT_LOCOPT

Propagation pass again with the new Use-Def listing. Define local variables names for stack pointer offsets.

Example: 2nd example from before (1st one is unchanged)

2. 0 high   (#0x2AAAAAAB.8*xds.8((#6.4*([ds.2:(esi.4+#0x24.4)].4+#0xF.4)))), edx.4 ; 100184C3 u=esi.4,ds.2,(GLBLOW,LVARS,ARGS,GLBHIGH) d=edx.4
2. 1 ldx    ds.2, (esi.4+#0x24.4), eax.4 ; 100184C5 u=esi.4,ds.2,(GLBLOW,LVARS,ARGS,GLBHIGH) d=eax.4
2. 2 sub    ((#6.4*([ds.2:(esi.4+#0x24.4)].4+#0xF.4)) /s #6.4), eax.4, ecx.4 ; 100184CF u=eax.4,esi.4,ds.2,(GLBLOW,LVARS,ARGS,GLBHIGH) d=ecx.4
2. 3 setb   ecx.4, #0xF.4, cf.1     ; 100184D1 u=ecx.4      d=cf.1
2. 4 seto   ecx.4, #0xF.4, of.1     ; 100184D1 u=ecx.4      d=of.1
2. 5 setz   ecx.4, #0xF.4, zf.1     ; 100184D1 u=ecx.4      d=zf.1
2. 6 setp   ecx.4, #0xF.4, pf.1     ; 100184D1 u=ecx.4      d=pf.1
2. 7 sets   (ecx.4-#0xF.4), sf.1    ; 100184D1 u=ecx.4      d=sf.1
2. 8 jle    ecx.4, #0xF.4, @6       ; 100184D4 u=ecx.4

It propagates all the instructions from 0 to 6 in 1 instruction, and same for 8 to 11.

MMAT_CALLS

Propagations pass again + ABI is retrieved for calls so arguments can be identified and new assertions can be made.

Example: first block

1. 0 mov    &(%" s").4, ebp.4       ; 10018361 u=           d=ebp.4
1. 1 mov    ecx.4{1}, esi.4{1}      ; 1001838E u=ecx.4      d=esi.4
1. 2 call   $_rand <cdecl:>.0       ; 100183A6 u=(GLBLOW,sp+2E8..,GLBHIGH) d=(cf.1,zf.1...)

Example: second block

jle    (((#6.4*([ds.2{3}:(esi.4{1}+#0x24.4){4}].4{2}+#0xF.4)) /s #6.4)-[ds.2{3}:(esi.4{1}+#0x24.4){4}].4{2}), #0xF.4, @6

Instruction at 0 was removed because not used anywhere, 1 and 2 have been propagated into 8 instruction. Flags are not used so they were removed as well.

MMAT_GLBOPT1

Blocks are merged (if split because of calls) and function prologue/epilogue are removed.

Example: 2 first blocks

1. 0 mov    ecx.4{1}, esi.4{1}      ; 1001838E u=ecx.4      d=esi.4
1. 1 call   $_rand <cdecl:>.0       ; 100183A6 u=(GLBLOW,GLBHIGH) d=(cf.1,zf.1...)
1. 2 jle    (((#6.4*([ds.2{3}:(esi.4{1}+#0x24.4){4}].4{2}+#0xF.4)) /s #6.4)-[ds.2{3}:(esi.4{1}+#0x24.4){4}].4{2}), #0xF.4, @3

MMAT_GLBOPT2

No differences observed.

MMAT_GLBOPT3

No differences observed.

MMAT_LVARS

SSA pass is applied and registers considered as local variables.