@wideglide
Created February 24, 2022 17:43
Identified issues with evaluating function boundaries of disassemblers.
  1. Definition of function start and end.
  2. Functions with exception handlers.
  3. Functions like thunk functions (name?).

These examples are from the ByteWeight dataset using the file bap-dataset/pe-x86-64/binary/msvs_whatever_64_O2_vim.

Functions where end == start

The function bounds provided (ground truth) are:

 START     END
 14002056c 1400205ad
 1400205b0 1400205c8
 1400205c8 1400206f8
 1400206f8 140020700
 140020700 1400208cf

The END is the first byte after the last instruction of the function. When there is no gap between functions, FN[n].END == FN[n+1].START. This is true for the function that starts @ 0x1400206f8. In the example binary (vim) there are 302 occurrences where end == start; in every one of them, the first byte of a function ends up labeled as a function end.
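
The end == start condition can be checked mechanically. The following is a minimal sketch, not the script that produced the count of 302: it only uses the five bounds from the table above, and the helper name `adjacent_pairs` is made up for illustration.

```python
# Ground-truth bounds copied from the table above (END is exclusive).
bounds = [
    (0x14002056C, 0x1400205AD),
    (0x1400205B0, 0x1400205C8),
    (0x1400205C8, 0x1400206F8),
    (0x1400206F8, 0x140020700),
    (0x140020700, 0x1400208CF),
]


def adjacent_pairs(bounds):
    """Return (end, next_start) pairs where FN[n].END == FN[n+1].START."""
    ordered = sorted(bounds)
    return [
        (end, next_start)
        for (_, end), (next_start, _) in zip(ordered, ordered[1:])
        if end == next_start
    ]


for end, _ in adjacent_pairs(bounds):
    print(f"end == start at {end:#x}")
```

For this five-entry excerpt the check reports three addresses: 0x1400205c8, 0x1400206f8 and 0x140020700.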

Unfortunately, if you choose to give each byte exactly one label, you cannot represent this ground truth accurately unless you mark the function end as the last byte of the last instruction. The labels used in the listing below are:

  • FunctionStart (F)
  • FunctionEnd (E)
  • FunctionBody
   140020592:   3d 22 05 93 19          cmp    eax,0x19930522
   140020597:   74 07                   je     0x1400205a0
   140020599:   3d 00 40 99 01          cmp    eax,0x1994000
   14002059e:   75 06                   jne    0x1400205a6
   1400205a0:   e8 d7 b3 ff ff          call   0x14001b97c
   1400205a5:   cc                      int3
   1400205a6:   33 c0                   xor    eax,eax
   1400205a8:   48 83 c4 28             add    rsp,0x28
   1400205ac:   c3                      ret
E  1400205ad:   cc                      int3
   1400205ae:   cc                      int3
   1400205af:   cc                      int3
F  1400205b0:   48 83 ec 28             sub    rsp,0x28
   1400205b4:   48 8d 0d b1 ff ff ff    lea    rcx,[rip+0xffffffffffffffb1]        # 0x14002056c
   1400205bb:   ff 15 c7 1f 1e 00       call   QWORD PTR [rip+0x1e1fc7]        # 0x140202588
   1400205c1:   33 c0                   xor    eax,eax
   1400205c3:   48 83 c4 28             add    rsp,0x28
   1400205c7:   c3                      ret
EF 1400205c8:   48 89 5c 24 08          mov    QWORD PTR [rsp+0x8],rbx
   1400205cd:   48 89 6c 24 10          mov    QWORD PTR [rsp+0x10],rbp
   1400205d2:   48 89 74 24 18          mov    QWORD PTR [rsp+0x18],rsi
   1400205d7:   57                      push   rdi
   1400205d8:   48 83 ec 30             sub    rsp,0x30
   1400205dc:   83 3d c5 70 25 00 00    cmp    DWORD PTR [rip+0x2570c5],0x0        # 0x1402776a8
   1400205e3:   75 05                   jne    0x1400205ea
   1400205e5:   e8 0e 41 ff ff          call   0x1400146f8

The same snippet from radare2:

     ││││   0x140020592      3d22059319     cmp eax, 0x19930522
 ┌─────< 0x140020597      7407           je 0x1400205a0
 │││││   0x140020599      3d00409901     cmp eax, 0x1994000
┌──────< 0x14002059e      7506           jne 0x1400205a6
││││││   ; CODE XREFS from fcn.140020518 @ +0x71, +0x78, +0x7f
│└└└───> 0x1400205a0      e8d7b3ffff     call fcn.14001b97c         ;[1]
│   ││   0x1400205a5      cc             int3
│   ││   ; CODE XREFS from fcn.140020518 @ +0x61, +0x67, +0x86
└───└└─> 0x1400205a6      33c0           xor eax, eax
         0x1400205a8      4883c428       add rsp, 0x28
         0x1400205ac      c3             ret
         0x1400205ad      cc             int3
         0x1400205ae      cc             int3
         0x1400205af      cc             int3
         0x1400205b0      4883ec28       sub rsp, 0x28
         0x1400205b4      488d0db1ffff.  lea rcx, [0x14002056c]
         0x1400205bb      ff15c71f1e00   call qword [sym.imp.KERNEL32.dll_SetUnhandledExceptionFilter] ;[2] ; [0x140202588:8]=0x230436 reloc.KERNEL32.dll_SetUnhandledExceptionFilter ; "6\x04#" ; LPTOP_LEVEL_EXCEPTION_FILTER SetUnhandledExceptionFilter(LPTOP_LEVEL_EXCEPTION_FILTER lpTopLevelExceptionFilter)
         0x1400205c1      33c0           xor eax, eax
         0x1400205c3      4883c428       add rsp, 0x28
         0x1400205c7      c3             ret
         ; CALL XREF from entry0 @ 0x14000e163
304: fcn.1400205c8 ();
           ; var int64_t var_20h @ rsp+0x20
           ; var int64_t var_8h @ rsp+0x40
           ; var int64_t var_10h @ rsp+0x48
           ; var int64_t var_18h @ rsp+0x50
0x1400205c8      48895c2408     mov qword [var_8h], rbx
0x1400205cd      48896c2410     mov qword [var_10h], rbp
0x1400205d2      4889742418     mov qword [var_18h], rsi
0x1400205d7      57             push rdi
0x1400205d8      4883ec30       sub rsp, 0x30
0x1400205dc      833dc5702500.  cmp dword [0x1402776a8], 0    ; [0x1402776a8:4]=0
│       ┌─< 0x1400205e3      7505           jne 0x1400205ea
│       │   0x1400205e5      e80e41ffff     call fcn.1400146f8         ;[3]
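
To make the labeling problem from this section concrete, here is a small sketch that labels bytes under both END conventions and reports addresses that would need two labels. It assumes the five bounds from the ground-truth table; `label_bytes` and `conflicts` are hypothetical helper names, not part of any dataset tooling.

```python
from collections import defaultdict

# Ground-truth bounds from the table earlier in this section (END is exclusive).
bounds = [
    (0x14002056C, 0x1400205AD),
    (0x1400205B0, 0x1400205C8),
    (0x1400205C8, 0x1400206F8),
    (0x1400206F8, 0x140020700),
    (0x140020700, 0x1400208CF),
]


def label_bytes(bounds, end_inclusive):
    """Map address -> set of labels ('F', 'E') under one END convention.

    end_inclusive=False puts E on the exclusive END address (first byte after
    the function); end_inclusive=True puts E on the last byte (END - 1).
    """
    labels = defaultdict(set)
    for start, end in bounds:
        labels[start].add("F")
        labels[end - 1 if end_inclusive else end].add("E")
    return labels


def conflicts(labels):
    """Addresses that would need more than one label."""
    return sorted(addr for addr, tags in labels.items() if len(tags) > 1)


print([hex(a) for a in conflicts(label_bytes(bounds, end_inclusive=False))])
print([hex(a) for a in conflicts(label_bytes(bounds, end_inclusive=True))])
```

With the exclusive END, the three adjacent boundaries each need both E and F (the `EF` row in the listing above). With END moved to the last byte, none of these five functions conflict. Note that a one-byte function would still carry both labels on its single byte under that convention.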

Functions with exception handlers

IDA identifies certain code blocks as exception handlers and assigns them to a parent function. Radare2 does not identify these as functions.

Here is an example:

The ground truth marks the block on the right as a function, which is reasonable in this case.

 START     END
 1402019b7 1402019d0
 1402019d0 1402019eb
 1402019eb 140201a06   <<==
 140201a06 140201a21
 140201a21 140201a3c

image

Functions with unconditional jumps?

For an unknown reason, this example is defined as two functions. I don't know how or why you would define a function this way.

Provided ground truth in BAP:

 START     END
 14001ecb0 14001ecc8
 14001ecd0 14001ecd1
            ; CALL XREF from fcn.140013358 @ 0x140013439
25: fcn.14001ecb0 (int64_t arg1, int64_t arg2, int64_t arg3);
           ; var int64_t var_8h @ rsp+0x8
           ; var int64_t var_10h @ rsp+0x10
           ; var int64_t var_18h @ rsp+0x18
           ; arg int64_t arg1 @ rcx
           ; arg int64_t arg2 @ rdx
           ; arg int64_t arg3 @ r8
0x14001ecb0      48894c2408     mov qword [var_8h], rcx    ; arg1
0x14001ecb5      4889542418     mov qword [var_18h], rdx    ; arg2
0x14001ecba      4489442410     mov dword [var_10h], r8d    ; arg3
0x14001ecbf      49c7c1200593.  mov r9, 0x19930520
│       ┌─< 0x14001ecc6      eb08           jmp 0x14001ecd0
0x14001ecc8      cc             int3
0x14001ecc9      cc             int3
0x14001ecca      cc             int3
0x14001eccb      cc             int3
0x14001eccc      cc             int3
0x14001eccd      cc             int3
0x14001ecce      6690           nop
│       │   ; CODE XREF from fcn.14001ecb0 @ 0x14001ecc6
└       └─> 0x14001ecd0      c3             ret
            0x14001ecd1      cc             int3
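
As a sanity check on the two ranges, the sketch below decodes the `eb 08` short jump by hand and tests whether it lands on a ground-truth START. It does not explain why the ground truth was split this way; it only confirms that the first range ends right after the jmp and that the jmp targets the second range. The helper name `short_jmp_target` is an assumption for illustration.

```python
def short_jmp_target(addr, insn):
    """Target of an x86 `jmp rel8` (opcode 0xEB) located at `addr`."""
    if len(insn) != 2 or insn[0] != 0xEB:
        raise ValueError("not a 2-byte short jmp")
    # rel8 is a signed displacement relative to the next instruction.
    rel8 = int.from_bytes(insn[1:], "little", signed=True)
    return addr + len(insn) + rel8


# The two ground-truth ranges listed above.
ground_truth = [(0x14001ECB0, 0x14001ECC8), (0x14001ECD0, 0x14001ECD1)]
starts = {start for start, _ in ground_truth}

# Address and bytes of the jmp line in the radare2 output.
target = short_jmp_target(0x14001ECC6, bytes.fromhex("eb08"))
print(hex(target), target in starts)
```

The jmp at 0x14001ecc6 is two bytes long, so the first range's END of 0x14001ecc8 is the byte right after it, and the computed target 0x14001ecd0 is exactly the START of the second range.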

Functions with fewer than 3 instructions

Here's an example of a series of "functions" defined in the ground truth. All these functions set edx to a value then jump to the remainder of the function, which is also defined as a function in the ground truth. Should all these 'thunks' be defined as functions? If so, what is their ground truth end? Currently, it is defined as the first byte after the jmp.

ground truth:

 140017b3c 140017b46
 140017b48 140017b52
 140017b54 140017b5e
 140017b60 140017b6a
 140017b6c 140017b76
 140017b78 140017b82
 140017b84 140017b8e
 140017b90 140017b9a
 140017b9c 140017ba6
 140017ba8 140017bb2
 140017bb4 140017bbe
 140017bc0 140017bca
 140017bcc 140017bd6
 140017bd8 140017be2
 140017be4 140017bee
 140017bf0 140017bfa
 140017bfc 140017c06
 140017c08 140017c12
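
The regularity of these entries is easy to verify. This is a rough sketch that parses the list above and computes each entry's size and the gap to the next entry; the variable names are arbitrary.

```python
# START/END pairs copied verbatim from the ground-truth list above.
GROUND_TRUTH = """
140017b3c 140017b46
140017b48 140017b52
140017b54 140017b5e
140017b60 140017b6a
140017b6c 140017b76
140017b78 140017b82
140017b84 140017b8e
140017b90 140017b9a
140017b9c 140017ba6
140017ba8 140017bb2
140017bb4 140017bbe
140017bc0 140017bca
140017bcc 140017bd6
140017bd8 140017be2
140017be4 140017bee
140017bf0 140017bfa
140017bfc 140017c06
140017c08 140017c12
"""

pairs = [
    tuple(int(field, 16) for field in line.split())
    for line in GROUND_TRUTH.splitlines()
    if line.strip()
]

# Size of each entry (END is exclusive) and gap to the next START.
sizes = {end - start for start, end in pairs}
gaps = {nxt - end for (_, end), (nxt, _) in zip(pairs, pairs[1:])}

print(len(pairs), sizes, gaps)
```

Every one of the 18 entries is 10 bytes long and is followed by a 2-byte gap, which is consistent with a 5-byte `mov edx, imm32`, a 5-byte `jmp rel32`, and two `int3` padding bytes.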

radare2 output:

  ╎╎╎╎╎╎╎   0x140017b3c      ba03010000     mov edx, 0x103             ; 259
  ────────< 0x140017b41      e9725bffff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017b46      cc             int3
  ╎╎╎╎╎╎╎   0x140017b47      cc             int3
  ╎╎╎╎╎╎╎   0x140017b48      ba01000000     mov edx, 1
  ────────< 0x140017b4d      e9665bffff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017b52      cc             int3
  ╎╎╎╎╎╎╎   0x140017b53      cc             int3
  ╎╎╎╎╎╎╎   0x140017b54      ba01000000     mov edx, 1
  ────────< 0x140017b59      e95a5bffff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017b5e      cc             int3
  ╎╎╎╎╎╎╎   0x140017b5f      cc             int3
  ╎╎╎╎╎╎╎   0x140017b60      ba02000000     mov edx, 2
  ────────< 0x140017b65      e94e5bffff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017b6a      cc             int3
  ╎╎╎╎╎╎╎   0x140017b6b      cc             int3
  ╎╎╎╎╎╎╎   0x140017b6c      ba02000000     mov edx, 2
  ────────< 0x140017b71      e9425bffff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017b76      cc             int3
  ╎╎╎╎╎╎╎   0x140017b77      cc             int3
  ╎╎╎╎╎╎╎   0x140017b78      ba04000000     mov edx, 4
  ────────< 0x140017b7d      e9365bffff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017b82      cc             int3
  ╎╎╎╎╎╎╎   0x140017b83      cc             int3
  ╎╎╎╎╎╎╎   0x140017b84      ba04000000     mov edx, 4
  ────────< 0x140017b89      e92a5bffff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017b8e      cc             int3
  ╎╎╎╎╎╎╎   0x140017b8f      cc             int3
  ╎╎╎╎╎╎╎   0x140017b90      ba80000000     mov edx, 0x80              ; 128
  ────────< 0x140017b95      e91e5bffff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017b9a      cc             int3
  ╎╎╎╎╎╎╎   0x140017b9b      cc             int3
  ╎╎╎╎╎╎╎   0x140017b9c      ba80000000     mov edx, 0x80              ; 128
  ────────< 0x140017ba1      e9125bffff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017ba6      cc             int3
  ╎╎╎╎╎╎╎   0x140017ba7      cc             int3
  ╎╎╎╎╎╎╎   0x140017ba8      ba08000000     mov edx, 8
  ────────< 0x140017bad      e9065bffff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017bb2      cc             int3
  ╎╎╎╎╎╎╎   0x140017bb3      cc             int3
  ╎╎╎╎╎╎╎   0x140017bb4      ba08000000     mov edx, 8
  ────────< 0x140017bb9      e9fa5affff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017bbe      cc             int3
  ╎╎╎╎╎╎╎   0x140017bbf      cc             int3
  ╎╎╎╎╎╎╎   0x140017bc0      ba10000000     mov edx, 0x10              ; 16
  ────────< 0x140017bc5      e9ee5affff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017bca      cc             int3
  ╎╎╎╎╎╎╎   0x140017bcb      cc             int3
  ╎╎╎╎╎╎╎   0x140017bcc      ba10000000     mov edx, 0x10              ; 16
  ────────< 0x140017bd1      e9e25affff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017bd6      cc             int3
  ╎╎╎╎╎╎╎   0x140017bd7      cc             int3
  ╎╎╎╎╎╎╎   0x140017bd8      ba07010000     mov edx, 0x107             ; 263
  ────────< 0x140017bdd      e9d65affff     jmp fcn.14000d6b8
  ╎╎╎╎╎╎╎   0x140017be2      cc             int3
  ╎╎╎╎╎╎╎   0x140017be3      cc             int3
  ╎╎╎╎╎╎╎   0x140017be4      ba07010000     mov edx, 0x107             ; 263
  └───────< 0x140017be9      e9ca5affff     jmp fcn.14000d6b8
   ╎╎╎╎╎╎   0x140017bee      cc             int3
   ╎╎╎╎╎╎   0x140017bef      cc             int3
   ╎╎╎╎╎╎   0x140017bf0      ba57010000     mov edx, 0x157             ; 343
   └──────< 0x140017bf5      e9be5affff     jmp fcn.14000d6b8
    ╎╎╎╎╎   0x140017bfa      cc             int3
    ╎╎╎╎╎   0x140017bfb      cc             int3
    ╎╎╎╎╎   0x140017bfc      ba57010000     mov edx, 0x157             ; 343
    └─────< 0x140017c01      e9b25affff     jmp fcn.14000d6b8
     ╎╎╎╎   0x140017c06      cc             int3
     ╎╎╎╎   0x140017c07      cc             int3
     ╎╎╎╎   0x140017c08      ba17010000     mov edx, 0x117             ; 279
     └────< 0x140017c0d      e9a65affff     jmp fcn.14000d6b8
      ╎╎╎   0x140017c12      cc             int3
      ╎╎╎   0x140017c13      cc             int3
      ╎╎╎   0x140017c14      ba17010000     mov edx, 0x117             ; 279
      └───< 0x140017c19      e99a5affff     jmp fcn.14000d6b8
       ╎╎   0x140017c1e      cc             int3
       ╎╎   0x140017c1f      cc             int3
       ╎╎   0x140017c20      ba20000000     mov edx, 0x20              ; 32
       └──< 0x140017c25      e98e5affff     jmp fcn.14000d6b8
        ╎   0x140017c2a      cc             int3
        ╎   0x140017c2b      cc             int3
        ╎   0x140017c2c      ba20000000     mov edx, 0x20              ; 32
        └─< 0x140017c31      e9825affff     jmp fcn.14000d6b8
            0x140017c36      cc             int3
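
If these stubs are to be treated specially, one possible heuristic is to decode the trailing `jmp rel32` and check whether all stubs share a target. The sketch below does that decoding by hand for a few of the jmp lines above; `near_jmp_target` is a made-up helper, and a real evaluation would more likely rely on a disassembler library.

```python
def near_jmp_target(addr, insn):
    """Target of an x86 `jmp rel32` (opcode 0xE9) located at `addr`."""
    if len(insn) != 5 or insn[0] != 0xE9:
        raise ValueError("not a 5-byte near jmp")
    # rel32 is a signed little-endian displacement from the next instruction.
    rel32 = int.from_bytes(insn[1:], "little", signed=True)
    return addr + len(insn) + rel32


# (address, bytes) pairs copied from the first three and the last jmp lines
# of the radare2 output above.
jumps = [
    (0x140017B41, "e9725bffff"),
    (0x140017B4D, "e9665bffff"),
    (0x140017B59, "e95a5bffff"),
    (0x140017C31, "e9825affff"),
]

targets = {near_jmp_target(addr, bytes.fromhex(raw)) for addr, raw in jumps}
print([hex(t) for t in targets])
```

All four decoded jumps resolve to 0x14000d6b8, matching the `fcn.14000d6b8` label that radare2 prints.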