- Definition of function start and end.
- Functions with exception handlers.
- Functions like thunk functions (name?).
These examples are from the ByteWeight dataset, using the file `bap-dataset/pe-x86-64/binary/msvs_whatever_64_O2_vim`.
The function bounds provided (ground truth) are:
START END
14002056c 1400205ad
1400205b0 1400205c8
1400205c8 1400206f8
1400206f8 140020700
140020700 1400208cf
The `END` is the first byte after the last instruction of the function.
In cases where there is no gap between functions, `FN[n].END == FN[n+1].START`.
This is true for the function that starts @ 0x1400206f8.
In the example binary (`vim`) there are 302 occurrences where end == start; in all of these, the first byte of a function defaults to being labeled as a function end.
Unfortunately, if you choose to label each byte with exactly one label, you cannot represent the ground truth accurately unless you mark the function end as the last byte of the last instruction.
- FunctionStart (F)
- FunctionEnd (E)
- FunctionBody
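The label collision can be reproduced from the five ground-truth ranges listed above. This is a minimal sketch, not part of the dataset tooling: `label_bytes` and `collisions` are hypothetical helper names, and the only data used is the START/END table from this document.

```python
from collections import defaultdict

# (START, END) pairs copied from the ground-truth table above.
# END is exclusive: the first byte after the last instruction.
BOUNDS = [
    (0x14002056C, 0x1400205AD),
    (0x1400205B0, 0x1400205C8),
    (0x1400205C8, 0x1400206F8),
    (0x1400206F8, 0x140020700),
    (0x140020700, 0x1400208CF),
]

def label_bytes(bounds, end_is_last_byte=False):
    """Map each start/end address to the set of labels it receives."""
    labels = defaultdict(set)
    for start, end in bounds:
        labels[start].add("F")
        # Either label the byte after the function (dataset convention)
        # or the last byte that belongs to the function.
        labels[end - 1 if end_is_last_byte else end].add("E")
    return labels

def collisions(labels):
    """Addresses that would need both F and E at the same time."""
    return sorted(a for a, tags in labels.items() if tags == {"F", "E"})

exclusive = collisions(label_bytes(BOUNDS))
inclusive = collisions(label_bytes(BOUNDS, end_is_last_byte=True))
print([hex(a) for a in exclusive])
print([hex(a) for a in inclusive])
```

With the exclusive convention, three of the five start addresses also carry an end label; with the last-byte convention, none of these ranges collide. Note that a one-byte function would still receive both labels on its single byte under either convention.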
140020592: 3d 22 05 93 19 cmp eax,0x19930522
140020597: 74 07 je 0x1400205a0
140020599: 3d 00 40 99 01 cmp eax,0x1994000
14002059e: 75 06 jne 0x1400205a6
1400205a0: e8 d7 b3 ff ff call 0x14001b97c
1400205a5: cc int3
1400205a6: 33 c0 xor eax,eax
1400205a8: 48 83 c4 28 add rsp,0x28
1400205ac: c3 ret
E 1400205ad: cc int3
1400205ae: cc int3
1400205af: cc int3
F 1400205b0: 48 83 ec 28 sub rsp,0x28
1400205b4: 48 8d 0d b1 ff ff ff lea rcx,[rip+0xffffffffffffffb1] # 0x14002056c
1400205bb: ff 15 c7 1f 1e 00 call QWORD PTR [rip+0x1e1fc7] # 0x140202588
1400205c1: 33 c0 xor eax,eax
1400205c3: 48 83 c4 28 add rsp,0x28
1400205c7: c3 ret
EF 1400205c8: 48 89 5c 24 08 mov QWORD PTR [rsp+0x8],rbx
1400205cd: 48 89 6c 24 10 mov QWORD PTR [rsp+0x10],rbp
1400205d2: 48 89 74 24 18 mov QWORD PTR [rsp+0x18],rsi
1400205d7: 57 push rdi
1400205d8: 48 83 ec 30 sub rsp,0x30
1400205dc: 83 3d c5 70 25 00 00 cmp DWORD PTR [rip+0x2570c5],0x0 # 0x1402776a8
1400205e3: 75 05 jne 0x1400205ea
1400205e5: e8 0e 41 ff ff call 0x1400146f8
The same snippet from radare2:
││││ 0x140020592 3d22059319 cmp eax, 0x19930522
┌─────< 0x140020597 7407 je 0x1400205a0
│││││ 0x140020599 3d00409901 cmp eax, 0x1994000
┌──────< 0x14002059e 7506 jne 0x1400205a6
││││││ ; CODE XREFS from fcn.140020518 @ +0x71, +0x78, +0x7f
│└└└───> 0x1400205a0 e8d7b3ffff call fcn.14001b97c ;[1]
│ ││ 0x1400205a5 cc int3
│ ││ ; CODE XREFS from fcn.140020518 @ +0x61, +0x67, +0x86
└───└└─> 0x1400205a6 33c0 xor eax, eax
0x1400205a8 4883c428 add rsp, 0x28
0x1400205ac c3 ret
0x1400205ad cc int3
0x1400205ae cc int3
0x1400205af cc int3
0x1400205b0 4883ec28 sub rsp, 0x28
0x1400205b4 488d0db1ffff. lea rcx, [0x14002056c]
0x1400205bb ff15c71f1e00 call qword [sym.imp.KERNEL32.dll_SetUnhandledExceptionFilter] ;[2] ; [0x140202588:8]=0x230436 reloc.KERNEL32.dll_SetUnhandledExceptionFilter ; "6\x04#" ; LPTOP_LEVEL_EXCEPTION_FILTER SetUnhandledExceptionFilter(LPTOP_LEVEL_EXCEPTION_FILTER lpTopLevelExceptionFilter)
0x1400205c1 33c0 xor eax, eax
0x1400205c3 4883c428 add rsp, 0x28
0x1400205c7 c3 ret
; CALL XREF from entry0 @ 0x14000e163
┌ 304: fcn.1400205c8 ();
│ ; var int64_t var_20h @ rsp+0x20
│ ; var int64_t var_8h @ rsp+0x40
│ ; var int64_t var_10h @ rsp+0x48
│ ; var int64_t var_18h @ rsp+0x50
│ 0x1400205c8 48895c2408 mov qword [var_8h], rbx
│ 0x1400205cd 48896c2410 mov qword [var_10h], rbp
│ 0x1400205d2 4889742418 mov qword [var_18h], rsi
│ 0x1400205d7 57 push rdi
│ 0x1400205d8 4883ec30 sub rsp, 0x30
│ 0x1400205dc 833dc5702500. cmp dword [0x1402776a8], 0 ; [0x1402776a8:4]=0
│ ┌─< 0x1400205e3 7505 jne 0x1400205ea
│ │ 0x1400205e5 e80e41ffff call fcn.1400146f8 ;[3]
IDA identifies certain code blocks as exception handlers and assigns them to a parent function. Radare2 does not identify these as functions.
Here is an example:
The ground truth marks the block on the right as a function, which is reasonable in this case.
START END
1402019b7 1402019d0
1402019d0 1402019eb
1402019eb 140201a06 <<==
140201a06 140201a21
140201a21 140201a3c
For an unknown reason, this example is defined as two functions. I don't know how or why you would define a function this way.
Provided ground truth in BAP:
START END
14001ecb0 14001ecc8
14001ecd0 14001ecd1
; CALL XREF from fcn.140013358 @ 0x140013439
┌ 25: fcn.14001ecb0 (int64_t arg1, int64_t arg2, int64_t arg3);
│ ; var int64_t var_8h @ rsp+0x8
│ ; var int64_t var_10h @ rsp+0x10
│ ; var int64_t var_18h @ rsp+0x18
│ ; arg int64_t arg1 @ rcx
│ ; arg int64_t arg2 @ rdx
│ ; arg int64_t arg3 @ r8
│ 0x14001ecb0 48894c2408 mov qword [var_8h], rcx ; arg1
│ 0x14001ecb5 4889542418 mov qword [var_18h], rdx ; arg2
│ 0x14001ecba 4489442410 mov dword [var_10h], r8d ; arg3
│ 0x14001ecbf 49c7c1200593. mov r9, 0x19930520
│ ┌─< 0x14001ecc6 eb08 jmp 0x14001ecd0
│ 0x14001ecc8 cc int3
│ 0x14001ecc9 cc int3
│ 0x14001ecca cc int3
│ 0x14001eccb cc int3
│ 0x14001eccc cc int3
│ 0x14001eccd cc int3
│ 0x14001ecce 6690 nop
│ │ ; CODE XREF from fcn.14001ecb0 @ 0x14001ecc6
└ └─> 0x14001ecd0 c3 ret
0x14001ecd1 cc int3
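One way to see how the two ground-truth ranges relate is to decode the `jmp` at `0x14001ecc6` by hand. The sketch below assumes only the bytes shown in the listing above (`eb08`); `short_jmp_target` is a hypothetical helper, not part of BAP or radare2.

```python
def short_jmp_target(code: bytes, addr: int):
    """Return the target of a 2-byte `jmp rel8` (opcode 0xEB), or None."""
    if len(code) < 2 or code[0] != 0xEB:
        return None
    # rel8 is a signed displacement relative to the next instruction.
    rel = int.from_bytes(code[1:2], "little", signed=True)
    return addr + 2 + rel

# Ground-truth ranges from the table above (END is exclusive).
FIRST = (0x14001ECB0, 0x14001ECC8)
SECOND = (0x14001ECD0, 0x14001ECD1)

JMP_ADDR = 0x14001ECC6
target = short_jmp_target(bytes.fromhex("eb08"), JMP_ADDR)

jmp_ends_first = JMP_ADDR + 2 == FIRST[1]   # first range ends right after the jmp
lands_on_second = target == SECOND[0]       # jmp goes to the start of the second range
print(hex(target), jmp_ends_first, lands_on_second)
```

So the first range ends directly after the `jmp`, and the second range is exactly the single `ret` that the `jmp` lands on, with eight bytes of padding in between.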
Here's an example of a series of "functions" defined in the ground truth.
All of these functions set `edx` to a value and then jump to the remainder of the function, which is also defined as a function in the ground truth.
Should all these 'thunks' be defined as functions?
If so, what is their ground truth end?
Currently, it is defined as the first byte after the `jmp`.
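As a rough illustration of the pattern, the sketch below recognises the 10-byte sequence `mov edx, imm32; jmp rel32` and reports the value loaded, the jump target, and the first byte after the `jmp`. The function name `decode_mov_edx_jmp` is hypothetical, and the byte string is taken from the first entry of the radare2 output in this section; this is not how the ground truth was produced.

```python
import struct

def decode_mov_edx_jmp(code: bytes, addr: int):
    """Decode `mov edx, imm32; jmp rel32`; return None if the bytes differ.

    Returns (imm, jmp_target, end) where `end` is the first byte
    after the jmp, matching the END convention of the ground truth.
    """
    if len(code) < 10 or code[0] != 0xBA or code[5] != 0xE9:
        return None
    (imm,) = struct.unpack_from("<I", code, 1)   # unsigned immediate
    (rel,) = struct.unpack_from("<i", code, 6)   # signed displacement
    end = addr + 10                              # 5-byte mov + 5-byte jmp
    return imm, end + rel, end

# First thunk: ba03010000 / e9725bffff at 0x140017b3c.
thunk = decode_mov_edx_jmp(bytes.fromhex("ba03010000e9725bffff"), 0x140017B3C)
imm, target, end = thunk
print(hex(imm), hex(target), hex(end))
```

For the first entry this yields an end of `0x140017b46`, which matches the ground-truth END for `140017b3c`, and a jump target of `0x14000d6b8`, the same `fcn.14000d6b8` that every thunk in the listing jumps to.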
ground truth:
140017b3c 140017b46
140017b48 140017b52
140017b54 140017b5e
140017b60 140017b6a
140017b6c 140017b76
140017b78 140017b82
140017b84 140017b8e
140017b90 140017b9a
140017b9c 140017ba6
140017ba8 140017bb2
140017bb4 140017bbe
140017bc0 140017bca
140017bcc 140017bd6
140017bd8 140017be2
140017be4 140017bee
140017bf0 140017bfa
140017bfc 140017c06
140017c08 140017c12
radare2 output:
╎╎╎╎╎╎╎ 0x140017b3c ba03010000 mov edx, 0x103 ; 259
────────< 0x140017b41 e9725bffff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017b46 cc int3
╎╎╎╎╎╎╎ 0x140017b47 cc int3
╎╎╎╎╎╎╎ 0x140017b48 ba01000000 mov edx, 1
────────< 0x140017b4d e9665bffff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017b52 cc int3
╎╎╎╎╎╎╎ 0x140017b53 cc int3
╎╎╎╎╎╎╎ 0x140017b54 ba01000000 mov edx, 1
────────< 0x140017b59 e95a5bffff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017b5e cc int3
╎╎╎╎╎╎╎ 0x140017b5f cc int3
╎╎╎╎╎╎╎ 0x140017b60 ba02000000 mov edx, 2
────────< 0x140017b65 e94e5bffff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017b6a cc int3
╎╎╎╎╎╎╎ 0x140017b6b cc int3
╎╎╎╎╎╎╎ 0x140017b6c ba02000000 mov edx, 2
────────< 0x140017b71 e9425bffff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017b76 cc int3
╎╎╎╎╎╎╎ 0x140017b77 cc int3
╎╎╎╎╎╎╎ 0x140017b78 ba04000000 mov edx, 4
────────< 0x140017b7d e9365bffff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017b82 cc int3
╎╎╎╎╎╎╎ 0x140017b83 cc int3
╎╎╎╎╎╎╎ 0x140017b84 ba04000000 mov edx, 4
────────< 0x140017b89 e92a5bffff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017b8e cc int3
╎╎╎╎╎╎╎ 0x140017b8f cc int3
╎╎╎╎╎╎╎ 0x140017b90 ba80000000 mov edx, 0x80 ; 128
────────< 0x140017b95 e91e5bffff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017b9a cc int3
╎╎╎╎╎╎╎ 0x140017b9b cc int3
╎╎╎╎╎╎╎ 0x140017b9c ba80000000 mov edx, 0x80 ; 128
────────< 0x140017ba1 e9125bffff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017ba6 cc int3
╎╎╎╎╎╎╎ 0x140017ba7 cc int3
╎╎╎╎╎╎╎ 0x140017ba8 ba08000000 mov edx, 8
────────< 0x140017bad e9065bffff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017bb2 cc int3
╎╎╎╎╎╎╎ 0x140017bb3 cc int3
╎╎╎╎╎╎╎ 0x140017bb4 ba08000000 mov edx, 8
────────< 0x140017bb9 e9fa5affff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017bbe cc int3
╎╎╎╎╎╎╎ 0x140017bbf cc int3
╎╎╎╎╎╎╎ 0x140017bc0 ba10000000 mov edx, 0x10 ; 16
────────< 0x140017bc5 e9ee5affff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017bca cc int3
╎╎╎╎╎╎╎ 0x140017bcb cc int3
╎╎╎╎╎╎╎ 0x140017bcc ba10000000 mov edx, 0x10 ; 16
────────< 0x140017bd1 e9e25affff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017bd6 cc int3
╎╎╎╎╎╎╎ 0x140017bd7 cc int3
╎╎╎╎╎╎╎ 0x140017bd8 ba07010000 mov edx, 0x107 ; 263
────────< 0x140017bdd e9d65affff jmp fcn.14000d6b8
╎╎╎╎╎╎╎ 0x140017be2 cc int3
╎╎╎╎╎╎╎ 0x140017be3 cc int3
╎╎╎╎╎╎╎ 0x140017be4 ba07010000 mov edx, 0x107 ; 263
└───────< 0x140017be9 e9ca5affff jmp fcn.14000d6b8
╎╎╎╎╎╎ 0x140017bee cc int3
╎╎╎╎╎╎ 0x140017bef cc int3
╎╎╎╎╎╎ 0x140017bf0 ba57010000 mov edx, 0x157 ; 343
└──────< 0x140017bf5 e9be5affff jmp fcn.14000d6b8
╎╎╎╎╎ 0x140017bfa cc int3
╎╎╎╎╎ 0x140017bfb cc int3
╎╎╎╎╎ 0x140017bfc ba57010000 mov edx, 0x157 ; 343
└─────< 0x140017c01 e9b25affff jmp fcn.14000d6b8
╎╎╎╎ 0x140017c06 cc int3
╎╎╎╎ 0x140017c07 cc int3
╎╎╎╎ 0x140017c08 ba17010000 mov edx, 0x117 ; 279
└────< 0x140017c0d e9a65affff jmp fcn.14000d6b8
╎╎╎ 0x140017c12 cc int3
╎╎╎ 0x140017c13 cc int3
╎╎╎ 0x140017c14 ba17010000 mov edx, 0x117 ; 279
└───< 0x140017c19 e99a5affff jmp fcn.14000d6b8
╎╎ 0x140017c1e cc int3
╎╎ 0x140017c1f cc int3
╎╎ 0x140017c20 ba20000000 mov edx, 0x20 ; 32
└──< 0x140017c25 e98e5affff jmp fcn.14000d6b8
╎ 0x140017c2a cc int3
╎ 0x140017c2b cc int3
╎ 0x140017c2c ba20000000 mov edx, 0x20 ; 32
└─< 0x140017c31 e9825affff jmp fcn.14000d6b8
0x140017c36 cc int3