When a debugger, profiler or crash reporter is unwinding the call stack, can it reliably retrieve the function arguments of every function in the stack?
I originally intended this to be a later part of the [Sampling Profilers]({{< ref "/post/2018/sampling-profiler-internals-introduction" >}}) series, but a recent discussion with Ben Frederickson and his subsequent py-spy implementation helped crystallize my thoughts, so I figured I'd write it down separately.
Retrieving function arguments is "trivial" in certain cases and pure guesswork in others. I am going to dive into why, and outline the situations from easiest to hardest.
For most people, this need is apparent in the context of a debugger. The exact arguments are usually the reason for a program failing, so debuggers try their best to extract the arguments to a function.
(gdb) bt
#0 bar (bar_arg=28) at test_prog.c:5
#1 0x0000000000400576 in foo (foo_arg=5) at test_prog.c:10
#2 0x000000000040059e in main () at test_prog.c:16
Of course, if this is run on a release build:
(gdb) bt
#0 0x0000000000400534 in bar ()
#1 0x0000000000400576 in foo ()
#2 0x000000000040059e in main ()
I spent a few weeks in 2017 working on improving Python crash reporting. Since then, I've been fascinated with being able to gain information about managed languages, without modifying their interpreters, and without requiring any custom software modifications.
For a crash reporting tool, having access to the arguments would let it collect useful debug info in the wild. You can imagine extending a tool like Crashpad to identify arguments with certain types and annotate the crash report with pretty printed information about those types, so that the core dump contains this information. Instead of using auxiliary information from the Python interpreter to derive the execution context as we did, one could simply walk the native stack, and interleave the Python stack at every PyEval_EvalFrameEx call in a deterministic way.
Similarly, there is a dearth of cross-language profilers. The well known tools like perf/BPF/dtrace are really meant for native code. Each interpreter of a higher level language usually has its own profiler that understands the language. I think it would be very cool to have a profiler that could sample native stacks, and when it detects a managed language, through some kind of plugin mechanism, infer what managed code was running as part of the profile gathering. You could then interleave stacks in the profile, showing hotspots across languages. So, if Python was slow because 50% of the time was spent in Python, but the other 50% was spent in the C code, waiting for some resource acquisition, you could see both! There are complex desktop applications out there where having 3 languages in the same process is not uncommon, and it is currently difficult to get comprehensive profiles across them.
The post mostly focuses on Linux and macOS 64-bit. We will quickly look at x86 where it differs. Windows is very similar in most respects. ARM is not much different when it comes to using registers for argument passing so similar concepts apply.
Assembly/DWARF output is from:
clang version 3.8.1-24 (tags/RELEASE_381/final)
Target: x86_64-pc-linux-gnu
This post assumes basic familiarity with assembly language, registers and the stack, function call frames and the concept of unwinding.
There is example code that I will refer to throughout the rest of the post.
The test_prog.c file is compiled into several different executables (debug, debug_opt, …). Look at the Makefile for the precise build configuration.
#include <stdio.h>
void bar(int bar_arg)
{
printf("The number is %d\n", bar_arg);
}
void foo(int foo_arg)
{
bar(foo_arg + 23);
printf("unrelated\n");
}
int main()
{
foo(5);
}
In the call stack below, foo is the caller and bar is the callee.
foo()
-> bar()
All function arguments are assumed to be integers or pointers that fit in a single register.
On x86, arguments are pushed onto the stack in reverse order, followed by the return address (saved eip). The callee can access them by indexing from ebp.
On x86-64, under the System V ABI used by Linux and macOS, the first six integer or pointer arguments are passed in rdi, rsi, rdx, rcx, r8 and r9, in that order. Any further arguments go on the stack.
Finally, unless optimizations are enabled, ebp/rbp delineates frames. This will become useful later.
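To make the x86-64 convention concrete, here is a minimal sketch of the register assignment for integer and pointer arguments under the System V ABI. The helper name sysv_int_arg_register is made up for this illustration, and it only describes where an argument lives at function entry, not later in the function body.

```c
#include <stddef.h>

/* Integer/pointer argument registers of the System V AMD64 calling
 * convention (Linux, macOS), in assignment order. */
static const char *const sysv_int_arg_regs[] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};

/* Hypothetical helper: name of the register that carries the index-th
 * (0-based) integer or pointer argument at function entry, or NULL when
 * that argument is passed on the stack instead. */
const char *sysv_int_arg_register(size_t index)
{
    size_t count = sizeof(sysv_int_arg_regs) / sizeof(sysv_int_arg_regs[0]);
    return index < count ? sysv_int_arg_regs[index] : NULL;
}
```

For the example program, bar_arg is argument 0 of bar, so at the first instruction of bar it sits in rdi (edi, since it is an int).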
The debugger or other tool usually suspends the thread of interest and starts the unwinding process to retrieve the call stack. This is a comprehensive topic by itself, and I have a work in progress post about that. Here I assume that we can somehow retrieve the stack frame of the function whose arguments we want to retrieve.
This is the easiest case, and debuggers can always show arguments. This assumes that the executable is built with debug information (the -g switch on gcc and clang). On Linux and Mac, the DWARF format is used. The debug information is stored in the .debug_info section. This debug information is pretty comprehensive, detailing for the debugger the locations of functions, arguments and stack variables.
In the example, here is the relevant information to retrieve bar_arg, the first argument for function bar (obtained via dwarfdump debug).
< 2><0x0000003f> DW_TAG_formal_parameter
>>>> DW_AT_location len 0x0002: 917c: DW_OP_fbreg -4
DW_AT_name bar_arg
DW_AT_decl_file 0x00000001 /home/nsmnikhil/unwind-arguments/test_prog.c
DW_AT_decl_line 0x00000003
DW_AT_type <0x0000008b>
It tells the debugger exactly where the argument is stored. In this case, there is some collusion between the compiler and the debugger to simplify things. If we look at the disassembly (objdump -M intel -d debug):
0000000000400530 <bar>:
400530: 55 push rbp
400531: 48 89 e5 mov rbp,rsp
400534: 48 83 ec 10 sub rsp,0x10
400538: 48 b8 34 06 40 00 00 movabs rax,0x400634
40053f: 00 00 00
400542: 89 7d fc >>>> mov DWORD PTR [rbp-0x4],edi <<<<
400545: 8b 75 fc mov esi,DWORD PTR [rbp-0x4]
400548: 48 89 c7 mov rdi,rax
40054b: b0 00 mov al,0x0
40054d: e8 ce fe ff ff call 400420 <printf@plt>
400552: 89 45 f8 mov DWORD PTR [rbp-0x8],eax
400555: 48 83 c4 10 add rsp,0x10
400559: 5d pop rbp
40055a: c3 ret
40055b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
After the prologue, the compiler simply uses 4 bytes of stack space to stash edi (the low 4 bytes of rdi) and emits the DWARF information indicating that bar_arg can be found at DW_OP_fbreg -4, where the frame base (fbreg) is, of course, rbp.
This is nice because the argument never moves around, regardless of where in the function the debugger is stopped.
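The location bytes 91 7c in the dwarfdump output can be decoded by hand: 0x91 is DW_OP_fbreg, and 0x7c is -4 encoded as a signed LEB128 number. The following is a rough sketch of that decoding, not a general DWARF expression evaluator. The function names are my own, and the frame base is assumed to be the value of rbp, as in this debug build.

```c
#include <stddef.h>
#include <stdint.h>

#define DW_OP_fbreg 0x91 /* opcode value defined by the DWARF standard */

/* Decode one signed LEB128 number from buf[*pos..len), advancing *pos.
 * Returns 0 on success, -1 if the buffer ends mid-number. */
int decode_sleb128(const uint8_t *buf, size_t len, size_t *pos, int64_t *out)
{
    uint64_t result = 0;
    unsigned shift = 0;
    uint8_t byte;

    do {
        if (*pos >= len || shift >= 64)
            return -1;
        byte = buf[(*pos)++];
        result |= (uint64_t)(byte & 0x7f) << shift;
        shift += 7;
    } while (byte & 0x80);

    /* Bit 6 of the final byte is the sign bit: extend it upwards. */
    if (shift < 64 && (byte & 0x40))
        result |= ~(uint64_t)0 << shift;

    *out = (int64_t)result;
    return 0;
}

/* Evaluate a location expression that consists of a single DW_OP_fbreg.
 * frame_base is assumed to be the value of rbp for the frame.
 * Returns 0 and stores the argument's address, or -1 for anything else. */
int eval_fbreg(const uint8_t *expr, size_t len, uint64_t frame_base,
               uint64_t *addr)
{
    size_t pos = 1;
    int64_t offset;

    if (len < 2 || expr[0] != DW_OP_fbreg)
        return -1;
    if (decode_sleb128(expr, len, &pos, &offset) != 0 || pos != len)
        return -1;
    *addr = frame_base + (uint64_t)offset;
    return 0;
}
```

Feeding it the two bytes from the DIE yields rbp - 4, which matches the [rbp-0x4] store in the disassembly.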
DWARF is a flexible enough format to represent all kinds of transformations to indicate the memory locations of identifiers.
As long as debug information is enabled, even in optimized builds, the debugger can retrieve arguments at any point in the function. The compiler will simply emit more DWARF as it moves data around throughout the function.
Here is the disassembly for debug_opt, which is compiled with -O2, and the DWARF for bar_arg:
0000000000400570 <bar>:
400570: 89 f9 mov ecx,edi
400572: bf 44 06 40 00 mov edi,0x400644
400577: 31 c0 xor eax,eax
400579: 89 ce mov esi,ecx
40057b: e9 e0 fe ff ff jmp 400460 <printf@plt>
0000000000400580 <foo>:
400580: 50 push rax
400581: 8d 77 17 lea esi,[rdi+0x17]
400584: bf 44 06 40 00 mov edi,0x400644
400589: 31 c0 xor eax,eax
40058b: e8 d0 fe ff ff call 400460 <printf@plt>
400590: bf 56 06 40 00 mov edi,0x400656
400595: 58 pop rax
400596: e9 b5 fe ff ff jmp 400450 <puts@plt>
40059b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
00000000004005a0 <main>:
4005a0: 50 push rax
4005a1: bf 44 06 40 00 mov edi,0x400644
4005a6: be 1c 00 00 00 mov esi,0x1c
4005ab: 31 c0 xor eax,eax
4005ad: e8 ae fe ff ff call 400460 <printf@plt>
4005b2: bf 56 06 40 00 mov edi,0x400656
4005b7: e8 94 fe ff ff call 400450 <puts@plt>
4005bc: 31 c0 xor eax,eax
4005be: 59 pop rcx
4005bf: c3 ret
< 2><0x0000004f> DW_TAG_formal_parameter
DW_AT_name bar_arg
DW_AT_decl_file 0x00000001 /home/nsmnikhil/unwind-arguments/test_prog.c
DW_AT_decl_line 0x00000003
DW_AT_type <0x0000005b>
What's going on? Well, the compiler did several different optimizations.
- Both foo and bar were actually inlined into main.
- Constant propagation and folding was done, so 5 + 23 was substituted with 28 (0x1c) and passed to printf as argument 2.
- No prologue and epilogue was generated for the inlined functions.
The DWARF Debugging Information Entry (DIE) doesn't say anything about the location of bar_arg. The lack of a DW_AT_location indicates we have to look elsewhere. This is not a DWARF tutorial, so I'll skip the details. We have to instead look at the DWARF information for main:
< 4><0x000000e5> DW_TAG_formal_parameter
DW_AT_const_value 0x0000001c
DW_AT_abstract_origin <0x0000004f>
where the formal parameter to the now-inlined bar is stated to have a constant value!
What if we had a more complicated program? The compiler would emit a location list for the argument, which maps instruction pointer ranges to the register or memory location (typically an offset from some register) that holds the argument within that range.
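A location list lookup can be pictured roughly as follows. This is a simplified model, not the on-disk DWARF encoding: the struct layout, the loc_kind enum and find_location are all invented for illustration, and real entries carry full DWARF expressions rather than a single value.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kinds of places an argument can live. */
enum loc_kind { LOC_REGISTER, LOC_FRAME_OFFSET, LOC_CONSTANT };

/* One (hypothetical) decoded location list entry. */
struct loc_entry {
    uint64_t lo_pc;     /* first covered address, inclusive */
    uint64_t hi_pc;     /* end of the covered range, exclusive */
    enum loc_kind kind;
    int64_t value;      /* DWARF register number, frame offset or constant */
};

/* Return the entry describing the argument at instruction pointer pc, or
 * NULL when no range covers pc (the value is optimized out there). */
const struct loc_entry *find_location(const struct loc_entry *list, size_t n,
                                      uint64_t pc)
{
    for (size_t i = 0; i < n; i++) {
        if (pc >= list[i].lo_pc && pc < list[i].hi_pc)
            return &list[i];
    }
    return NULL;
}
```

The debugger performs this lookup with the instruction pointer of the frame it is inspecting, then reads the register or memory location the matching entry names.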
So, as you can see, it is really easy to retrieve arguments when debug information is available. If you are not using a debugger, you will need to ship a DWARF parser. Fortunately there are several options.
DWARF is a complex format, so various people have implemented custom parsers for very specific use cases. PLCrashReporter has one for unwinding after the program has crashed, when memory allocation is restricted. Mozilla wrote their own, LUL, optimized for unwinding in a profiler. Finally, if you are writing assembly by hand, writing DWARF information for the assembly is a good way to assist debuggers.
The value of debug information cannot be overstated. Sure, we had to parse DWARF, and that is non-trivial, but it gave all the answers! Life is going to get pretty miserable without it.
With no precise description of arguments, we are going to rely on our knowledge of the stack and registers. First, the slightly easier one: x86.
On 32-bit systems, the conventions differ among operating systems, but generally, at least some arguments are on the stack. A function foo() that wants to call bar(int, int, int) first pushes the 3 integers onto the stack, then calls bar. The call instruction will push the current eip (which now points past the call instruction) on the stack, indicating the return address.
We can leverage that this will hold true regardless of optimizations. Of course, this is not a generic solution, since we don't know how many arguments are going to be present. If we are looking for specific functions and know their signatures, this can work.
To retrieve arguments, we need to reliably find frames. On x86, these are delineated by ebp being pushed onto the stack in the prologue of each function, and then being set to the base of that function's frame.
N-1th arg
----------
...
----------
0th arg
----------
ret
----------
old ebp
---------- <- ebp
…..
---------- <- esp
The very first (currently executing function's) frame can be retrieved by querying the OS for a thread "context". I have a separate post in-progress about the specific mechanics, so I won't cover them here.
With the ebp known, we can start scanning from ebp+8 and perform the manipulations we need based on the argument type. For an integer or pointer type, ebp+8 is the first argument, ebp+12 is the next and so on. The fact that the stack is immutable except for the currently executing function allows us to have strong guarantees about these arguments remaining where they are.
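The ebp-relative indexing can be sketched against a copy of the target's stack memory. Everything here is illustrative: struct stack_image and the helper names are hypothetical, arguments are assumed to be 4-byte integers or pointers, and the copied bytes are assumed to be in the same byte order as the host.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A copied chunk of a 32-bit target's stack (hypothetical container). */
struct stack_image {
    uint32_t base;          /* target address of bytes[0] */
    const uint8_t *bytes;
    size_t size;
};

/* Read a 4-byte word at a target address; -1 if it is outside the copy. */
int read_u32(const struct stack_image *s, uint32_t addr, uint32_t *out)
{
    size_t off;

    if (s->size < 4 || addr < s->base)
        return -1;
    off = (size_t)(addr - s->base);
    if (off > s->size - 4)
        return -1;
    memcpy(out, s->bytes + off, 4);
    return 0;
}

/* index-th (0-based) stack argument of the frame whose frame pointer is
 * ebp: [ebp] holds the old ebp, [ebp+4] the return address, and the
 * arguments start at [ebp+8]. */
int read_stack_arg(const struct stack_image *s, uint32_t ebp, unsigned index,
                   uint32_t *out)
{
    return read_u32(s, ebp + 8 + 4 * index, out);
}

/* Step to the caller: its ebp and the return address into it. */
int caller_frame(const struct stack_image *s, uint32_t ebp,
                 uint32_t *caller_ebp, uint32_t *ret_addr)
{
    if (read_u32(s, ebp, caller_ebp) != 0)
        return -1;
    return read_u32(s, ebp + 4, ret_addr);
}
```

Repeating caller_frame and read_stack_arg walks up the stack one frame at a time, which only works as long as every function on the way keeps an ebp-based frame.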
There are certain gotchas. Leaf functions may not always have ebp based indexing. In addition, we still haven't gotten to determining the name of the executing function, in case you only want the arguments for specific functions. All that gets complicated quickly, and is a topic for another post.
Ironically, the most common setup on production systems, x86-64 in release mode, is also the trickiest. I think I can say with reasonable certainty that naive extraction of arguments is very difficult, even if one knows which function one is looking for and what its arguments look like. At a minimum, one would need a disassembler and some kind of register analysis to track moves. Let's understand why we end up in this situation.
First, we need the unwind information, to determine the frame boundaries. Fortunately, unwinding information is present by default in x86-64 ELF and Mach-O binaries, because non-debugger tools like profilers and crash reporters need it. The most well known use case is the backtrace() function. In addition, certain languages like C++ use it for exception handling. This information is a subset of DWARF that has enough information to identify frame boundaries and the values of certain registers. Libraries like libunwind use this. It is well documented in the ABI (Section 3.7 Stack Unwind Algorithm).
On Linux, the .eh_frame section is usually just extended DWARF (use dwarfdump -F <file> to read the .eh_frame section). On Mac, it is usually Compact Unwind Encoding.
This is the unwind information for bar in the release binary with no optimizations:
$ dwarfdump -F release
…..
< 2><0x00400530:0x0040055b><><cie offset 0x00000044::cie index 1><fde offset 0x00000070 length: 0x0000001c>
<eh aug data len 0x0>
0x00400530: <off cfa=08(r7) > <off r16=-8(cfa) >
0x00400531: <off cfa=16(r7) > <off r6=-16(cfa) > <off r16=-8(cfa) >
0x00400534: <off cfa=16(r6) > <off r6=-16(cfa) > <off r16=-8(cfa) >
…..
The register numbers are standardized. For x86-64, r6 is rbp, r7 is rsp and r16 is the return address. Ian Lance Taylor has a good explanation of the .eh_frame format if you want to understand this fully. The disassembly helps:
0000000000400530 <bar>:
400530: 55 push rbp
400531: 48 89 e5 mov rbp,rsp
400534: 48 83 ec 10 sub rsp,0x10
cfa refers to the Canonical Frame Address. On x86-64, this is the value of rsp in the previous frame, that is, right before the call instruction in the caller. You can see how the rule for computing the CFA changes as each instruction manipulates the stack, while the CFA itself stays fixed for the lifetime of the frame.
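The three rows for bar can be replayed with a small model of how an unwinder applies them. This is a sketch with made-up names (struct cfa_row, compute_cfa, return_address_slot). A real unwinder decodes these rows from .eh_frame and reads the register values from the thread context.

```c
#include <stdint.h>

/* DWARF register numbers for x86-64 used in the rows above. */
enum { DWARF_RBP = 6, DWARF_RSP = 7, DWARF_RA = 16, DWARF_NREGS = 17 };

/* One row of the unwind table, reduced to what this example needs. */
struct cfa_row {
    int cfa_reg;      /* register the CFA is computed from */
    int64_t cfa_off;  /* CFA = value of cfa_reg + cfa_off */
    int64_t ra_off;   /* return address is stored at CFA + ra_off */
};

/* Apply the row's CFA rule to the current register values. */
uint64_t compute_cfa(const struct cfa_row *row,
                     const uint64_t regs[DWARF_NREGS])
{
    return regs[row->cfa_reg] + (uint64_t)row->cfa_off;
}

/* Address of the stack slot that holds the return address. */
uint64_t return_address_slot(const struct cfa_row *row, uint64_t cfa)
{
    return cfa + (uint64_t)row->ra_off;
}
```

Plugging in the rows at 0x400530, 0x400531 and 0x400534 with the matching rsp and rbp values gives the same CFA each time, and the return address slot is CFA - 8. Note that nothing in these rows mentions rdi.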
Getting back to our goal, in this case we have just enough information to reconstruct the call stack, but no direct references to the arguments. Could we use the register locations? That is, figure out what value rdi has at each instruction? In this particular case, certainly not.
Is it always like this? When I first looked through libunwind's API, I noticed constants defined for all the registers. Can't one simply call unw_get_reg() with the right constant and get the value of rdi?
One can try. If you run the unwind_rdi program, you will quickly experience disappointment. rdi never changes!
$ ./unwind_rdi
The number is 28
ip = 4008f6, rdi = 7fffd8315c30 rdi fetch success 0 signal? 0
ip = 40091e, rdi = 7fffd8315c30 rdi fetch success 0 signal? 0
ip = 7e06b30732e1, rdi = 7fffd8315c30 rdi fetch success 0 signal? 0
ip = 4006aa, rdi = 7fffd8315c30 rdi fetch success 0 signal? 0
ip = 0, rdi = 7fffd8315c30 rdi fetch success 0 signal? 0
The number is 28
unrelated
Registers are segmented into callee-saved and caller-saved. As engineering would have it, all the argument passing registers are caller-saved. The unw_get_reg() documentation specifically says:
For ordinary stack frames, it is normally possible to access only the preserved (callee-saved) registers and frame-related registers (such as the stack-pointer). However, for signal frames (see unw_is_signal_frame(3)), it is usually possible to access all registers.
That's because an unwinder only really needs rip and rsp to determine the next frame.
We can look at some more unwind information from release and spot that callee-saved registers are indeed present sometimes.
< 5><0x004005b0:0x00400615><><cie offset 0x000000a4::cie index 1><fde offset 0x000000d0 length: 0x00000044>
<eh aug data len 0x0>
0x004005b0: <off cfa=08(r7) > <off r16=-8(cfa) >
0x004005b2: <off cfa=16(r7) > <off r15=-16(cfa) > <off r16=-8(cfa) >
0x004005b4: <off cfa=24(r7) > <off r14=-24(cfa) > <off r15=-16(cfa) > <off r16=-8(cfa) >
0x004005b9: <off cfa=32(r7) > <off r13=-32(cfa) > <off r14=-24(cfa) > <off r15=-16(cfa) > <off r16=-8(cfa) >
r13-r15, which map to the registers of the same name on x86-64, become available incrementally as each one is pushed.
It is unclear to me why callee-saved registers are sometimes required. One
reason seems to be that the next frame's CFA can be defined in terms of some of
them. The other is probably to allow exception handlers to run code, while
maintaining the guarantee that callee-saved registers will have been restored.
As engineering would have it, none of the argument passing registers are callee-saved :(
Surely, we have enough information in the currently executing frame about every single register, right? We did get an initial value for rdi. Since the function is executing, all registers are available. The platform specific context retrieval methods, either the signal handler on Linux or thread_get_state() on macOS, will give you rdi, rsi and friends.
But remember, these do not always map to the actual arguments! That assertion
only holds at function entry. In a debugger, you could deliberately stop
at the function entry, and retrieve the arguments.
Unfortunately, crash reporters and profilers are out of luck since they stop at arbitrary instructions. rdi could already have been discarded, or overwritten by this point. The function could've called another function, at which point they were lost. One would need a tool that could parse the machine code, keep track of the argument registers being moved around at each step and then potentially reconstruct the memory location. This assumes that no overwriting has occurred.
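To give a feel for what such tracking involves, here is a toy model that follows one argument through a list of already-decoded instructions. The instruction representation is entirely hypothetical, only a handful of registers are modelled, and stack spills and partial-register writes are ignored. A real tool would sit on top of a disassembler.

```c
#include <stddef.h>

/* A few caller-saved registers, enough for the example. */
enum reg { RAX, RCX, RDX, RSI, RDI, NUM_REGS };

/* Toy instruction kinds: register-to-register move, any other write that
 * destroys dst, and a call (which may destroy every register listed above). */
enum op { OP_MOV, OP_CLOBBER, OP_CALL };

struct insn {
    enum op op;
    enum reg dst;
    enum reg src;   /* only meaningful for OP_MOV */
};

/* Returns a bitmask (bit r set for register r) of the registers that still
 * hold the value arg_reg had at function entry, after executing the first
 * n instructions. A result of 0 means the value can no longer be found. */
unsigned track_argument(const struct insn *code, size_t n, enum reg arg_reg)
{
    unsigned holders = 1u << arg_reg;

    for (size_t i = 0; i < n; i++) {
        unsigned dst = 1u << code[i].dst;
        unsigned src = 1u << code[i].src;

        switch (code[i].op) {
        case OP_MOV:
            if (holders & src)
                holders |= dst;     /* the copy now holds the argument too */
            else
                holders &= ~dst;    /* dst was overwritten with something else */
            break;
        case OP_CLOBBER:
            holders &= ~dst;
            break;
        case OP_CALL:
            holders = 0;            /* all modelled registers are caller-saved */
            break;
        }
    }
    return holders;
}
```

Running it over the optimized bar (mov ecx,edi; mov edi,imm; xor eax,eax; mov esi,ecx) shows the argument leaving rdi after two instructions and surviving only in rcx and rsi, and a single call wipes it out altogether.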
At this point, I've certainly given up trying to find the arguments without modifying the code!
If we are willing to modify code, either at runtime, or by compiling it with extra plugins, then we can make some progress towards this task.
One option to get the arguments is to use information directly from the managed language's runtime. For example, in Python, one can independently retrieve the list of PyFrameObjects, as done by Dropbox's crashpad work and by py-spy. These frames have a 1:1 mapping to the first argument of every PyEval_EvalFrameDefault call. The PyEval_EvalFrameDefault function is usually exported in the public symbol table if Python is linked as a dynamic library. When interleaving stacks, it is easy to determine when that function is being executed. After doing a full unwind, py-spy can also collect the PyFrameObject list. Every time PyEval_EvalFrameDefault is encountered, it can interleave the Python context into the native stack.
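The merge step itself is simple once both stacks are in hand. Below is a minimal sketch that works on symbol names only. It assumes both lists are ordered innermost frame first and that each evaluation frame corresponds to exactly one Python frame. The function name and the choice to substitute the Python frame for the evaluation frame are mine, not a description of how py-spy does it.

```c
#include <stddef.h>
#include <string.h>

/* Merge a native stack with a Python stack. Each native frame whose symbol
 * equals eval_symbol is replaced by the next unused Python frame.
 * Writes at most out_cap entries to out and returns the number of frames in
 * the merged stack (which may exceed out_cap). */
size_t interleave_stacks(const char *const *native, size_t n_native,
                         const char *const *python, size_t n_python,
                         const char *eval_symbol,
                         const char **out, size_t out_cap)
{
    size_t next_py = 0;
    size_t n_out = 0;

    for (size_t i = 0; i < n_native; i++) {
        const char *frame = native[i];

        /* Substitute the Python-level frame for the interpreter loop. */
        if (strcmp(frame, eval_symbol) == 0 && next_py < n_python)
            frame = python[next_py++];

        if (n_out < out_cap)
            out[n_out] = frame;
        n_out++;
    }
    return n_out;
}
```

With a native stack of bar, PyEval_EvalFrameDefault, call_function, PyEval_EvalFrameDefault, main and two Python frames, the merged stack shows the two Python functions at the positions where the interpreter loop was running.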
This is another deterministic, but involved option. Again, it usually only works for retrieving arguments for specific functions of interest.
A trampoline is a piece of code that we insert at runtime, that will replace a function's entry point. That is, we dynamically create a new function out of thin air and redirect various pointers so that this function is called instead of the original. The custom function can now stash away the arguments into specific places (either the stack or callee saved registers). Then it jumps back into the code for the original function.
The profiler will do this replacement in the running program's address space when it starts up. Then, at unwind time, when it encounters this trampoline, it knows exactly where to look for the arguments.
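Patching machine code at runtime is too platform specific for a short example, so the sketch below swaps in a plain function pointer for the patched entry point. It only illustrates the idea of stashing the argument somewhere the unwinder knows to look. All names are hypothetical, and unlike a real trampoline it calls the original function rather than jumping back into it.

```c
#include <stddef.h>

#define SHADOW_DEPTH 64

typedef long (*unary_fn)(long);

/* Side table the unwinder can read: one stashed argument per live call. */
static long shadow_args[SHADOW_DEPTH];
static size_t shadow_top;
static long last_sampled_arg = -1;

/* Stand-in for the profiler's sampling interrupt: it reads the argument
 * from the shadow stack instead of guessing at rdi. */
void take_sample(void)
{
    if (shadow_top > 0 && shadow_top <= SHADOW_DEPTH)
        last_sampled_arg = shadow_args[shadow_top - 1];
}

long sampled_arg(void)
{
    return last_sampled_arg;
}

/* The function of interest. */
long bar_impl(long x)
{
    take_sample();      /* pretend a sample lands in the middle of bar */
    return x + 1;
}

/* Every call goes through this pointer. It plays the role of the entry
 * point that a real trampoline would patch. */
unary_fn bar_entry = bar_impl;
static unary_fn bar_original;

/* Stash the argument, run the original, then pop the stash. */
long bar_trampoline(long x)
{
    long result;

    if (shadow_top < SHADOW_DEPTH)
        shadow_args[shadow_top] = x;
    shadow_top++;
    result = bar_original(x);
    shadow_top--;
    return result;
}

/* Redirect the entry point once; repeated calls are ignored. */
void install_trampoline(void)
{
    if (bar_entry == bar_trampoline)
        return;
    bar_original = bar_entry;
    bar_entry = bar_trampoline;
}
```

Before install_trampoline() runs, a sample taken inside bar finds nothing. Afterwards, calling bar_entry(28) leaves 28 in the side table for the duration of the call, at the cost of extra work on every invocation.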
The downside is that the trampoline is platform and OS specific. vmprof has a nice sketch if you'd like to read more about this.
Some languages make this process easier. Since Python 3.6, one can write a standard C extension that replaces the default frame evaluation function with a custom function, which can again stash the arguments. This simplifies the trampoline setup. It is described in PEP-523.
Neither of these options is universal, and instrumenting every function like that would have some serious performance costs. They may make sense for a program under test, but not for in-the-wild augmentation for post-facto collection by a crash reporter.
We've seen that the ability to retrieve function arguments lies on a spectrum.
I hope this post gave some insight into the complexity of the problem. I'm
thankful that compiler vendors provide comprehensive information in debug
builds. Without that, a lot of problems would be impossible to solve. At
least for now, the answers are not clear for binaries shipped to users. For
profilers, there are ways out of the problem, even if they are not easy or
generic. Crash reporters will just have to live without deterministic
information. Perhaps some "extended unwind info" or compiler modes that
regularly stash arguments onto the stack even in release builds (-Oargs?)
are necessary.