When a debugger, profiler or crash reporter is unwinding the call stack, can it reliably retrieve the function arguments of every function in the stack?

I originally intended for this to be a later part of the [Sampling Profilers]({{< ref "/post/2018/sampling-profiler-internals-introduction" >}}) series, but a recent discussion with Ben Frederickson and his subsequent py-spy implementation helped crystallize my thoughts, so I figured I'd write it down separately.

Retrieving function arguments is "trivial" in certain cases and pure guesswork in others. I am going to dive into why, and outline the situations from easiest to hardest.

Why would you want access to the arguments?

For most people, this need is apparent in the context of a debugger. The exact arguments are usually the reason for a program failing, so debuggers try their best to extract the arguments to a function.

(gdb) bt
#0  bar (bar_arg=28) at test_prog.c:5
#1  0x0000000000400576 in foo (foo_arg=5) at test_prog.c:10
#2  0x000000000040059e in main () at test_prog.c:16

Of course, if this is run on a release build:

(gdb) bt
#0  0x0000000000400534 in bar ()
#1  0x0000000000400576 in foo ()
#2  0x000000000040059e in main ()

I spent a few weeks in 2017 working on improving Python crash reporting. Since then, I've been fascinated with being able to gain information about managed languages, without modifying their interpreters, and without requiring any custom software modifications.

For a crash reporting tool, having access to the arguments would let it collect useful debug info in the wild. You can imagine extending a tool like Crashpad to identify arguments with certain types and annotate the crash report with pretty printed information about those types, so that the core dump contains this information. Instead of using auxiliary information from the Python interpreter to derive the execution context as we did, one could simply walk the native stack, and interleave the Python stack at every PyEval_EvalFrameEx call in a deterministic way.

Similarly, there is a dearth of cross-language profilers. The well known tools like perf/BPF/dtrace are really meant for native code. Each interpreter of a higher level language usually has its own profiler that understands the language. I think it would be very cool to have a profiler that could sample native stacks, and when it detects a managed language, through some kind of plugin mechanism, infer what managed code was running as part of the profile gathering. You could then interleave stacks in the profile, showing hotspots across languages. So, if Python was slow because 50% of the time was spent in Python, but the other 50% was spent in the C code, waiting for some resource acquisition, you could see both! There are complex desktop applications out there where having 3 languages in the same process is not uncommon, and it is currently difficult to get comprehensive profiles across them.

Setting the stage

The post mostly focuses on Linux and macOS 64-bit. We will quickly look at x86 where it differs. Windows is very similar in most respects. ARM is not much different when it comes to using registers for argument passing so similar concepts apply.

Assembly/DWARF output is from:

clang version 3.8.1-24 (tags/RELEASE_381/final)
Target: x86_64-pc-linux-gnu

This post assumes basic familiarity with assembly language, registers and the stack, function call frames and the concept of unwinding.

There is example code that I will refer to throughout the rest of the post. The test_prog.c file is compiled into several different executables (debug, debug_opt, ...). Look at the Makefile for the precise build configuration.

#include <stdio.h>

void bar(int bar_arg)
{
    printf("The number is %d\n", bar_arg);
}

void foo(int foo_arg)
{
    bar(foo_arg + 23);
    printf("unrelated\n");
}

int main()
{
    foo(5);
}

Argument passing and calling conventions.

In the call stack below, foo is the caller and bar is the callee.

foo()
  -> bar()

All function arguments are assumed to be integers or pointers that fit in a single register.

On x86, arguments are pushed onto the stack in reverse order, followed by the return address (saved eip). The callee can access them by indexing from ebp.

On x86-64 (the System V ABI used by Linux and macOS), the first six integer or pointer arguments are passed in rdi, rsi, rdx, rcx, r8 and r9, in that order. Any further arguments go on the stack.
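To make the two conventions concrete, here is a minimal sketch in C. The helper names are hypothetical (they come from no library), and the 32-bit helper assumes the cdecl convention with a standard `push ebp; mov ebp, esp` prologue.

```c
#include <stddef.h>

/* Integer/pointer argument registers of the System V AMD64 calling
 * convention (Linux, macOS), in argument order. */
static const char *const SYSV_INT_ARG_REGS[] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};
enum { SYSV_NUM_INT_ARG_REGS = 6 };

/* Hypothetical helper: name of the register that holds the index-th
 * (0-based) integer or pointer argument at function entry, or NULL when
 * that argument is passed on the stack instead. */
const char *sysv_int_arg_register(size_t index)
{
    return index < SYSV_NUM_INT_ARG_REGS ? SYSV_INT_ARG_REGS[index] : NULL;
}

/* Hypothetical helper: on 32-bit x86 (cdecl), byte offset from ebp of the
 * index-th 4-byte argument after the prologue has run.
 * [ebp+0] is the saved ebp, [ebp+4] the return address, [ebp+8] argument 0. */
size_t x86_cdecl_arg_offset_from_ebp(size_t index)
{
    return 8 + 4 * index;
}
```

Note that the register mapping is only guaranteed at function entry, which is the crux of the rest of this post.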

Finally, unless optimizations are enabled, ebp/rbp delineates frames. This will become useful later.

Unwinding the stack

The debugger or other tool usually suspends the thread of interest and starts the unwinding process to retrieve the call stack. This is a comprehensive topic by itself, and I have a work in progress post about that. Here I assume that we can somehow retrieve the stack frame of the function whose arguments we want to retrieve.

With debug information

This is the easiest case, and debuggers can always show arguments. This assumes that the executable is built with debug information (-g switch on gcc and clang). On Linux and Mac, the DWARF format is used. The debug information is stored in a section .debug_info. This debug information is pretty comprehensive, detailing for the debugger, the locations of functions, arguments and stack variables.

In the example, here is the relevant information to retrieve bar_arg, the first argument for function bar.

(obtained via dwarfdump debug)

< 2><0x0000003f>      DW_TAG_formal_parameter
                  >>>>  DW_AT_location              len 0x0002: 917c: DW_OP_fbreg -4
                        DW_AT_name                  bar_arg
                        DW_AT_decl_file             0x00000001 /home/nsmnikhil/unwind-arguments/test_prog.c
                        DW_AT_decl_line             0x00000003
                        DW_AT_type                  <0x0000008b>

It tells the debugger exactly where the argument is stored. In this case, there is some collusion between the compiler and the debugger to simplify things. If we look at the disassembly (objdump -M intel -d debug)

0000000000400530 <bar>:
  400530:   55                      push   rbp
  400531:   48 89 e5                mov    rbp,rsp
  400534:   48 83 ec 10             sub    rsp,0x10
  400538:   48 b8 34 06 40 00 00    movabs rax,0x400634
  40053f:   00 00 00
  400542:   89 7d fc        >>>>    mov    DWORD PTR [rbp-0x4],edi <<<<
  400545:   8b 75 fc                mov    esi,DWORD PTR [rbp-0x4]
  400548:   48 89 c7                mov    rdi,rax
  40054b:   b0 00                   mov    al,0x0
  40054d:   e8 ce fe ff ff          call   400420 <printf@plt>
  400552:   89 45 f8                mov    DWORD PTR [rbp-0x8],eax
  400555:   48 83 c4 10             add    rsp,0x10
  400559:   5d                      pop    rbp
  40055a:   c3                      ret
  40055b:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

After the prologue, the compiler simply uses 4 bytes of stack space to stash edi (the low 4 bytes of rdi) and emits the DWARF information indicating that bar_arg can be found at DW_OP_fbreg -4, and fbreg is, of course, rbp.
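The `917c` in the dwarfdump output is the raw location expression: the opcode 0x91 (DW_OP_fbreg) followed by its offset encoded as a signed LEB128 value. Here is a minimal sketch of decoding it. The function names are mine, not from any DWARF library, and it only handles an expression that consists of a single DW_OP_fbreg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DW_OP_fbreg 0x91 /* opcode value from the DWARF specification */

/* Decode one signed LEB128 value from buf and store the number of bytes
 * consumed in *used. Assumes the value fits in 64 bits. */
int64_t decode_sleb128(const uint8_t *buf, size_t len, size_t *used)
{
    assert(buf != NULL && used != NULL);
    uint64_t result = 0;
    unsigned shift = 0;
    size_t i = 0;
    uint8_t byte = 0;
    while (i < len && shift < 64) {
        byte = buf[i++];
        result |= (uint64_t)(byte & 0x7f) << shift;
        shift += 7;
        if ((byte & 0x80) == 0)
            break; /* high bit clear: this was the last byte */
    }
    /* Sign-extend when the last byte has its sign bit (0x40) set. */
    if (shift < 64 && (byte & 0x40))
        result |= ~(uint64_t)0 << shift;
    *used = i;
    return (int64_t)result;
}

/* If expr is exactly one DW_OP_fbreg operation, store its offset from the
 * frame base in *offset and return 1. Otherwise return 0. */
int parse_fbreg(const uint8_t *expr, size_t len, int64_t *offset)
{
    assert(expr != NULL && offset != NULL);
    if (len < 2 || expr[0] != DW_OP_fbreg)
        return 0;
    size_t used = 0;
    *offset = decode_sleb128(expr + 1, len - 1, &used);
    return used == len - 1 && (expr[len - 1] & 0x80) == 0;
}

/* Address of the variable, given the frame base (rbp in the debug build). */
uint64_t fbreg_address(uint64_t frame_base, int64_t offset)
{
    return frame_base + (uint64_t)offset;
}
```

Feeding it the two bytes `0x91 0x7c` yields an offset of -4, so the debugger reads the 4-byte value at rbp-4.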

This is nice because the argument never moves around, regardless of where in the function the debugger is stopped.

What about optimizations?

DWARF is a flexible enough format to represent all kinds of transformations to indicate the memory locations of identifiers.

As long as debug information is enabled, even in optimized builds, the debugger can retrieve arguments at any point in the function. The compiler will simply emit more DWARF as it moves data around throughout the function.

Here is the disassembly for debug_opt, which is compiled with -O2, and the DWARF for bar_arg:

0000000000400570 <bar>:
  400570:   89 f9                   mov    ecx,edi
  400572:   bf 44 06 40 00          mov    edi,0x400644
  400577:   31 c0                   xor    eax,eax
  400579:   89 ce                   mov    esi,ecx
  40057b:   e9 e0 fe ff ff          jmp    400460 <printf@plt>

0000000000400580 <foo>:
  400580:   50                      push   rax
  400581:   8d 77 17                lea    esi,[rdi+0x17]
  400584:   bf 44 06 40 00          mov    edi,0x400644
  400589:   31 c0                   xor    eax,eax
  40058b:   e8 d0 fe ff ff          call   400460 <printf@plt>
  400590:   bf 56 06 40 00          mov    edi,0x400656
  400595:   58                      pop    rax
  400596:   e9 b5 fe ff ff          jmp    400450 <puts@plt>
  40059b:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

00000000004005a0 <main>:
  4005a0:       50                      push   rax
  4005a1:       bf 44 06 40 00          mov    edi,0x400644
  4005a6:       be 1c 00 00 00          mov    esi,0x1c
  4005ab:       31 c0                   xor    eax,eax
  4005ad:       e8 ae fe ff ff          call   400460 <printf@plt>
  4005b2:       bf 56 06 40 00          mov    edi,0x400656
  4005b7:       e8 94 fe ff ff          call   400450 <puts@plt>
  4005bc:       31 c0                   xor    eax,eax
  4005be:       59                      pop    rcx
  4005bf:       c3                      ret

< 2><0x0000004f>      DW_TAG_formal_parameter
                        DW_AT_name                  bar_arg
                        DW_AT_decl_file             0x00000001 /home/nsmnikhil/unwind-arguments/test_prog.c
                        DW_AT_decl_line             0x00000003
                        DW_AT_type                  <0x0000005b>

What's going on? Well, the compiler did several different optimizations.

Both foo and bar were actually inlined into main.
Constant folding and propagation were done, so 5 + 23 was replaced with 28 (0x1c) and passed to printf as argument 2.
No prologue or epilogue was generated for the inlined functions.

The DWARF Debugging Information Entry (DIE) doesn't say anything about the location of bar_arg. The lack of a DW_AT_location indicates we have to look elsewhere. This is not a DWARF tutorial, so I'll skip the details. We have to instead look at the DWARF information for main:

< 4><0x000000e5>          DW_TAG_formal_parameter
                            DW_AT_const_value           0x0000001c
                            DW_AT_abstract_origin       <0x0000004f>

where the formal parameter to the now-inlined bar is stated to have a constant value!

What if we had a more complicated program? The compiler would emit a location list for the argument, which describes a mapping from instruction pointer ranges, to offsets from various memory locations/registers where the argument can be accessed at any point.
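To give a feel for what a location list lookup involves, here is a heavily simplified sketch. The struct and enum are hypothetical: a real location list entry carries a full DWARF expression, not a three-way tag, and the ranges in the usage note are illustrative rather than actual compiler output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a DWARF location description. */
enum loc_kind { LOC_REGISTER, LOC_FRAME_OFFSET, LOC_CONSTANT };

struct loc_entry {
    uint64_t lo_pc;     /* first covered address (inclusive) */
    uint64_t hi_pc;     /* end of the covered range (exclusive) */
    enum loc_kind kind;
    int64_t value;      /* DWARF register number, frame offset, or constant */
};

/* Return the entry that describes where the variable lives when the
 * instruction pointer is pc, or NULL if no range covers pc (the debugger
 * would then report the value as optimized out). */
const struct loc_entry *find_location(const struct loc_entry *list,
                                      size_t n, uint64_t pc)
{
    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (pc >= list[i].lo_pc && pc < list[i].hi_pc)
            return &list[i];
    }
    return NULL;
}
```

For the optimized bar above, one could imagine two entries: bar_arg is in rdi (DWARF register 5) for the first instruction, and in rcx (DWARF register 2) once `mov ecx,edi` has executed. A lookup with the current instruction pointer then picks the right one.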

So, as you can see, it is really easy to retrieve arguments when debug information is available. If you are not using a debugger, you will need to ship a DWARF parser. Fortunately there are several options.

DWARF is a complex format, so various people have implemented custom parsers for very specific use cases. PLCrashReporter has one for unwinding after the program has crashed, when memory allocation is restricted. Mozilla wrote their own, LUL, optimized for unwinding in a profiler. Finally, if you are writing assembly by hand, writing DWARF information for the assembly is a good way to assist debuggers.

Without debug information

The value of debug information cannot be overstated. Sure, we had to parse DWARF, and that is non-trivial, but it gave all the answers! Life is going to get pretty miserable without it.

With no precise description of arguments, we are going to rely on our knowledge of the stack and registers. First the slightly easier one, x86.

x86 (32-bit)

On 32-bit systems, the conventions differ among operating systems, but generally, at least some arguments are on the stack. A function foo() that wants to call bar(int, int, int) first pushes the 3 integers onto the stack, then calls bar. The call instruction will push the current eip (which now points past the call instruction) on the stack, indicating the return address.

We can rely on this holding true regardless of optimizations. Of course, this is not a generic solution, since we don't know how many arguments are going to be present. If we are looking for specific functions and know their signatures, this can work.

To retrieve arguments, we need to reliably find frames. On x86, these are delineated by ebp pushed onto the stack in the prologue of each function, and then being set to the base of that function's frame.

 N-1th arg
----------
 ...
----------
 0th arg
----------
 ret
----------
 old ebp
---------- <- ebp

 ...
---------- <- esp

The very first (currently executing function's) frame can be retrieved by querying the OS for a thread "context". I have a separate post in-progress about the specific mechanics, so I won't cover them here.

With the ebp known, we can start scanning from ebp+8 and perform the manipulations we need based on the argument type. For an integer or pointer type, ebp+8 is the first argument, ebp+12 is the next and so on. The fact that the stack is immutable except for the currently executing function allows us to have strong guarantees about these arguments remaining where they are.
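As a sketch of that scan, the code below reads arguments out of a captured copy of a 32-bit stack. The `stack_snapshot` type and the function names are hypothetical. A real tool would read the target's memory through ptrace, a Mach port or similar, and this assumes every frame has the standard ebp-based prologue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of a 32-bit thread's stack: words[0] is the
 * 4-byte word stored at address base, words[1] the one at base + 4, ... */
struct stack_snapshot {
    uint32_t base;
    const uint32_t *words;
    size_t count;
};

/* Read the 32-bit word at addr. Returns 1 on success, 0 when addr is
 * misaligned or falls outside the snapshot. */
int read_word(const struct stack_snapshot *s, uint32_t addr, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (addr < s->base || (addr - s->base) % 4 != 0)
        return 0;
    size_t idx = (addr - s->base) / 4;
    if (idx >= s->count)
        return 0;
    *out = s->words[idx];
    return 1;
}

/* index-th (0-based) int or pointer argument of the frame with this ebp:
 * [ebp] is the saved ebp, [ebp+4] the return address, [ebp+8] argument 0. */
int read_arg(const struct stack_snapshot *s, uint32_t ebp, size_t index,
             uint32_t *out)
{
    return read_word(s, ebp + 8 + 4 * (uint32_t)index, out);
}

/* ebp of the calling frame, which the prologue saved at [ebp]. */
int caller_ebp(const struct stack_snapshot *s, uint32_t ebp, uint32_t *out)
{
    return read_word(s, ebp, out);
}
```

Starting from the ebp in the thread context, alternating `read_arg` and `caller_ebp` walks up the stack one frame at a time.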

There are certain gotchas. Leaf functions may not always have ebp based indexing. In addition, we still haven't gotten to determining the name of the executing function, in case you only want the arguments for specific functions. All that gets complicated quickly, and is a topic for another post.

x86-64

Ironically, the most common setup on production systems, x86-64 in release mode, is also the trickiest. I think I can say with reasonable certainty that naive extraction of arguments is very difficult, even if one knows which function one is looking for and what its arguments look like. At the minimum one would need a disassembler, and some kind of register analysis to track moves. Let's understand why we end up in this situation.

First, we need the unwind information, to determine the frame boundaries. Fortunately, unwinding information is always present in ELF and Mach-O, because non-debugger tools like profilers and crash reporters need it. The most well known use case is the backtrace() function. In addition, certain languages like C++ use it for exception handling. This information is a subset of DWARF that has enough information to identify frame boundaries and the values of certain registers. Libraries like libunwind use this. It is well documented in the ABI (Section 3.7 Stack Unwind Algorithm).

On Linux, the .eh_frame section is usually just extended DWARF (Use dwarfdump -F <file> to read the .eh_frame section.). On Mac, it is usually Compact Unwind Encoding.

This is the unwind information for bar in the release binary with no optimizations:

$ dwarfdump -F release
...
 <    2><0x00400530:0x0040055b><><cie offset 0x00000044::cie index     1><fde offset 0x00000070 length: 0x0000001c>
       <eh aug data len 0x0>
        0x00400530: <off cfa=08(r7) > <off r16=-8(cfa) >
        0x00400531: <off cfa=16(r7) > <off r6=-16(cfa) > <off r16=-8(cfa) >
        0x00400534: <off cfa=16(r6) > <off r6=-16(cfa) > <off r16=-8(cfa) >
...

The register numbers are standardized. For x86-64 r6 is rbp, r7 is rsp and r16 is the return address. Ian Lance Taylor has a good explanation of the .eh_frame format if you want to understand this fully. The disassembly helps:

0000000000400530 <bar>:
  400530:       55                      push   rbp
  400531:       48 89 e5                mov    rbp,rsp
  400534:       48 83 ec 10             sub    rsp,0x10

cfa refers to the Canonical Frame Address. On x86-64, this is the value of rsp in the previous frame. That is, right before the call instruction in the caller. You can see how the CFA moves as every instruction manipulates the stack.
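To show how an unwinder consumes one of those rows, here is a minimal sketch that applies a row to a register set and a captured chunk of stack memory. The `cfa_row` and `memory` types are simplifications I made up for illustration. Real CFI supports more rule kinds than "saved at CFA plus offset".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* DWARF register numbers used in the dump: r6 = rbp, r7 = rsp, r16 = return address. */
enum { REG_RBP = 6, REG_RSP = 7, REG_RA = 16, NUM_REGS = 17 };

/* Hypothetical captured stack memory: words[0] is the 8-byte word at base. */
struct memory {
    uint64_t base;
    const uint64_t *words;
    size_t count;
};

uint64_t read_u64(const struct memory *m, uint64_t addr)
{
    assert(m != NULL && addr >= m->base && (addr - m->base) % 8 == 0);
    size_t idx = (size_t)((addr - m->base) / 8);
    assert(idx < m->count);
    return m->words[idx];
}

/* One row of the table, e.g. <off cfa=16(r6)> <off r6=-16(cfa)> <off r16=-8(cfa)>. */
struct cfa_row {
    int cfa_reg;        /* CFA = regs[cfa_reg] + cfa_offset */
    int64_t cfa_offset;
    int rbp_saved;      /* nonzero if the row has a rule for r6 */
    int64_t rbp_offset; /* saved rbp is stored at CFA + rbp_offset */
    int64_t ra_offset;  /* return address is stored at CFA + ra_offset */
};

/* Rewrite regs in place to the caller's values and return the CFA. */
uint64_t step_to_caller(const struct cfa_row *row, const struct memory *m,
                        uint64_t regs[NUM_REGS])
{
    assert(row != NULL && regs != NULL);
    uint64_t cfa = regs[row->cfa_reg] + (uint64_t)row->cfa_offset;
    regs[REG_RA] = read_u64(m, cfa + (uint64_t)row->ra_offset);
    if (row->rbp_saved)
        regs[REG_RBP] = read_u64(m, cfa + (uint64_t)row->rbp_offset);
    regs[REG_RSP] = cfa; /* the caller's rsp right before its call instruction */
    return cfa;
}
```

Notice that the only things recovered are the return address, rsp and rbp. Nothing in the row mentions rdi.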

Getting back to our goal, in this case we have just enough information to reconstruct the call stack, but no direct references to the arguments. Could we use the register locations? That is, figure out what value rdi has at each instruction? In this particular case, certainly not.

Is it always like this? When I first looked through libunwind's API, I noticed constants defined for all the registers. Can't one simply call unw_get_reg() with the right constant and get the value of rdi? One can try. If you run the unwind_rdi program, you will quickly experience disappointment. rdi never changes!

$ ./unwind_rdi
The number is 28
ip = 4008f6, rdi = 7fffd8315c30 rdi fetch success 0 signal? 0
ip = 40091e, rdi = 7fffd8315c30 rdi fetch success 0 signal? 0
ip = 7e06b30732e1, rdi = 7fffd8315c30 rdi fetch success 0 signal? 0
ip = 4006aa, rdi = 7fffd8315c30 rdi fetch success 0 signal? 0
ip = 0, rdi = 7fffd8315c30 rdi fetch success 0 signal? 0
The number is 28
unrelated

Registers are segmented into callee-saved and caller-saved. As engineering would have it, all the argument passing registers are caller saved. unw_get_reg() specifically says:

For ordinary stack frames, it is normally possible to access only the preserved (callee-saved) registers and frame-related registers (such as the stack-pointer). However, for signal frames (see unw_is_signal_frame(3)), it is usually possible to access all registers.

That's because an unwinder only really needs rip and rsp to determine the next frame. We can look at some more unwind information from release and spot that callee-saved registers are indeed present sometimes.

<    5><0x004005b0:0x00400615><><cie offset 0x000000a4::cie index     1><fde offset 0x000000d0 length: 0x00000044>
       <eh aug data len 0x0>
        0x004005b0: <off cfa=08(r7) > <off r16=-8(cfa) >
        0x004005b2: <off cfa=16(r7) > <off r15=-16(cfa) > <off r16=-8(cfa) >
        0x004005b4: <off cfa=24(r7) > <off r14=-24(cfa) > <off r15=-16(cfa) > <off r16=-8(cfa) >
        0x004005b9: <off cfa=32(r7) > <off r13=-32(cfa) > <off r14=-24(cfa) > <off r15=-16(cfa) > <off r16=-8(cfa) >

r13-r15, which map to the registers of the same name on x86-64, become available incrementally as they are pushed. It is unclear to me why callee-saved registers are sometimes required. One reason seems to be that the next frame's CFA can be defined in terms of some of them. The other is probably to allow exception handlers to run code, while maintaining the guarantee that callee-saved registers will have been restored.

As engineering would have it, none of the argument passing registers are callee-saved :(
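The mismatch is easy to state in code. This sketch uses the DWARF register numbering from the dumps above and the System V AMD64 register classification. The function names are my own.

```c
#include <stddef.h>

/* DWARF register numbers for x86-64, as they appear in the .eh_frame dumps. */
enum dwarf_reg {
    DW_RAX = 0,  DW_RDX = 1,  DW_RCX = 2,  DW_RBX = 3,
    DW_RSI = 4,  DW_RDI = 5,  DW_RBP = 6,  DW_RSP = 7,
    DW_R8 = 8,   DW_R9 = 9,   DW_R10 = 10, DW_R11 = 11,
    DW_R12 = 12, DW_R13 = 13, DW_R14 = 14, DW_R15 = 15,
    DW_RA = 16
};

/* Integer argument registers of the System V AMD64 ABI, in order. */
static const enum dwarf_reg INT_ARG_REGS[] = {
    DW_RDI, DW_RSI, DW_RDX, DW_RCX, DW_R8, DW_R9
};

/* Registers a callee must preserve, and which unwind info can therefore
 * describe for frames further up the stack. */
int is_callee_saved(enum dwarf_reg reg)
{
    switch (reg) {
    case DW_RBX: case DW_RBP:
    case DW_R12: case DW_R13: case DW_R14: case DW_R15:
        return 1;
    default:
        return 0;
    }
}

/* How many argument registers could an unwinder hope to restore for a
 * caller's frame? */
size_t recoverable_arg_registers(void)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof INT_ARG_REGS / sizeof INT_ARG_REGS[0]; i++) {
        if (is_callee_saved(INT_ARG_REGS[i]))
            n++;
    }
    return n;
}
```

`recoverable_arg_registers()` returns 0: the two sets do not overlap at all.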

The first frame

Surely, we have enough information in the currently executing frame about every single register, right? We did get an initial value for rdi. Since the function is executing, all registers are available. The platform specific context retrieval methods (the context passed to a signal handler on Linux, or thread_get_state() on macOS) will give you rdi, rsi and friends. But remember, these do not always map to the actual arguments! That mapping only holds at function entry. In a debugger, you could deliberately stop at the function entry, and retrieve the arguments.

Unfortunately, crash reporters and profilers are out of luck since they stop at arbitrary instructions. rdi could already have been discarded or overwritten by this point. The function could've called another function, at which point the argument registers were free to be clobbered. One would need a tool that could parse the machine code, keep track of the argument registers being moved around at each step, and then potentially reconstruct where the value lives now. This only works if the value has not been overwritten everywhere.
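A toy version of that register analysis looks like the sketch below. The instruction model is hypothetical and drastically simplified: it assumes the code has already been disassembled into register-to-register moves and "clobbers", and it ignores spills to the stack, calls, branches and sub-register widths.

```c
#include <assert.h>
#include <stddef.h>

enum reg { RAX, RCX, RDX, RSI, RDI, NUM_TRACKED_REGS };

enum op_kind {
    OP_MOV_REG, /* dst = src, an exact copy of a register */
    OP_CLOBBER  /* dst = anything else (immediate, arithmetic result, ...) */
};

struct insn {
    enum op_kind kind;
    enum reg dst;
    enum reg src; /* only meaningful for OP_MOV_REG */
};

/* Simulate the first n instructions of a function and return a bitmask of
 * the registers that still hold the value rdi had at function entry.
 * A result of 0 means the first argument is gone from the registers. */
unsigned track_first_arg(const struct insn *code, size_t n)
{
    assert(code != NULL || n == 0);
    unsigned holders = 1u << RDI; /* at entry, only rdi holds the argument */
    for (size_t i = 0; i < n; i++) {
        unsigned dst_bit = 1u << code[i].dst;
        if (code[i].kind == OP_MOV_REG && (holders & (1u << code[i].src)))
            holders |= dst_bit;  /* the argument was copied into dst */
        else
            holders &= ~dst_bit; /* dst now holds something else */
    }
    return holders;
}
```

Running this over the optimized bar shown earlier (`mov ecx,edi`, `mov edi,0x400644`, `xor eax,eax`, `mov esi,ecx`), the argument moves from rdi to rcx and ends up in both rcx and rsi. For the optimized foo, the argument is consumed by `lea` and then rdi is overwritten, so the mask drops to 0 after two instructions.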

At this point, I've certainly given up trying to find the arguments without modifying the code!

Finding a way out

If we are willing to modify code, either at runtime, or by compiling it with extra plugins, then we can make some progress towards this task.

Using external info

One option to get the arguments is to use information directly from the managed language's runtime. For example, in Python, one can independently retrieve the list of PyFrameObjects, as done by Dropbox's crashpad work and by py-spy. These frames have a 1:1 mapping to the first argument of every PyEval_EvalFrameDefault call. The PyEval_EvalFrameDefault function is usually exported in the public symbol table if Python is linked as a dynamic library. When interleaving stacks, it is easy to determine when that function is being executed. After doing a full unwind, py-spy can also collect the PyFrameObject list. Every time PyEval_EvalFrameDefault is encountered, it can interleave the Python context into the native stack.
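The interleaving step itself is simple once both stacks are available. Here is a minimal sketch that works on symbol names only. It is not how py-spy is implemented. It assumes both stacks are ordered innermost frame first, and it replaces each native evaluation frame with the matching Python frame.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Merge a native stack and a Python stack (both innermost first) into out.
 * Every native frame whose symbol equals eval_symbol is replaced by the
 * next unused Python frame. Returns the number of entries written, which
 * never exceeds out_cap. */
size_t interleave_stacks(const char *const *native, size_t n_native,
                         const char *const *python, size_t n_python,
                         const char *eval_symbol,
                         const char **out, size_t out_cap)
{
    assert(native != NULL && out != NULL && eval_symbol != NULL);
    assert(python != NULL || n_python == 0);
    size_t next_py = 0;
    size_t written = 0;
    for (size_t i = 0; i < n_native && written < out_cap; i++) {
        if (strcmp(native[i], eval_symbol) == 0 && next_py < n_python)
            out[written++] = python[next_py++]; /* 1:1 with eval frames */
        else
            out[written++] = native[i];
    }
    return written;
}
```

A profile built from the merged stack shows `worker (app.py)` calling into a native `sleep_impl` frame, instead of an opaque run of interpreter frames. Those two frame names are made up for illustration.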

Using a trampoline

This is another deterministic, but involved option. Again, it usually only works for retrieving arguments for specific functions of interest.

A trampoline is a piece of code that we insert at runtime, that will replace a function's entry point. That is, we dynamically create a new function out of thin air and redirect various pointers so that this function is called instead of the original. The custom function can now stash away the arguments into specific places (either the stack or callee saved registers). Then it jumps back into the code for the original function.

The profiler will do this replacement in the running program's address space when it starts up. Then, at unwind time, when it encounters this trampoline, it knows exactly where to look for the arguments.
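A real trampoline involves generated machine code and patched entry points, which I won't attempt here. As a portable stand-in for the idea, the sketch below routes calls through a wrapper that stashes the argument in a known place (a "shadow stack") before calling the original. Everything in it is hypothetical, including `countdown`, which plays the role of the function of interest, and `take_sample`, which plays the role of the profiler interrupting the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SHADOW_CAPACITY = 64 };

typedef long (*traced_fn)(long arg);

/* Known location where the arguments of active traced calls are stashed,
 * outermost call first. */
static long shadow_args[SHADOW_CAPACITY];
static size_t shadow_depth;

/* Snapshot filled in by take_sample(). */
long sampled_args[SHADOW_CAPACITY];
size_t sampled_depth;

/* Stand-in for the profiler sampling the thread: copy out whatever is
 * stashed at this moment. */
void take_sample(void)
{
    sampled_depth = shadow_depth;
    memcpy(sampled_args, shadow_args, shadow_depth * sizeof shadow_args[0]);
}

/* The wrapper: stash the argument where the sampler knows to look, then
 * run the original function. */
long call_traced(traced_fn original, long arg)
{
    assert(original != NULL && shadow_depth < SHADOW_CAPACITY);
    shadow_args[shadow_depth++] = arg;
    long result = original(arg);
    shadow_depth--;
    return result;
}

/* Example function of interest. It recurses through the wrapper and is
 * "sampled" at its innermost call. */
long countdown(long n)
{
    if (n == 0) {
        take_sample();
        return 0;
    }
    return 1 + call_traced(countdown, n - 1);
}
```

After `call_traced(countdown, 3)`, the snapshot holds the arguments 3, 2, 1, 0 of the four calls that were active when the sample was taken, without the sampler having to guess at registers.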

The downside is that the trampoline is platform and OS specific. vmprof has a nice sketch if you'd like to read more about this.

Some languages make this process easier. Since Python 3.6, one can write a standard C extension that replaces the default evaluation function with a custom function, which can again stash the arguments. This simplifies the trampoline setup. It is described in PEP 523.

Neither of these options is universal, and instrumenting every function like that would have some serious performance costs. They may make sense for a program under test, but not for in-the-wild augmentation for post-facto collection by a crash reporter.

Conclusion

We've seen that the ability to retrieve function arguments lies on a spectrum. I hope this post gave some insight into the complexity of the problem. I'm thankful that compiler vendors provide comprehensive information in debug builds. Without that, a lot of problems would be impossible to solve. At least for now, the answers are not clear for binaries shipped to users. For profilers, there are ways out of the problem, even if they are not easy or generic. Crash reporters will just have to live without deterministic information. Perhaps some "extended unwind info" or compiler modes that regularly stash arguments onto the stack even in release builds (-Oargs?) is necessary.
