Which other missing feature in gcc would make tracing function calls in userspace C/C++ applications easier
Let's say we want to compile a larger C/C++ application with 80,000 unique functions so that at run-time the function call-tree is output for code comprehension and debugging purposes.
Here [1] various methods and implementation options were discussed including the possibility of modifying gcc itself.
However, modifying gcc to insert macros all over the code has an unknown level of difficulty... but it's probably more difficult than expected.
So here is a further idea which can work without modifying gcc, but which could in theory be faster at compile time if gcc would be modified to do the same thing!
[1] https://gist.github.com/simonhf/489b1f3353a0ac095df01a1ddb6ad0e7
- This idea is very similar to the
-finstrument-macros
and__FUNCTION_UNIQUE__
idea presented in [1], but possibly less complicated and more pragmatic :-) - Instead of the higher implementation risk idea of inserting macros every, and introducing a new and complicated macro
__FUNCTION_UNIQUE__
, the new idea works as follows: - The current gcc feature
-finstrument-macros
together with-O0
generates Intel assembler which looks like this:
movq 24(%rsp), %rsi <-- function parameter #2
leaq my_func(%rip), %rdi <-- function parameter #1
call __cyg_profile_func_enter@PLT
...
movq 24(%rsp), %rsi <-- function parameter #2
leaq my_func(%rip), %rdi <-- function parameter #1
call __cyg_profile_func_exit@PLT
- Note: For Linux, the first 6 function parameters get passed as registers as described here [2].
- The first parameter to function
__cyg_profile_func_(enter|exit)()
is the 64bit address of the function being instrumented, which is, of course, unique. - In C++, the function label would be the mangled name of the function, but still unique.
- We can invoke gcc using the exiting
-finstrument-functions
feature, and ask it to output assembler instead of outputting an object file. - We can then modify the output assembler and assembler it to an object file.
- The modified assembler would look like this:
movq 24(%rsp), %rsi <-- function parameter #2
leaq cwrap_data_my_func(%rip), %rdi <-- function parameter #1
movl cwrap_data_my_func(%rip), %edx <-- new: register edx holds function verbosity
cmpl cwrap_log_verbosity(%rip), %edx <-- new: compare with global verbosity
jl .L_cwrap_skip_cyg_profile_func_1 <-- new: skip call if verbosity too low
call __cyg_profile_func_enter@PLT
.L_cwrap_skip_cyg_profile_func_1:
...
movq 24(%rsp), %rsi <-- function parameter #2
leaq cwrap_data_my_func(%rip), %rdi <-- function parameter #1
movl cwrap_data_my_func(%rip), %edx <-- new: register edx holds function verbosity
cmpl cwrap_log_verbosity(%rip), %edx <-- new: compare with global verbosity
jl .L_cwrap_skip_cyg_profile_func_1 <-- new: skip call if verbosity too low
call __cyg_profile_func_exit@PLT
.L_cwrap_skip_cyg_profile_func_1:
- Register 'edx' is used because it would be used for the 3rd function parameter [2] if
__cyg_profile_func_(enter|exit)()
used a 3rd parameter. - In the above example
my_func
is converted tocwrap_data_my_func
. Instead of passing the unique address ofmy_func
, then the unique name is converted into a unique data structure. - The data structure
cwrap_data_my_func
hold lots of run-time info about the function including the verbosity which is always the first item, e.g.:
typedef struct CWRAP_DATA {
int verbosity;
int magic;
int calls;
int line;
const char * name_cpp; int len_name_cpp;
const char * name ; int len_name ;
const char * result ; int len_result ;
const char * params ; int len_params ;
const char * file ; int len_file ;
} __attribute__((aligned(32),packed)) CWRAP_DATA;
- A verbosity jump wrapper is placed around the call to
__cyg_profile_func_(enter|exit)()
which compares the verbosity for the function with the global verbosity. - The idea is that it's better for run-time performance using
if(){ call{} }
thancall{ if(){} }
. - Using the above mechanism then all C/C++ files on the project can be compiled to object files via modified assembler files.
- Finally at linking, the linker will list all the 'undefined references' e.g.
cwrap_data_my_func
. - At this point we can dynamically create the necessary C/C++ source, compile, and link it to resolve the undefined references, e.g.:
$ cat cwrap.c
...
CWRAP_DATA cwrap_data_my_func = {1, CWRAP_MAGIC, 0, 11, "my_func", 7, "my_func", 7, "", 0, "", 0, "my-example.cpp", 14};
...
[1] https://gist.github.com/simonhf/489b1f3353a0ac095df01a1ddb6ad0e7 [2] https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI
- The strategy appears to work fine for
-O0
and maybe for-O1
, but can easily fail for-O2
. Why? - This is because gcc generates different assembler code when functions get inlined, e.g.:
leaq main(%rip), %rdi
leaq inc_a(%rip), %rbp
...
call __cyg_profile_func_enter@PLT
...
movq %rbp, %rdi
call __cyg_profile_func_enter@PLT <-- inline function #1 enter via rbp
...
movq %rbp, %rdi
call __cyg_profile_func_exit@PLT <-- inline function #1 leave via rbp
...
leaq inc_b(%rip), %rdi
call __cyg_profile_func_enter@PLT <-- inline function #2 enter
...
leaq inc_b(%rip), %rdi
call __cyg_profile_func_exit@PLT <-- inline function #2 enter
...
leaq main(%rip), %rdi
call __cyg_profile_func_exit@PLT
- In the above example, two functions
inc_a()
andinc_b()
get inlined intomain()
. - While functions
main()
andinc_b()
continue to get the 64bit address of their functions directly loaded into registerrdi
, functioninc_a()
indirectly uses registerrbp
instead. - While it's possible to take this gcc pattern into consideration and get small test C/C++ projects compiled with
-O2
working, it's unknown how many other patterns exist, and how much of a problem this is.
- To avoid tussling with the gcc output assembler files, gcc could be modifed to generate the modified instructions directly.
- The new
-finstrument-functions-advanced
could be made available in addition to the exsting-finstrument-functions
option. - This means the compilation of C/C++, with the
-finstrument-functions-advanced
option, would happen without any outside meddling to 'instument the instrumentation'. - However, the user is still required to harvest the undefined reference during linking, dynamically create the necessary C/C++ source, and compile, and link it.
- Note:
-O3
is used during compilation to assembler. - Note: The script
trace.pl
is hard-coded to this particular example currently, but it shouldn't be that much effort to make it work with any C/C++ files being compiled or linked. - Note: There are a bunch of non-mangled C and mangled C++ function names.
- Note: The output below shows the progress of scanning the output assembler files. You can see how functions like
bar()
andbaz()
have been inlined, so the calls to__cyg_profile_func_(enter|exit)()
have also been inlined. - Note: A simple example function
cwrap_log_verbosity_set()
is provided to show how function verbosity levels can be manipulated at run-time if desired. - Note: Linking happens twice; once to gather the undefined references, and once to include the auto generated
cwrap.c
file containing the structs for each instrumented funciton. - Note: This example uses
printf()
to display the instrumentation, but in an production example output would likely go to a better performing, shared memory ring buffer instead. - Note: This example overrides
printf()
with a preprocessor definition purely to make the existing non-instrumentationprintf()
statements allign with the indentation of the instrumented function calls. - Note: An example
cwrap_log_stats()
function is provided to illustrate how the static data for each function can be used at run-time for further analysis, in this case to report before process exit, the number of instrumented calls displayed at run-time. - Note: The number of calls made so far to a particular function is also displayed in the example instrumentation output, e.g.
my_struct::~my_struct()
is called 4 times in total. - Note: In the example code provided, no attempt is made to collapse the output for short functions, e.g. instead of 2 instrumentation lines, only show
+ baz() {} // #1
.
$ ./doit.sh
compiling cpp-example-1.cpp to assembler
compiling cpp-example-2.cpp to assembler
munging cpp-example-1.s
- s_file=cpp-example-1.s
- s_file_lines=8665
- found __cyg_profile_func_(enter|exit}() calls: 58 in 0.000298s
- enter _Z7get_maxIiET_S0_S0_
- exit _Z7get_maxIiET_S0_S0_
- enter _Z7get_maxIcET_S0_S0_
- exit _Z7get_maxIcET_S0_S0_
- enter _Z7bye_bazv
- exit _Z7bye_bazv
- exit _Z7bye_bazv
- enter _Z3byev
- exit _Z3byev
- exit _Z3byev
- enter _ZN9my_structC2EPKc
- exit _ZN9my_structC2EPKc
- exit _ZN9my_structC2EPKc
- enter _ZN9my_structC2EPKcS1_
- exit _ZN9my_structC2EPKcS1_
- exit _ZN9my_structC2EPKcS1_
- enter _ZN9my_structD2Ev
- exit _ZN9my_structD2Ev
- enter _Z3bazi
- exit _Z3bazi
- enter _ZN3FooC2EPKc
- enter _ZN9my_structC2EPKcS1_
- exit _ZN9my_structC2EPKcS1_
- exit _ZN3FooC2EPKc
- exit _ZN9my_structC2EPKcS1_
- exit _ZN3FooC2EPKc
- enter _Z3bari
- enter _Z3bazi
- exit _Z3bazi
- exit _Z3bari
- enter _ZN3Foo10my_privateEi
- enter _Z3bari
- enter _Z3bazi
- exit _Z3bazi
- exit _Z3bari
- exit _ZN3Foo10my_privateEi
- enter _ZN3Foo9my_publicEi
- enter _ZN3Foo10my_privateEi
- enter _Z3bari
- enter _Z3bazi
- exit _Z3bazi
- exit _Z3bari
- exit _ZN3Foo10my_privateEi
- exit _ZN3Foo9my_publicEi
- enter _ZN3FooD2Ev
- enter _ZN9my_structD2Ev
- exit _ZN9my_structD2Ev
- exit _ZN3FooD2Ev
- enter main
- enter _Z7get_maxIiET_S0_S0_
- exit _Z7get_maxIiET_S0_S0_
- enter _Z7get_maxIcET_S0_S0_
- exit _Z7get_maxIcET_S0_S0_
- enter _ZN9my_structC2EPKc
- exit _ZN9my_structC2EPKc
- exit main
- exit main
- exit _ZN9my_structC2EPKc
- writing to: cpp-example-1.s.2.s
munging cpp-example-2.s
- s_file=cpp-example-2.s
- s_file_lines=3131
- found __cyg_profile_func_(enter|exit}() calls: 16 in 0.000112s
- enter _Z8clean_upPi
- exit _Z8clean_upPi
- exit _Z8clean_upPi
- enter _ZN9my_structD2Ev
- exit _ZN9my_structD2Ev
- enter _ZN9my_structC2EPKc
- exit _ZN9my_structC2EPKc
- exit _ZN9my_structC2EPKc
- enter _Z41__static_initialization_and_destruction_0ii
- exit _Z41__static_initialization_and_destruction_0ii
- enter _ZN9my_structC2EPKc
- exit _ZN9my_structC2EPKc
- exit _ZN9my_structC2EPKc
- exit _Z41__static_initialization_and_destruction_0ii
- enter _GLOBAL__sub_I_my_struct_1
- exit _GLOBAL__sub_I_my_struct_1
- writing to: cpp-example-2.s.2.s
assembling cpp-example-1.s.2.s
assembling cpp-example-2.s.2.s
compiling compiling cwrap.c for linking
- linking: gcc -g -o cpp-example cpp-example-1.o cpp-example-2.o -lstdc++ 2>&1
- writing 17 missing cwrap structs to: cwrap.c
- _Z7bye_bazv bye_baz()
- _Z3byev bye()
- _Z3bazi baz(int)
- _Z3bari bar(int)
- _Z7get_maxIiET_S0_S0_ int get_max<int>(int, int)
- _Z7get_maxIcET_S0_S0_ char get_max<char>(char, char)
- main main
- _ZN9my_structC2EPKc my_struct::my_struct(char const*)
- _ZN9my_structC2EPKcS1_ my_struct::my_struct(char const*, char const*)
- _ZN9my_structD2Ev my_struct::~my_struct()
- _ZN3FooC2EPKc Foo::Foo(char const*)
- _ZN3Foo10my_privateEi Foo::my_private(int)
- _ZN3Foo9my_publicEi Foo::my_public(int)
- _ZN3FooD2Ev Foo::~Foo()
- _Z8clean_upPi clean_up(int*)
- _Z41__static_initialization_and_destruction_0ii __static_initialization_and_destruction_0(int, int)
- _GLOBAL__sub_I_my_struct_1 _GLOBAL__sub_I_my_struct_1
- linking: gcc -g -o cpp-example cwrap.c cpp-example-1.o cpp-example-2.o -lstdc++ 2>&1
running
+ _GLOBAL__sub_I_my_struct_1() {
+ __static_initialization_and_destruction_0() {
+ my_struct::my_struct() {
- constructing my_struct: arg1:my_struct_1
} // my_struct::my_struct() #1
} // __static_initialization_and_destruction_0() #1
} // _GLOBAL__sub_I_my_struct_1() #1
+ main() {
+ cwrap_log_verbosity_set(verbosity=0, name=bar) {
- i=0 .verbosity=1 .name=bye_baz
- i=1 .verbosity=1 .name=bye
- i=2 .verbosity=1 .name=baz
- i=3 .verbosity=1 .name=bar
- setting verbosity to zero
- i=4 .verbosity=1 .name=get_max<int>
- i=5 .verbosity=1 .name=get_max<char>
- i=6 .verbosity=1 .name=main
- i=7 .verbosity=1 .name=my_struct::my_struct
- i=8 .verbosity=1 .name=my_struct::my_struct
- i=9 .verbosity=1 .name=my_struct::~my_struct
- i=10 .verbosity=1 .name=Foo::Foo
- i=11 .verbosity=1 .name=Foo::my_private
- i=12 .verbosity=1 .name=Foo::my_public
- i=13 .verbosity=1 .name=Foo::~Foo
- i=14 .verbosity=1 .name=clean_up
- i=15 .verbosity=1 .name=__static_initialization_and_destruction_0
- i=16 .verbosity=1 .name=_GLOBAL__sub_I_my_struct_1
} // cwrap_log_verbosity_set()
- hello world
+ baz() {
} // baz() #1
+ get_max<int>() {
} // get_max<int>() #1
+ get_max<char>() {
} // get_max<char>() #1
+ my_struct::my_struct() {
- constructing my_struct: arg1:my_struct_2
} // my_struct::my_struct() #2
+ Foo::Foo() {
+ my_struct::my_struct() {
- constructing my_struct: arg1:my_struct_3 arg2:a
} // my_struct::my_struct() #1
- constructing Foo
} // Foo::Foo() #1
+ Foo::Foo() {
+ my_struct::my_struct() {
- constructing my_struct: arg1:my_struct_3 arg2:b
} // my_struct::my_struct() #2
- constructing Foo
} // Foo::Foo() #2
+ Foo::my_public() {
+ Foo::my_private() {
+ baz() {
} // baz() #2
} // Foo::my_private() #1
} // Foo::my_public() #1
+ baz() {
} // baz() #3
+ Foo::~Foo() {
- deconstructing Foo
+ my_struct::~my_struct() {
- deconstructing my_struct: my_struct_3
} // my_struct::~my_struct() #1
} // Foo::~Foo() #1
+ Foo::~Foo() {
- deconstructing Foo
+ my_struct::~my_struct() {
- deconstructing my_struct: my_struct_3
} // my_struct::~my_struct() #2
} // Foo::~Foo() #2
+ my_struct::~my_struct() {
- deconstructing my_struct: my_struct_2
} // my_struct::~my_struct() #3
+ clean_up() {
- my_int=5
} // clean_up() #1
} // main() #1
+ bye_baz() {
- called via atexit() via baz()
} // bye_baz() #1
+ bye_baz() {
- called via atexit() via baz()
} // bye_baz() #2
+ bye_baz() {
- called via atexit() via baz()
} // bye_baz() #3
+ bye() {
- called via atexit() via main()
} // bye() #1
+ cwrap_log_stats() {
- 3 call(s) to bye_baz()
- 1 call(s) to bye()
- 3 call(s) to baz(int)
- 0 call(s) to bar(int)
- 1 call(s) to get_max<int>(int, int)
- 1 call(s) to get_max<char>(char, char)
- 1 call(s) to main()
- 2 call(s) to my_struct::my_struct(char const*)
- 2 call(s) to my_struct::my_struct(char const*, char const*)
- 3 call(s) to my_struct::~my_struct()
- 2 call(s) to Foo::Foo(char const*)
- 1 call(s) to Foo::my_private(int)
- 1 call(s) to Foo::my_public(int)
- 2 call(s) to Foo::~Foo()
- 1 call(s) to clean_up(int*)
- 1 call(s) to __static_initialization_and_destruction_0(int, int)
- 1 call(s) to _GLOBAL__sub_I_my_struct_1()
} // cwrap_log_stats()
+ my_struct::~my_struct() {
- deconstructing my_struct: my_struct_1
} // my_struct::~my_struct() #4