@CharlieQiu2017
Last active October 10, 2025 04:14
Reading the C Language Frontend of GCC

Introduction

In this blog post we shall do some reading on the source code of GCC (https://github.com/gcc-mirror/gcc), specifically its C language front-end. There is a "GCC Internals" documentation (https://gcc.gnu.org/onlinedocs/gccint/) but it is described as "hopelessly outdated" (https://gotplt.org/posts/gcc-under-the-hood.html). Still, many parts of it are useful for navigating around the source tree. The slides available at https://www.cse.iitb.ac.in/grc/gcc-workshop-11/index.php are more modern. The source code contains detailed comments that try to explain what is going on, but many comments are likewise outdated. For example, there are many "FIXME" that date from decades ago. Another good source is https://thinkingeek.com/series/gcc-tiny/ which shows how to build a minimal front-end for GCC.

The best way to study the source code of GCC is to build a debug version of the compiler, provide it with simple inputs, and trace through its execution. We use the following configuration (adapted from https://gcc.gnu.org/wiki/DebuggingGCC) to build a debug version of GCC:

#!/bin/bash
export TARGET=aarch64-none-elf
export PREFIX=/opt/aarch64-none-elf-debug
export PROG_PREFIX=aarch64-none-elf-debug-
../gcc-14.2.0/configure \
--target=$TARGET \
--prefix=$PREFIX \
--program-prefix=$PROG_PREFIX \
--with-as=/opt/aarch64-none-elf/bin/aarch64-none-elf-as \
--with-ld=/opt/aarch64-none-elf/bin/aarch64-none-elf-ld \
--disable-bootstrap \
--without-headers \
--with-newlib \
--disable-analyzer \
--disable-cet \
--disable-decimal-float \
--disable-default-pie \
--disable-default-ssp \
--disable-fixed-point \
--disable-fixincludes \
--disable-gcov \
--disable-lto \
--disable-nls \
--disable-threads \
--disable-tls \
--disable-tm-clone-registry \
--disable-shared \
--disable-multilib \
--disable-libada \
--disable-libatomic \
--disable-libffi \
--disable-libgfortran \
--disable-libgm2 \
--disable-libgo \
--disable-libgomp \
--disable-libgrust \
--disable-libitm \
--disable-libobjc \
--disable-libphobos \
--disable-libquadmath \
--disable-libquadmath-support \
--disable-libsanitizer \
--disable-libssp \
--disable-libvtv \
--disable-hosted-libstdcxx \
--disable-checking \
--without-libatomic \
--without-libbacktrace \
--without-libdecnumber \
--without-isl \
--without-zstd \
--enable-languages=c,c++ \
C{,XX}FLAGS="-O0 -g"

When running make install after building, it seems necessary to use sudo --preserve-env=PATH make install, otherwise the installation script will complain it is not able to find aarch64-none-elf-ranlib.

The gcc executable is a small driver program that calls other programs to do the heavy work. These "other programs" include:

  • cpp: The C preprocessor, mostly implemented in a separate library called libcpp. However, unless the command-line option -no-integrated-cpp is present, gcc does not separately invoke cpp. Instead, the compiler internally calls libcpp to do the preprocessing work.
  • cc1 and cc1plus: The C/C++ compiler proper.
  • as (from binutils): The assembler.
  • ld (from binutils): The linker. Most of the time gcc does not invoke ld directly, but through a wrapper called collect2. Its purpose is documented in https://gcc.gnu.org/onlinedocs/gccint/Collect2.html.

In this blog post, we shall focus on the cc1 executable. The entrypoint of cc1 and cc1plus is the toplev::main() function in gcc/toplev.cc. The actions performed by toplev::main() are:

  1. Increase the stack limit to 64MiB.

  2. Call expandargv() in libiberty to handle so-called "response files". The purpose of this call is explained in https://gcc.gnu.org/wiki/Response_Files.

  3. Call general_init() in gcc/toplev.cc to initialize the GCC context. The most important actions of general_init() are:

    1. Initialize the diagnostic context global_dc. This structure is declared in gcc/diagnostic.h and defined in gcc/diagnostic-global-context.cc. Its type is class diagnostics::context, defined in gcc/diagnostics/context.h.

    2. Create a GCC context. This is a global variable simply called g. It is declared in gcc/context.h and defined in gcc/context.cc. The comment says:

      /* The global singleton context aka "g".
         (the name is chosen to be easy to type in a debugger).  */
      

      Its type is class gcc::context, defined in gcc/context.h. It simply contains a pass manager and a dump manager.

    3. Create a symbol table. This is a global variable called symtab. It is declared in gcc/cgraph.h and defined in gcc/cgraph.cc. Its type is class symbol_table, defined in gcc/cgraph.h. This structure is where the front-end writes its parsing result, in the form of trees (see below).

  4. Call init_options_once() (in gcc/opts-global.cc) and init_opts_obstack() (in gcc/opts.cc). They perform some early initialization before processing command-line options. The opts_obstack structure is an "obstack", an allocation arena that behaves like a stack: objects must be freed in the reverse order of their allocation, and freeing an object also frees everything allocated after it. See https://gcc.gnu.org/onlinedocs/libiberty/Obstacks.html. It is defined and used in gcc/opts-common.cc for processing command-line options.

    One of the actions performed by init_options_once() is to initialize a variable called initial_lang_mask. This variable is used to determine which command-line options this program can take. For cc1 and cc1plus this variable is set to CL_C and CL_CXX respectively. These macros are not found in the checked-in GCC sources. Instead, they are defined in gcc/options.h, a file generated by gcc/opth-gen.awk during the build process.

  5. Parse the command-line options.

    The file gcc/options.cc (dynamically generated during the build process) contains three symbols called global_options_init, global_options, and global_options_set. The first symbol has type const struct gcc_options and defines the default value of each option (excluding target and language-specific defaults). The second symbol has type struct gcc_options and contains the actually-used value of each option. The third symbol keeps track of which options have been modified from their default values. It is used to implement the macro SET_OPTION_IF_UNSET (see gcc/opts.h).

    After all options have been processed, the global_options variable will no longer be modified. A comment in the process_options() function (in gcc/toplev.cc) says:

    /* Please don't change global_options after this point, those changes won't
       be reflected in optimization_{default,current}_node.  */
    

    The most important command-line option to cc1 is of course the input filename. The C/C++ compiler can only take one translation unit at a time. This filename is stored in global_options.x_main_input_filename and can be simply accessed via main_input_filename.

    One of the functions called during this process is c_common_init_options() in gcc/c-family/c-opts.cc. This function sets up a global variable called parse_in. It is declared in gcc/c-family/c-pragma.h and defined in gcc/c-family/c-common.cc. It represents the interface to the preprocessor library. Its type is struct cpp_reader, but the definition of that type is not exposed in the public headers of libcpp. Instead, it is defined in libcpp/internal.h.

    The parse_in structure is initialized by calling cpp_create_reader(), declared in libcpp/include/cpplib.h. After creating a cpp_reader structure, the behavior of the preprocessor library can be fine-tuned by calling cpp_get_options() and cpp_get_callbacks(). See libcpp/include/cpplib.h for the list of options and callbacks available. For the C/C++ front-ends, most callbacks are set up in the init_c_lex() function defined in gcc/c-family/c-lex.cc. Additionally one can call cpp_get_deps() to obtain an instance of class mkdeps. This is a gadget for printing out the header-dependency information of a translation unit.

    After setting up the options and callbacks, the c_common_post_options() function calls cpp_read_main_file() to specify the input file for libcpp to read. The cpp_read_main_file() function initializes a global variable called line_table. It is declared in gcc/input.h and defined in gcc/input.cc. The purpose of this data structure is to map "location" values to actual locations (line and column numbers) in source files. We will encounter this as we explain the lexer.

  6. Call do_compile() to compile the input file.

    Most of the hard work is actually done in the compile_file() function. This function is split into three phases:

    • TV_PHASE_PARSING: The function calls lang_hooks.parse_file() to parse the input file.
    • TV_PHASE_OPT_GEN: The function calls symtab->finalize_compilation_unit() to run the optimization passes.
    • TV_PHASE_LATE_ASM: The function makes various calls to print the assembler directives to file.

    The parsing step is usually called the "front-end". The optimization step is divided into two large steps called the GIMPLE passes and the RTL passes. The GIMPLE passes are target-independent and are sometimes called the "middle-end". The RTL passes are target-dependent and are sometimes called the "back-end".
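
To make the control flow of step 6 concrete, here is a minimal C sketch of the three phases. It is not GCC's code: struct lang_hooks_sketch, compile_file_sketch(), and the phase log are hypothetical names made up for illustration, and the optimization and assembler-output phases are reduced to log entries. Only the ordering of the phases and the indirection through a parse_file function pointer follow the description above.

```c
#include <stddef.h>

/* Hypothetical phase identifiers, named after the TV_PHASE_* timers.  */
enum phase_sk { PHASE_PARSING_SK, PHASE_OPT_GEN_SK, PHASE_LATE_ASM_SK };

/* Hypothetical stand-in for lang_hooks: the language-independent code
   only sees a function pointer, so each front-end plugs in its own
   parser.  */
struct lang_hooks_sketch
{
  void (*parse_file) (void);
};

/* Records the order in which the phases ran.  */
enum phase_sk phase_log_sk[3];
size_t phase_count_sk;

void record_phase_sk (enum phase_sk phase)
{
  if (phase_count_sk < 3)
    phase_log_sk[phase_count_sk++] = phase;
}

/* A pretend C front-end entry point.  */
void c_parse_file_sketch (void)
{
  record_phase_sk (PHASE_PARSING_SK);
}

/* Mirrors the overall shape of compile_file () as described above.  */
void compile_file_sketch (const struct lang_hooks_sketch *hooks)
{
  hooks->parse_file ();                 /* front-end: parse the input */
  record_phase_sk (PHASE_OPT_GEN_SK);   /* optimization passes */
  record_phase_sk (PHASE_LATE_ASM_SK);  /* assembler output */
}
```

A different front-end would only need to supply a different parse_file function; compile_file_sketch() itself would not change.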

The Front-end

The Interface Between Front-end and Middle-end

The interface between the front-end and the middle-end is the symbol table, specifically the symtab global variable defined in gcc/cgraph.cc. However, most operations on the symbol table are defined in gcc/cgraphunit.cc. The comments in gcc/cgraphunit.cc say:

   The front-end is supposed to use following functionality:

    - finalize_function

      This function is called once front-end has parsed whole body of function
      and it is certain that the function body nor the declaration will change.

      (There is one exception needed for implementing GCC extern inline
	function.)

    - varpool_finalize_decl

      This function has same behavior as the above but is used for static
      variables.

    - add_asm_node

      Insert new toplevel ASM statement

    - finalize_compilation_unit

      This function is called once (source level) compilation unit is finalized
      and it will no longer change.

      The symbol table is constructed starting from the trivially needed
      symbols finalized by the frontend.  Functions are lowered into
      GIMPLE representation and callgraph/reference lists are constructed.
      Those are used to discover other necessary functions and variables.

      At the end the bodies of unreachable functions are removed.

      The function can be called multiple times when multiple source level
      compilation units are combined.

In theory this means that the front-end should call finalize_function() for each function and varpool_finalize_decl() for each static variable. However, varpool_finalize_decl() no longer exists under that name; it has been renamed to varpool_node::finalize_decl(). It is also not called by the C/C++ front-ends (it is entirely unused in C and used only in some very special cases in C++). Instead, the front-ends call rest_of_decl_compilation() in gcc/passes.cc, and this function only calls varpool_node::get_create(). After the front-end finishes the parsing task, the finalize_compilation_unit() function will call process_function_and_variable_attributes() (in cgraphunit.cc), which will call varpool_node::finalize_decl() for each variable.
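
The "register now, finalize later" flow described above can be sketched in a few lines of C. This is a toy model under our own assumptions, not GCC's data structures: the fixed-size varpool_sk array and all *_sketch functions are hypothetical, and a variable is identified only by its name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_VARS_SK 8

/* Hypothetical stand-in for a varpool node.  */
struct varpool_node_sk
{
  const char *name;
  bool finalized;
};

struct varpool_node_sk varpool_sk[MAX_VARS_SK];
size_t varpool_used_sk;

/* In the spirit of varpool_node::get_create(): return the node for
   NAME, creating it on first use.  Nothing is finalized here.  */
struct varpool_node_sk *get_create_sketch (const char *name)
{
  for (size_t i = 0; i < varpool_used_sk; i++)
    if (strcmp (varpool_sk[i].name, name) == 0)
      return &varpool_sk[i];
  if (varpool_used_sk == MAX_VARS_SK)
    return NULL;
  varpool_sk[varpool_used_sk].name = name;
  varpool_sk[varpool_used_sk].finalized = false;
  return &varpool_sk[varpool_used_sk++];
}

/* What the front-end does for each declaration: only register it.  */
void rest_of_decl_compilation_sketch (const char *name)
{
  get_create_sketch (name);
}

/* Once parsing is done, walk all registered variables and finalize
   them in one go.  */
void finalize_compilation_unit_sketch (void)
{
  for (size_t i = 0; i < varpool_used_sk; i++)
    varpool_sk[i].finalized = true;
}
```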

There is also a rest_of_type_compilation() function in passes.cc, to be called for each type declared. However, this function is almost empty, consisting only of error checking and debugging calls.

The Tree Structure

The declarations that the front-end provides to the middle-end are presented in structures called trees. The concrete type for storing a tree is simply called tree, defined in gcc/coretypes.h:

union tree_node;
typedef union tree_node *tree;
typedef const union tree_node *const_tree;

The "tree" structure is often not really a tree as it contains back references and cross references.

The definitions and helper macros and functions for tree are found in gcc/tree-core.h, gcc/tree.h, and gcc/tree.cc. The tree_node type is a union of various types called struct tree_*. All such structures begin with a member called base with type struct tree_base. The base member might be nested. For example tree_typed has tree_base as its first member, while tree_common has tree_typed as its first member. The tree_base structure begins with an enum called tree_code, which identifies the kind of information this tree node contains.
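
The nesting of base can be illustrated with a reduced model. The following C sketch is an assumption-laden miniature, not the real definitions: all names carry an _sk suffix to mark them as hypothetical, and each struct has far fewer fields than its GCC counterpart. It only shows why the tree code can be read from any node without knowing which union member is in use.

```c
/* Two made-up tree codes.  */
enum tree_code_sk { IDENTIFIER_NODE_SK, INTEGER_CST_SK };

union tree_node_sk;
typedef union tree_node_sk *tree_sk;

/* The innermost header: starts with the tree code.  */
struct tree_base_sk
{
  enum tree_code_sk code;
};

/* Wraps tree_base_sk as its first member.  */
struct tree_typed_sk
{
  struct tree_base_sk base;
  tree_sk type;
};

/* Wraps tree_typed_sk as its first member.  */
struct tree_common_sk
{
  struct tree_typed_sk typed;
  tree_sk chain;
};

/* A payload-carrying node kind.  */
struct tree_int_cst_sk
{
  struct tree_typed_sk typed;
  long value;
};

union tree_node_sk
{
  struct tree_base_sk base;
  struct tree_typed_sk typed;
  struct tree_common_sk common;
  struct tree_int_cst_sk int_cst;
};

/* Every member begins, possibly after some nesting, with a
   tree_base_sk at offset 0, so the code is always found there.  */
enum tree_code_sk tree_code_of_sk (const union tree_node_sk *t)
{
  return t->base.code;
}
```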

The various tree codes are defined in gcc/tree.def, gcc/c-family/c-common.def, gcc/c/c-tree.def, and gcc/cp/cp-tree.def. A tree that only uses tree codes in gcc/tree.def is said to be GENERIC, otherwise it is said to contain language-specific tree codes.

The correspondence between tree codes and the tree_* types can be found as follows. Each tree_* type has a corresponding enum entry in enum tree_node_structure_enum (see gcc/tree.h and gcc/treestruct.def). In gcc/tree.cc there is a two-dimensional array called tree_contains_struct. The first dimension corresponds to tree codes. The second dimension corresponds to the enum index of a tree_* type. This array is initialized in the initialize_tree_contains_struct() function, called by the init_ttree() function. Language-specific tree codes are initialized via the init_ts() hook function. If tree_contains_struct[code][ts] is true, then a tree node with tree code code can be cast to tree_TS * where tree_TS is the type corresponding to the enum index ts. For a given code, there may be multiple ts such that tree_contains_struct[code][ts] is true. This is due to the inheritance structure of various tree codes.
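
As a rough picture of this lookup table, consider the following C sketch. The two codes, the five structure indices, and the entries filled in by init_contains_struct_sk() are illustrative assumptions, not a copy of what GCC's initialization actually records; the point is only that one code can map to several structure views.

```c
#include <stdbool.h>

/* Made-up tree codes.  */
enum code_sk { IDENTIFIER_NODE_SK, VAR_DECL_SK, NUM_CODES_SK };

/* Made-up indices, one per struct type a node may be viewed as.  */
enum ts_sk
{
  TS_BASE_SK,
  TS_TYPED_SK,
  TS_COMMON_SK,
  TS_IDENTIFIER_SK,
  TS_DECL_MINIMAL_SK,
  NUM_TS_SK
};

/* contains_struct_sk[code][ts] is true when a node with that code may
   be viewed through the struct type with index ts.  */
bool contains_struct_sk[NUM_CODES_SK][NUM_TS_SK];

void init_contains_struct_sk (void)
{
  for (int code = 0; code < NUM_CODES_SK; code++)
    {
      /* In this toy table both codes share the three header layers.  */
      contains_struct_sk[code][TS_BASE_SK] = true;
      contains_struct_sk[code][TS_TYPED_SK] = true;
      contains_struct_sk[code][TS_COMMON_SK] = true;
    }
  /* Each code additionally has its own most-derived view.  */
  contains_struct_sk[IDENTIFIER_NODE_SK][TS_IDENTIFIER_SK] = true;
  contains_struct_sk[VAR_DECL_SK][TS_DECL_MINIMAL_SK] = true;
}

/* The check one would perform before casting a node.  */
bool may_view_as_sk (enum code_sk code, enum ts_sk ts)
{
  return contains_struct_sk[code][ts];
}
```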

The tree codes are classified into "tree code classes". Common classes include tcc_type, tcc_constant, tcc_declaration, tcc_expression, tcc_reference, tcc_exceptional. There are also specialized classes for expressions including tcc_comparison, tcc_unary, tcc_binary, tcc_statement, tcc_vl_exp.
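
The .def files associate each tree code with its class by listing entries of the form DEFTREECODE (SYM, NAME, TYPE, LENGTH) and expanding that macro in different ways. The C sketch below imitates the idea with a three-entry list. It is a simplification under our own assumptions: the list lives in a macro instead of a separate file that is included several times, the LENGTH field is dropped, and every name with an _SK suffix is hypothetical.

```c
/* Made-up class identifiers, modelled on tcc_exceptional etc.  */
enum class_sk { CLASS_EXCEPTIONAL_SK, CLASS_CONSTANT_SK, CLASS_BINARY_SK };

/* A hypothetical three-entry stand-in for gcc/tree.def.  */
#define SKETCH_TREE_CODES(X)                                      \
  X (IDENTIFIER_NODE_SK, "identifier_node", CLASS_EXCEPTIONAL_SK) \
  X (INTEGER_CST_SK, "integer_cst", CLASS_CONSTANT_SK)            \
  X (PLUS_EXPR_SK, "plus_expr", CLASS_BINARY_SK)

/* First expansion: the enum of tree codes.  */
#define AS_ENUM_SK(sym, name, cls) sym,
enum code_sk { SKETCH_TREE_CODES (AS_ENUM_SK) NUM_CODES_SK };
#undef AS_ENUM_SK

/* Second expansion: a table from tree code to its class.  */
#define AS_CLASS_SK(sym, name, cls) cls,
const enum class_sk code_class_sk[] = { SKETCH_TREE_CODES (AS_CLASS_SK) };
#undef AS_CLASS_SK

/* Third expansion: a table from tree code to its printable name.  */
#define AS_NAME_SK(sym, name, cls) name,
const char *const code_name_sk[] = { SKETCH_TREE_CODES (AS_NAME_SK) };
#undef AS_NAME_SK
```

Because all three tables come from the same list, adding a tree code is a one-line change and the tables cannot drift out of sync.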

GCC further defines another kind of tree called GIMPLE. This is the representation that the middle-end uses, and all front-ends eventually produce GIMPLE trees. GIMPLE trees are represented using another data type called gimple. The tree codes used by GIMPLE are defined in gcc/gimple.def, and the relevant definitions and macros are in gcc/gimple.h. See https://gcc.gnu.org/onlinedocs/gccint/GENERIC.html and https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html. See also https://ftp.tsukuba.wide.ad.jp/software/gcc/summit/2003/GENERIC%20and%20GIMPLE.pdf.

In most cases, the language front-ends first construct a tree that contains language-specific tree codes. This is the tree dumped by -fdump-tree-original-raw. The function printing the dump is dump_node() in gcc/tree-dump.cc. There is also a debug_tree() function in gcc/print-tree.cc for inspecting trees in GDB. There used to be a functionality called the "tree browser" (https://gcc.gnu.org/projects/tree-ssa/tree-browser.html), but it was removed due to lack of maintenance (https://gcc.gnu.org/legacy-ml/gcc-patches/2015-07/msg01786.html).

After constructing the tree, the pass manager executes several passes to "lower" the tree into GIMPLE. The "GCC Internals" documentation says:

There was some work done on a genericization pass which would run first, but the existence of STMT_EXPR meant that
in order to convert all of the C statements into GENERIC equivalents would involve walking the entire tree anyway,
so it was simpler to lower all the way.

For this reason, the remaining parts of this blog post shall focus only on GENERIC trees with language-specific codes.

The Parser

A good way to study how the parser works is to run cc1 under gdb, provide it with a simple input file, and trace through its execution. We will use the following input file with name src.c:

int func (void) { return 42; }

Our input file does not contain any preprocessor directives. For more complex input files, it is convenient to preprocess the source file first:

cpp -std=c11 -P src.c -o src.i

The -std=c11 option tells cpp not to predefine certain macros that are not standard-compliant. The -P option tells cpp not to emit line number markers. The output file should contain no preprocessor directives. In this case we can pass the option -fpreprocessed to the compiler, so that the preprocessor will almost never get in the way during parsing.

Our command for invoking cc1 is:

gdb --args /opt/aarch64-none-elf/libexec/gcc/aarch64-none-elf/14.2.0/cc1 \
src.c -dumpbase src.c -dumpbase-ext .c -fpreprocessed \
-march=armv8-a+crc+crypto -mtune=cortex-a72.cortex-a53 \
-mlittle-endian -mabi=lp64 \
-Wall -Wextra -Wpedantic -Werror -Wfatal-errors \
-std=c11 -ffreestanding -fno-ident \
-fomit-frame-pointer -fno-asynchronous-unwind-tables \
-fcf-protection=none -fno-stack-protector \
-fno-stack-clash-protection

We disable most hardening options to minimize the output of cc1.

For C and C++ the lang_hooks.parse_file() function is the c_common_parse_file() function, defined in gcc/c-family/c-opts.cc. This function is a wrapper over the c_parse_file() function which is defined separately for C and C++. For C, the c_parse_file() function is defined in gcc/c/c-parser.cc. For C++, the c_parse_file() function is defined in gcc/cp/parser.cc. It is the entrypoint to a recursive-descent parser.

Before calling c_parse_file(), the c_common_parse_file() function first calls dumps->dump_start (TDI_original, &dump_flags). If the command-line option -fdump-tree-original is provided, the parser will dump its original output before any optimization passes. The actual place where the dump is generated is the c_genericize() function, defined in gcc/c-family/c-gimplify.cc. For C++, this function is indirectly called by the cp_genericize() function, defined in gcc/cp/cp-gimplify.cc. By default, the output is a "pretty-printed" form of the source code. However, one can use -fdump-tree-original-raw to get a raw form of the internal representation.

Next, c_parse_file() looks at the first token of the source file and checks whether it is a PCH (Precompiled Header) directive. To do so, it calls c_parser_peek_token(), which reads the first token from the input but does not consume it.
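
"Read but do not consume" is the classic one-token lookahead pattern. The C sketch below shows the idea with a hypothetical struct parser_sk whose token stream is just an array of integers; the function names are ours and the real c_parser has more lookahead slots and much more state.

```c
#include <stddef.h>

/* A token reduced to its type.  */
struct token_sk
{
  int type;
};

/* Hypothetical parser state with a one-slot lookahead buffer.  */
struct parser_sk
{
  const int *input;              /* pretend token stream */
  size_t pos;                    /* next unread position in input */
  struct token_sk tokens_buf[1]; /* lookahead buffer */
  int tokens_avail;              /* number of buffered tokens */
};

/* Stand-in for the lexer: fetch one token from the stream.  */
void lex_one_token_sk (struct parser_sk *p, struct token_sk *tok)
{
  tok->type = p->input[p->pos++];
}

/* Return the next token without consuming it.  The lexer is only
   called when the buffer is empty, so repeated peeks are free.  */
struct token_sk *peek_token_sk (struct parser_sk *p)
{
  if (p->tokens_avail == 0)
    {
      lex_one_token_sk (p, &p->tokens_buf[0]);
      p->tokens_avail = 1;
    }
  return &p->tokens_buf[0];
}

/* Drop the buffered token so that the next peek lexes a new one.  */
void consume_token_sk (struct parser_sk *p)
{
  if (p->tokens_avail == 0)
    peek_token_sk (p);
  p->tokens_avail = 0;
}
```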

The interfaces exposed by libcpp for obtaining the next token in the source file are cpp_get_token() and cpp_get_token_with_location(). They are indirectly called through wrappers named get_token() and get_token_no_padding(), defined in gcc/c-family/c-lex.cc. The difference between the two interfaces is that cpp_get_token_with_location() additionally returns a "location" for the token. The "location" is a value of type location_t (defined as uint64_t in libcpp/include/line-map.h). The comments explain it as follows:

/* The typedef "location_t" is a key within the location database,
   identifying a source location or macro expansion, along with range
   information, and (optionally) a pointer for use by gcc.

   This key only has meaning in relation to a line_maps instance.  Within
   gcc there is a single line_maps instance: "line_table", declared in
   gcc/input.h and defined in gcc/input.cc.

   The values of the keys are intended to be internal to libcpp, but for
   ease-of-understanding the implementation, they are currently assigned as
   follows in the case of 32-bit location_t:

  Actual     | Value                         | Meaning
  -----------+-------------------------------+-------------------------------
  0x00000000 | UNKNOWN_LOCATION (gcc/input.h)| Unknown/invalid location.
  -----------+-------------------------------+-------------------------------
  0x00000001 | BUILTINS_LOCATION             | The location for declarations
             |   (gcc/input.h)               | in "<built-in>"
  -----------+-------------------------------+-------------------------------
  0x00000002 | RESERVED_LOCATION_COUNT       | The first location to be
             | (also                         | handed out, and the
             |  ordmap[0]->start_location)   | first line in ordmap 0
  -----------+-------------------------------+-------------------------------
             | ordmap[1]->start_location     | First line in ordmap 1
             | ordmap[1]->start_location+32  | First column in that line
             |   (assuming range_bits == 5)  |
             | ordmap[1]->start_location+64  | 2nd column in that line
             | ordmap[1]->start_location+4096| Second line in ordmap 1
             |   (assuming column_bits == 12)
             |
             |   Subsequent lines are offset by (1 << column_bits),
             |   e.g. 4096 for 12 bits, with a column value of 0 representing
             |   "the whole line".
             |
             |   Within a line, the low "range_bits" (typically 5) are used for
             |   storing short ranges, so that there's an offset of
             |     (1 << range_bits) between individual columns within a line,
             |   typically 32.
             |   The low range_bits store the offset of the end point from the
             |   start point, and the start point is found by masking away
             |   the range bits.
             |
             |   For example:
             |      ordmap[1]->start_location+64    "2nd column in that line"
             |   above means a caret at that location, with a range
             |   starting and finishing at the same place (the range bits
             |   are 0), a range of length 1.
             |
             |   By contrast:
             |      ordmap[1]->start_location+68
             |   has range bits 0x4, meaning a caret with a range starting at
             |   that location, but with endpoint 4 columns further on: a range
             |   of length 5.
             |
             |   Ranges that have caret != start, or have an endpoint too
             |   far away to fit in range_bits are instead stored as ad-hoc
             |   locations.  Hence for range_bits == 5 we can compactly store
             |   tokens of length <= 32 without needing to use the ad-hoc
             |   table.
             |
             |   This packing scheme means we effectively have
             |     (column_bits - range_bits)
             |   of bits for the columns, typically (12 - 5) = 7, for 128
             |   columns; longer line widths are accomodated by starting a
             |   new ordmap with a higher column_bits.
             |
             | ordmap[2]->start_location-1   | Final location in ordmap 1
  -----------+-------------------------------+-------------------------------
             | ordmap[2]->start_location     | First line in ordmap 2
             | ordmap[3]->start_location-1   | Final location in ordmap 2
  -----------+-------------------------------+-------------------------------
             |                               | (etc)
  -----------+-------------------------------+-------------------------------
             | ordmap[n-1]->start_location   | First line in final ord map
             |                               | (etc)
             | set->highest_location - 1     | Final location in that ordmap
  -----------+-------------------------------+-------------------------------
             | set->highest_location         | Location of the where the next
             |                               | ordinary linemap would start
  -----------+-------------------------------+-------------------------------
             |                               |
             |                  VVVVVVVVVVVVVVVVVVVVVVVVVVV
             |                  Ordinary maps grow this way
             |
             |                    (unallocated integers)
             |
  0x60000000 | LINE_MAP_MAX_LOCATION_WITH_COLS
             |   Beyond this point, ordinary linemaps have 0 bits per column:
             |   each increment of the value corresponds to a new source line.
             |
  0x70000000 | LINE_MAP_MAX_LOCATION
             |   Beyond the point, we give up on ordinary maps; attempts to
             |   create locations in them lead to UNKNOWN_LOCATION (0).
             |
             |                    (unallocated integers)
             |
             |                   Macro maps grow this way
             |                   ^^^^^^^^^^^^^^^^^^^^^^^^
             |                               |
  -----------+-------------------------------+-------------------------------
             | LINEMAPS_MACRO_LOWEST_LOCATION| Locations within macro maps
             | macromap[m-1]->start_location | Start of last macro map
             |                               |
  -----------+-------------------------------+-------------------------------
             | macromap[m-2]->start_location | Start of penultimate macro map
  -----------+-------------------------------+-------------------------------
             | macromap[1]->start_location   | Start of macro map 1
  -----------+-------------------------------+-------------------------------
             | macromap[0]->start_location   | Start of macro map 0
  0x7fffffff | MAX_LOCATION_T                | Also used as a mask for
             |                               | accessing the ad-hoc data table
  -----------+-------------------------------+-------------------------------
  0x80000000 | Start of ad-hoc values; the lower 31 bits are used as an index
  ...        | into the line_table->location_adhoc_data_map.data array.
  0xffffffff | UINT_MAX                      |
  -----------+-------------------------------+-------------------------------

   Examples of location encoding.

   Packed ranges
   =============

   Consider encoding the location of a token "foo", seen underlined here
   on line 523, within an ordinary line_map that starts at line 500:

                 11111111112
        12345678901234567890
     522
     523   return foo + bar;
                  ^~~
     524

   The location's caret and start are both at line 523, column 11; the
   location's finish is on the same line, at column 13 (an offset of 2
   columns, for length 3).

   Line 523 is offset 23 from the starting line of the ordinary line_map.

   caret == start, and the offset of the finish fits within 5 bits, so
   this can be stored as a packed range.

   This is encoded as:
      ordmap->start
         + (line_offset << ordmap->m_column_and_range_bits)
         + (column << ordmap->m_range_bits)
         + (range_offset);
   i.e. (for line offset 23, column 11, range offset 2):
      ordmap->start
         + (23 << 12)
         + (11 << 5)
         + 2;
   i.e.:
      ordmap->start + 0x17162
   assuming that the line_map uses the default of 7 bits for columns and
   5 bits for packed range (giving 12 bits for m_column_and_range_bits).


   "Pure" locations
   ================

   These are a special case of the above, where
      caret == start == finish
   They are stored as packed ranges with offset == 0.
   For example, the location of the "f" of "foo" could be stored
   as above, but with range offset 0, giving:
      ordmap->start
         + (23 << 12)
         + (11 << 5)
         + 0;
   i.e.:
      ordmap->start + 0x17160


   Unoptimized ranges
   ==================

   Consider encoding the location of the binary expression
   below:

                 11111111112
        12345678901234567890
     522
     523   return foo + bar;
                  ~~~~^~~~~
     524

   The location's caret is at the "+", line 523 column 15, but starts
   earlier, at the "f" of "foo" at column 11.  The finish is at the "r"
   of "bar" at column 19.

   This can't be stored as a packed range since start != caret.
   Hence it is stored as an ad-hoc location e.g. 0x80000003.

   Stripping off the top bit gives us an index into the ad-hoc
   lookaside table:

     line_table->location_adhoc_data_map.data[0x3]

   from which the caret, start and finish can be looked up,
   encoded as "pure" locations:

     start  == ordmap->start + (23 << 12) + (11 << 5)
            == ordmap->start + 0x17160  (as above; the "f" of "foo")

     caret  == ordmap->start + (23 << 12) + (15 << 5)
            == ordmap->start + 0x171e0

     finish == ordmap->start + (23 << 12) + (19 << 5)
            == ordmap->start + 0x17260

   To further see how location_t works in practice, see the
   worked example in libcpp/location-example.txt.  */

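The explanation above describes a 32-bit location_t. Newer GCC sources define location_t as a 64-bit type, but the scheme is mostly unchanged, and the GCC 14.2.0 build traced here still appears to use the 32-bit layout (in the gdb session below, the addresses of token->location and token->value are four bytes apart). We can ignore the part about "macro maps" since we have preprocessed our input. By the time the c_parse_file() function is called, the line_table structure already contains two "ordinary map" entries: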

(gdb) p *line_table
$1 = {info_ordinary = {maps = 0x7472db79b000, allocated = 341, used = 2, m_cache = 1}, info_macro = {maps = 0x0, allocated = 0, used = 0, m_cache = 0}, 
  depth = 1, trace_includes = false, seen_line_directive = false, highest_location = 64, highest_line = 64, max_column_hint = 128, 
  m_reallocator = 0x12d444c <realloc_for_line_map(void*, size_t)>, m_round_alloc_size = 0xc269ae <ggc_round_alloc_size(unsigned long)>, 
  m_location_adhoc_data_map = {htab = 0x3d723320, curr_loc = 0, allocated = 0, data = 0x0}, builtin_location = 1, default_range_bits = 5, 
  m_num_optimized_ranges = 0, m_num_unoptimized_ranges = 0}

The reason is that, since we passed the option -fpreprocessed to libcpp, the preprocessor library believes that the input file is generated from another source file, and attempts to locate this source file. This will not succeed, since no such information is embedded into the input file. In this case libcpp writes two entries into line_table. First it writes an entry with reason LC_ENTER, meaning we just started processing the main input file. Then it writes an entry with reason LC_RENAME_VERBATIM, meaning this file is generated from another source file, but that file cannot be found. See the cpp_read_main_file() function in libcpp/init.cc.

Now we look at the second entry in line_table->info_ordinary (this corresponds to ordmap[1] above):

(gdb) p line_table->info_ordinary->maps[1]
$2 = {<line_map> = {start_location = 64}, reason = LC_RENAME_VERBATIM, sysp = 0 '\000', m_column_and_range_bits = 12, m_range_bits = 5, 
  to_file = 0x3d713c70 "src.c", to_line = 1, included_from = 0}

We see that the start_location is 64, range_bits is 5, and column_and_range_bits is 12. Hence the first column of the first line corresponds to location 96, the second column corresponds to location 128, etc.
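
The arithmetic can be checked with a small C sketch of the packed-range formula quoted in the comment above. The struct ordmap_sk type and the two functions are hypothetical helpers written for this post, not libcpp interfaces, and they only cover packed locations inside a single ordinary map (no ad-hoc or macro locations).

```c
#include <stdint.h>

/* The three fields of an ordinary map that the formula needs.  */
struct ordmap_sk
{
  uint32_t start_location;
  unsigned column_and_range_bits;
  unsigned range_bits;
};

/* Line offset, column, and range offset recovered from a location.  */
struct decoded_sk
{
  uint32_t line_offset;
  uint32_t column;
  uint32_t range_offset;
};

/* start + (line_offset << column_and_range_bits)
         + (column << range_bits) + range_offset  */
uint32_t encode_packed_sk (const struct ordmap_sk *m, uint32_t line_offset,
                           uint32_t column, uint32_t range_offset)
{
  return m->start_location
         + (line_offset << m->column_and_range_bits)
         + (column << m->range_bits)
         + range_offset;
}

/* The inverse: peel the three fields off again.  */
struct decoded_sk decode_packed_sk (const struct ordmap_sk *m, uint32_t loc)
{
  uint32_t rel = loc - m->start_location;
  uint32_t line_mask = (1u << m->column_and_range_bits) - 1;
  uint32_t range_mask = (1u << m->range_bits) - 1;
  struct decoded_sk d;
  d.line_offset = rel >> m->column_and_range_bits;
  d.column = (rel & line_mask) >> m->range_bits;
  d.range_offset = rel & range_mask;
  return d;
}
```

With start_location 64, 12 column-and-range bits, and 5 range bits, column 1 of the first line encodes to 64 + 32 = 96 and column 2 to 128, as stated above. The value 98 that gdb prints for *loc in the session below decodes to line offset 0, column 1, range offset 2, which is consistent with a three-character token such as int starting in the first column.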

Now we trace the execution of the function c_parser_peek_token().

Breakpoint 1, c_parse_file () at ../../gcc-14.2.0/gcc/c/c-parser.cc:26868
26868	  memset (&tparser, 0, sizeof tparser);
(gdb) n
26869	  tparser.translate_strings_p = true;
(gdb) n
26870	  tparser.tokens = &tparser.tokens_buf[0];
(gdb) n
26871	  the_parser = &tparser;
(gdb) n
26873	  if (c_parser_peek_token (&tparser)->pragma_kind == PRAGMA_GCC_PCH_PREPROCESS)
(gdb) step
c_parser_peek_token (parser=0x7ffe76075490) at ../../gcc-14.2.0/gcc/c/c-parser.cc:513
513	  if (parser->tokens_avail == 0)
(gdb) n
515	      c_lex_one_token (parser, &parser->tokens[0]);
(gdb) step
c_lex_one_token (parser=0x7ffe76075490, token=0x7ffe76075498, raw=false) at ../../gcc-14.2.0/gcc/c/c-parser.cc:307
307	  timevar_push (TV_LEX);
(gdb) n
309	  if (raw || vec_safe_length (parser->raw_tokens) == 0)
(gdb) n
313					      (parser->lex_joined_string
(gdb) n
311	      token->type = c_lex_with_flags (&token->value, &token->location,
(gdb) step
c_lex_with_flags (value=0x7ffe760754a0, loc=0x7ffe7607549c, cpp_flags=0x7ffe760754a8 "", lex_flags=2) at ../../gcc-14.2.0/gcc/c-family/c-lex.cc:567
567	  unsigned char add_flags = 0;
(gdb) n
568	  enum overflow_type overflow = OT_NONE;
(gdb) n
570	  timevar_push (TV_CPP);
(gdb) n
572	  tok = get_token (parse_in, loc);
(gdb) n
573	  type = tok->type;
(gdb) p *loc
$1 = 98

Between c_parser_peek_token() and get_token() there are two additional wrapper functions. The purpose of c_lex_with_flags() is to convert the token into a tree node (see below). It also sets some flags on the tree node based on the content of the token. The purpose of c_lex_one_token() is to check whether the returned token is a keyword of C/C++. If so, it changes the type of the token from CPP_NAME to CPP_KEYWORD.
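
To make the reclassification step concrete, here is a toy version of it. All names are hypothetical, and the keyword list is deliberately tiny. Note that this sketch compares strings, whereas GCC consults a flag stored on the identifier's tree node (described below), so it illustrates only the effect, not the mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy token kinds standing in for CPP_NAME and CPP_KEYWORD.  */
enum toy_token_type { TOY_NAME, TOY_KEYWORD };

struct toy_token {
  enum toy_token_type type;
  const char *spelling;
};

/* A deliberately tiny keyword list, for illustration only.  */
static const char *const toy_keywords[] = { "int", "return", "void" };

/* Turn a name token into a keyword token if its spelling is reserved.
   Tokens of any other kind are left untouched.  */
static void
toy_classify (struct toy_token *tok)
{
  size_t i;
  assert (tok != NULL && tok->spelling != NULL);
  if (tok->type != TOY_NAME)
    return;
  for (i = 0; i < sizeof toy_keywords / sizeof toy_keywords[0]; i++)
    if (strcmp (tok->spelling, toy_keywords[i]) == 0)
      {
        tok->type = TOY_KEYWORD;
        return;
      }
}
```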

When libcpp reads a token that it believes is an identifier, it automatically associates the identifier with a tree. This is achieved as follows. When c_common_init_options() calls cpp_create_reader(), it passes to libcpp pointers to two hash tables called ident_hash and ident_hash_extra, defined in gcc/stringpool.cc. The comments for cpp_create_reader() say:

/* The first hash table argument is for associating a struct cpp_hashnode
   with each identifier.  The second hash table argument is for associating
   a struct cpp_hashnode_extra with each identifier that needs one.  For
   either, pass in a NULL pointer if you want cpplib to create and manage
   the hash table itself, or else pass a suitably initialized hash table to
   be managed external to libcpp, as is done by the C-family frontends.  */

The second hash table seems to be only used for processing #pragma GCC poison. We can ignore it for now.

Each hash table contains a callback function called alloc_node for allocating nodes. The init_stringpool() function in gcc/stringpool.cc will set up these callbacks, so that when a hash table node for an identifier is requested, the callback actually allocates a tree_node, which embeds a hash table node. The tree code is IDENTIFIER_NODE, with structure type tree_identifier. If the identifier is in fact a C keyword, the lang_flag_0 bit in tree_base will be set. (See the c_parse_init() function in gcc/c/c-parser.cc.) The hash table node and the tree node can be interconverted via the macros HT_IDENT_TO_GCC_IDENT() and GCC_IDENT_TO_HT_IDENT(), defined in gcc/tree.h.
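
The embedding trick can be sketched in plain C. The types and macros below are toy versions with hypothetical names and fields; the real macros in gcc/tree.h may be written differently, but the idea of stepping between a node and a member embedded inside it is the same.

```c
#include <assert.h>
#include <stddef.h>

/* Toy hash table node, as the preprocessor would see it.  */
struct toy_ht_identifier {
  const char *str;
  unsigned len;
};

/* Toy tree node that embeds the hash table node.  */
struct toy_tree_identifier {
  unsigned code;                /* would be IDENTIFIER_NODE */
  unsigned lang_flag_0 : 1;     /* set for reserved words */
  struct toy_ht_identifier id;  /* the embedded hash table node */
};

/* Tree node -> embedded hash node: take the member's address.  */
#define TOY_GCC_IDENT_TO_HT_IDENT(NODE) (&(NODE)->id)

/* Hash node -> enclosing tree node: step back by the member's offset.  */
#define TOY_HT_IDENT_TO_GCC_IDENT(HT) \
  ((struct toy_tree_identifier *) \
   ((char *) (HT) - offsetof (struct toy_tree_identifier, id)))

/* Convert to the hash node and back; the result is NODE itself.  */
static struct toy_tree_identifier *
toy_round_trip (struct toy_tree_identifier *node)
{
  assert (node != NULL);
  return TOY_HT_IDENT_TO_GCC_IDENT (TOY_GCC_IDENT_TO_HT_IDENT (node));
}
```

Because both views share one allocation, the preprocessor can hand out hash nodes while the front-end treats the same memory as a tree node.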

Now the location number we get from get_token() is 98, which is equal to 96 + 2. This means the token begins on the first column of the first line, and spans three characters. This is expected, since the first token in our input file is int. Since this is not a PCH directive, we do not consume it, and proceed to the function c_parser_translation_unit().
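
Going in the other direction, a location such as 98 can be split back into a column and a packed range offset. The helper below is a hypothetical sketch, not libcpp code. It assumes the location lies in an ordinary map whose start_location is aligned to the range bits, and that the low bits hold the finish column minus the start column.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch.  */
struct toy_decoded {
  uint32_t column;       /* start column */
  uint32_t range_offset; /* finish column minus start column */
};

/* Split LOC into its column and its packed range offset.  */
static struct toy_decoded
toy_decode (uint32_t loc, uint32_t start_location,
            unsigned column_and_range_bits, unsigned range_bits)
{
  struct toy_decoded d;
  uint32_t in_line;
  assert (loc >= start_location);
  /* Keep only the bits that describe the position within one line.  */
  in_line = (loc - start_location) & ((1u << column_and_range_bits) - 1);
  d.column = in_line >> range_bits;
  d.range_offset = in_line & ((1u << range_bits) - 1);
  return d;
}
```

For loc = 98 with start_location = 64, 12 column-and-range bits and 5 range bits, this gives column 1 and offset 2, that is, a token covering columns 1 to 3.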

The c_parser_translation_unit() function is basically a loop that reads the declarations and definitions in the file. After reading all declarations and definitions, it will check if there are non-extern variable definitions whose type is incomplete:

struct x; /* Incomplete type */
struct x var; /* Allowed, provided that a definition of struct x is provided later */
struct x func1 (struct x input); /* Allowed */
struct x func2 (struct x input) { return input; } /* Not allowed */
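
For comparison, here is a small file in which the allowed case is actually completed. The names are made up for this example. The variable is declared while its type is still incomplete, and the type is completed before the end of the translation unit, so the end-of-file check passes.

```c
#include <assert.h>

struct point;                 /* incomplete type */
struct point origin;          /* tentative definition, type still incomplete */
struct point { int x, y; };   /* completes the type before the end of the file */

/* By this point struct point is complete, so its members can be used.
   A file-scope object without an initializer starts out zeroed.  */
static int
origin_is_zero (void)
{
  return origin.x == 0 && origin.y == 0;
}
```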

Each individual declaration is handled by c_parser_external_declaration(), which is a wrapper for c_parser_declaration_or_fndef() that handles matters related to Objective-C. The syntax for declarations and definitions is:

   declaration:
     declaration-specifiers init-declarator-list[opt] ;
     static_assert-declaration

   function-definition:
     declaration-specifiers[opt] declarator declaration-list[opt]
       compound-statement

   declaration-list:
     declaration
     declaration-list declaration

   init-declarator-list:
     init-declarator
     init-declarator-list , init-declarator

   init-declarator:
     declarator simple-asm-expr[opt] gnu-attributes[opt]
     declarator simple-asm-expr[opt] gnu-attributes[opt] = initializer

The declaration-specifiers part covers the type specifiers (e.g. int), type qualifiers (e.g. const), and storage-class specifiers (e.g. static). Each declarator is the name of the symbol being declared together with its modifiers (e.g. the pointer marker * and the array marker []). The function-definition rule is a little unusual. Its declaration-specifiers part describes the return type, but it is optional: if it is omitted, the compiler assumes the function returns int (the "implicit int" rule, which C99 removed). The declarator part is the function name plus its parameters. The declaration-list part is used for pre-ANSI (K&R) style function definitions:

int add(a, b) /* the declarator part */
int a, b; /* the declaration-list part */
{ return a + b; }

First, c_parser_declaration_or_fndef() checks whether the declaration is a static assertion and, if so, handles it separately. Otherwise it initializes an empty set of declaration specifiers:

  if (static_assert_ok
      && c_parser_next_token_is_keyword (parser, RID_STATIC_ASSERT))
    {
      c_parser_static_assert_declaration (parser);
      return;
    }
  specs = build_null_declspecs ();

Then it calls c_parser_declspecs() to parse the declaration specifiers. Each type specifier is recorded in a structure of type struct c_typespec, which has the following members:

  • kind: The kind of the specifier, see enum c_typespec_kind in gcc/c/c-tree.h.
  • spec: A tree representing the specifier. For keywords like int, this is just an IDENTIFIER_NODE.
  • has_enum_type_specifier: Only used for enum type specifiers.
  • expr, expr_const_operands: Only used when typeof is applied to variably modified types (such as the types of variable length arrays).

The sequence of specifiers is aggregated into a structure of type struct c_declspecs by calling declspecs_add_type() (in gcc/c/c-decl.cc) and related functions. The declspecs_add_type() function is also where the error message "long long long is too long for GCC" is printed (https://stackoverflow.com/questions/15367603/long-long-long-is-too-long-for-gcc).

After parsing all specifiers, c_parser_declaration_or_fndef() calls finish_declspecs() (in gcc/c/c-decl.cc) to compute the final type of the declaration. For our simple input, the return type of func is int. For such basic types there are predefined tree nodes to represent them, provided in arrays called global_trees and integer_types (see gcc/tree.h and gcc/tree.cc). For example, the tree node representing int is called integer_type_node. These nodes are initialized by c_common_nodes_and_builtins() in gcc/c-family/c-common.cc. See also build_common_tree_nodes() in gcc/tree.cc.

The documentation for type nodes in tree.def is as follows:

Each node that represents a data type has a component TYPE_SIZE
   that evaluates either to a tree that is a (potentially non-constant)
   expression representing the type size in bits, or to a null pointer
   when the size of the type is unknown (for example, for incomplete
   types such as arrays of unspecified bound).
   The TYPE_MODE contains the machine mode for values of this type.
   The TYPE_POINTER_TO field contains a type for a pointer to this type,
     or zero if no such has been created yet.
   The TYPE_NEXT_VARIANT field is used to chain together types
     that are variants made by type modifiers such as "const" and "volatile".
   The TYPE_MAIN_VARIANT field, in any member of such a chain,
     points to the start of the chain.
   The TYPE_NAME field contains info on the name used in the program
     for this type (for GDB symbol table output).  It is either a
     TYPE_DECL node, for types that are typedefs, or an IDENTIFIER_NODE
     in the case of structs, unions or enums that are known with a tag,
     or zero for types that have no special name.
   The TYPE_CONTEXT for any sort of type which could have a name or
    which could have named members (e.g. tagged types in C/C++) will
    point to the node which represents the scope of the given type, or
    will be NULL_TREE if the type has "file scope".  For most types, this
    will point to a BLOCK node or a FUNCTION_DECL node, but it could also
    point to a FUNCTION_TYPE node (for types whose scope is limited to the
    formal parameter list of some function type specification) or it
    could point to a RECORD_TYPE, UNION_TYPE or QUAL_UNION_TYPE node
    (for C++ "member" types).
    For non-tagged-types, TYPE_CONTEXT need not be set to anything in
    particular, since any type which is of some type category  (e.g.
    an array type or a function type) which cannot either have a name
    itself or have named members doesn't really have a "scope" per se.

For basic type nodes like integer_type_node, the TYPE_NAME field is a tree node with code TYPE_DECL. The TYPE_DECL node has a type field which points back to the type node (hence the structure is not really a tree). It also has a name field which is an IDENTIFIER_NODE node containing the string "int".

To summarize, after finish_declspecs() we have specs->type == integer_type_node == integer_types[itk_int].

Now c_parser_declaration_or_fndef() calls c_parser_declarator() to parse a declarator. The syntax is:

   declarator:
     pointer[opt] direct-declarator

   direct-declarator:
     identifier
     ( gnu-attributes[opt] declarator )
     direct-declarator array-declarator
     direct-declarator ( parameter-type-list )
     direct-declarator ( identifier-list[opt] )

   pointer:
     * type-qualifier-list[opt]
     * type-qualifier-list[opt] pointer

   type-qualifier-list:
     type-qualifier
     gnu-attributes
     type-qualifier-list type-qualifier
     type-qualifier-list gnu-attributes

   array-declarator:
     [ type-qualifier-list[opt] assignment-expression[opt] ]
     [ static type-qualifier-list[opt] assignment-expression ]
     [ type-qualifier-list static assignment-expression ]
     [ type-qualifier-list[opt] * ]

   parameter-type-list:
     parameter-list
     parameter-list , ...

   parameter-list:
     parameter-declaration
     parameter-list , parameter-declaration

   parameter-declaration:
     declaration-specifiers declarator gnu-attributes[opt]
     declaration-specifiers abstract-declarator[opt] gnu-attributes[opt]

   identifier-list:
     identifier
     identifier-list , identifier

   abstract-declarator:
     pointer
     pointer[opt] direct-abstract-declarator

   direct-abstract-declarator:
     ( gnu-attributes[opt] abstract-declarator )
     direct-abstract-declarator[opt] array-declarator
     direct-abstract-declarator[opt] ( parameter-type-list[opt] )

A declarator is stored in a structure called struct c_declarator. In our simple input, since there are no pointer markers after the declaration specifier, c_parser_declarator() immediately calls c_parser_direct_declarator(). The c_parser_direct_declarator() function first reads the function name and wraps it into a c_declarator pointed to by a variable called inner. At this point, we have inner->kind == cdk_id, and inner->u.id.id is an identifier node containing "func".

Then c_parser_direct_declarator() calls c_parser_direct_declarator_inner(), which in turn calls c_parser_parms_declarator() to parse the parameter list. In our input file, the parameter list contains only void. The parameter list is stored in a structure called struct c_arg_info, while each parameter is stored in a structure called struct c_parm.

As mentioned before, C function declarations can appear in two styles. Therefore, c_parser_parms_declarator() first looks at the next token and checks whether it is a type name. If it is a type name, we assume the declaration is in the conventional (prototype) form. Otherwise, it is in the old-fashioned form. The conventional form is handled by c_parser_parms_list_declarator(). Each individual parameter is handled by c_parser_parameter_declaration().

The c_parser_parameter_declaration() function calls c_parser_declspecs() and finish_declspecs() to parse the void type specifier. After these function calls, we have specs->type == void_type_node == global_trees[TI_VOID_TYPE].

Then c_parser_parameter_declaration() calls c_parser_declarator() to process a potential parameter name. In our input there is no parameter name after void. In this case c_parser_declarator() returns a c_declarator with kind == cdk_id and u.id.id == NULL.
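
The shape of these declarator structures can be pictured with a toy model. Everything below is hypothetical (the names, and the flattened layout without a union). It assumes that a function declarator wraps the declarator for the name, and that an unnamed parameter is an identifier declarator whose name is NULL.

```c
#include <assert.h>
#include <stddef.h>

enum toy_declarator_kind { TOY_CDK_ID, TOY_CDK_FUNCTION, TOY_CDK_POINTER };

struct toy_declarator {
  enum toy_declarator_kind kind;
  struct toy_declarator *declarator;  /* the declarator this one wraps */
  const char *id;                     /* name, used only for TOY_CDK_ID */
};

/* Walk inward through the wrappers until the identifier is reached.
   Returns NULL for an unnamed (abstract) declarator.  */
static const char *
toy_declared_name (const struct toy_declarator *d)
{
  assert (d != NULL);
  while (d->kind != TOY_CDK_ID)
    {
      d = d->declarator;
      assert (d != NULL);
    }
  return d->id;
}
```

In this model, int func (void) is a TOY_CDK_FUNCTION node wrapping a TOY_CDK_ID node named "func", while the lone void parameter is a TOY_CDK_ID node with a NULL name.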

At this point c_parser_parms_list_declarator() calls push_parm_decl() in gcc/c/c-decl.cc to construct a tree node with code PARM_DECL. The structure type of the tree node is tree_parm_decl. The type field is set to void_type_node, while the name field is set to NULL. There is also a DECL_ARG_TYPE field (it shares its storage with DECL_INITIAL), which stores the type of the argument after parameter type promotion. See https://en.cppreference.com/w/c/language/conversion.html.

Since void is the last item in the parameter list, c_parser_parms_list_declarator() calls get_parm_info() and returns. This function does some sanity checks on the parameter list, and returns a structure called struct c_arg_info.

Now c_parser_declaration_or_fndef() calls start_function(). This function calls grokdeclarator() in gcc/c/c-decl.cc to process the function declarator. The grokdeclarator() function takes the declarator part (containing the function name and parameters) and the declaration specifiers (containing the return type and other attributes), and builds a tree with code FUNCTION_DECL representing the (currently empty) function. The actual functions that build the tree nodes are build_function_type() and build_decl() in gcc/tree.cc.

The documentation for FUNCTION_DECL in gcc/tree.def says:

FUNCTION_DECLs use four special fields:
   DECL_ARGUMENTS holds a chain of PARM_DECL nodes for the arguments.
   DECL_RESULT holds a RESULT_DECL node for the value of a function.
    The DECL_RTL field is 0 for a function that returns no value.
    (C functions returning void have zero here.)
    The TREE_TYPE field is the type in which the result is actually
    returned.  This is usually the same as the return type of the
    FUNCTION_DECL, but it may be a wider integer type because of
    promotion.
   DECL_FUNCTION_CODE is a code number that is nonzero for
    built-in functions.  Its value is an enum built_in_function
    that says which built-in function it is.

If we call debug_tree() on the return value of grokdeclarator(), we see that:

(gdb) p debug_tree((tree)0x7b97e2ed0000)
 <function_decl 0x7b97e2ed0000 func
    type <function_type 0x7b97e302c888
        type <integer_type 0x7b97e301c5e8 int public SI
            size <integer_cst 0x7b97e301f150 constant 32>
            unit-size <integer_cst 0x7b97e301f168 constant 4>
            align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7b97e301c5e8 precision:32 min <integer_cst 0x7b97e301f108 -2147483648> max <integer_cst 0x7b97e301f120 2147483647>
            pointer_to_this <pointer_type 0x7b97e3024b28>>
        SI size <integer_cst 0x7b97e301f150 32> unit-size <integer_cst 0x7b97e301f168 4>
        align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7b97e302c888
        arg-types <tree_list 0x7b97e300c988 value <void_type 0x7b97e301cf18 void>>>
    public DI ptest.c:1:5 align:32 warn_if_not_align:0>
$15 = void

The first line corresponds to the root node returned by grokdeclarator(). The 0x7b97e2ed0000 part is the address of the tree node. The func part is the name of the function. The second line is the tree node that represents the type information of the function. Lines 3--7 describe the return type int. Lines 8--10 are the other fields of a FUNCTION_TYPE. The SI size <integer_cst 0x7b97e301f150 32> unit-size <integer_cst 0x7b97e301f168 4> part is set by layout_type() in gcc/stor-layout.cc. It simply records that the beginning of a function must be aligned on a 32-bit (4-byte) boundary, since FUNCTION_BOUNDARY is measured in bits:

      /* It's hard to see what the mode and size of a function ought to
	       be, but we do know the alignment is FUNCTION_BOUNDARY, so
	       make it consistent with that.  */
      SET_TYPE_MODE (type,
		     int_mode_for_size (FUNCTION_BOUNDARY, 0).else_blk ());
      TYPE_SIZE (type) = bitsize_int (FUNCTION_BOUNDARY);
      TYPE_SIZE_UNIT (type) = size_int (FUNCTION_BOUNDARY / BITS_PER_UNIT);
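
As a rough model of these three assignments, the sketch below uses toy names and assumes FUNCTION_BOUNDARY is 32 and BITS_PER_UNIT is 8, which matches the size 32 and unit-size 4 in the dump. The mode lookup returns an exact-width integer mode when one exists and falls back to a block mode otherwise.

```c
#include <assert.h>

enum toy_mode { TOY_QI, TOY_HI, TOY_SI, TOY_DI, TOY_TI, TOY_BLK };

/* Values assumed for this sketch.  */
enum { TOY_BITS_PER_UNIT = 8, TOY_FUNCTION_BOUNDARY = 32 };

/* Integer mode whose width is exactly BITS, or TOY_BLK if there is none.  */
static enum toy_mode
toy_int_mode_or_blk (unsigned bits)
{
  switch (bits)
    {
    case 8:   return TOY_QI;
    case 16:  return TOY_HI;
    case 32:  return TOY_SI;
    case 64:  return TOY_DI;
    case 128: return TOY_TI;
    default:  return TOY_BLK;
    }
}

/* Size in bytes for a size given in bits.  */
static unsigned
toy_size_unit (unsigned bits)
{
  assert (bits % TOY_BITS_PER_UNIT == 0);
  return bits / TOY_BITS_PER_UNIT;
}
```

With a boundary of 32 bits the lookup yields the SI mode and a unit size of 4 bytes, which is what debug_tree() printed for the function type.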

The else_blk() call means that, if there is no machine mode for the specified size (FUNCTION_BOUNDARY), then return BLKmode, which is the "block mode" used for values that otherwise cannot be assigned a machine mode. See https://gcc.gnu.org/onlinedocs/gccint/Machine-Modes.html for an introduction to machine modes. The symtab:0 part is only for debugging and is always 0 unless debug hooks are attached. See the comments for the TYPE_SYMTAB_ADDRESS() macro in gcc/tree.h. The warn_if_not_align:0 part is "the minimum alignment necessary for objects of this type without warning." See the warn_if_not_align attribute (https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html). The alias-set -1 part is related to pointer aliasing analysis. The comments for TYPE_ALIAS_SET() in gcc/tree.h say:

/* The (language-specific) typed-based alias set for this type.
   Objects whose TYPE_ALIAS_SETs are different cannot alias each
   other.  If the TYPE_ALIAS_SET is -1, no alias set has yet been
   assigned to this type.  If the TYPE_ALIAS_SET is 0, objects of this
   type can alias objects of any type.  */
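
Read literally, the rule in this comment amounts to the predicate below. This is a toy with a hypothetical name that follows only the sentence quoted above; GCC's real alias query is more involved, so it should not be taken as the compiler's actual logic.

```c
#include <assert.h>
#include <stdbool.h>

/* May objects with alias sets SET1 and SET2 alias each other?
   Set 0 aliases everything; otherwise the two sets must be equal.  */
static bool
toy_alias_sets_may_conflict (int set1, int set2)
{
  /* -1 means "not assigned yet"; the caller must compute the set first.  */
  assert (set1 >= 0 && set2 >= 0);
  if (set1 == 0 || set2 == 0)
    return true;
  return set1 == set2;
}
```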

The canonical-type field is a pointer to the "canonical" node of a type, which is, informally speaking, the first node created for a type. For example, if one writes typedef int int32_t then the canonical type of int32_t is int. Later, if one uses the type int32_t * then the canonical type of int32_t * is int *. The canonical node of a function type is the function type with its return type and argument types replaced with the corresponding canonical nodes. In line 11, public indicates this function declaration has external visibility. The ptest.c:1:5 part indicates the location of the function name in the source file (line 1 column 5). The DI and align:32 parts are set by the make_node() function in gcc/tree.cc:

	  if (code == FUNCTION_DECL)
	    {
	      SET_DECL_ALIGN (t, FUNCTION_ALIGNMENT (FUNCTION_BOUNDARY)); /* FUNCTION_BOUNDARY == 32 */
	      SET_DECL_MODE (t, FUNCTION_MODE); /* FUNCTION_MODE == Pmode == DImode, see config/aarch64/aarch64.h */
	    }
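
Returning briefly to canonical types: the fact that a typedef name and its underlying type share one canonical node has a visible counterpart in the language itself. The C11 sketch below (with a made-up typedef name) uses _Generic to show that the typedef does not introduce a distinct type, for the type itself or for pointers to it.

```c
#include <assert.h>

typedef int my_int32;   /* hypothetical typedef name for this sketch */

/* 1 if EXPR has type int (or int *), 0 otherwise.  */
#define IS_INT(EXPR)     _Generic ((EXPR), int: 1, default: 0)
#define IS_INT_PTR(EXPR) _Generic ((EXPR), int *: 1, default: 0)

/* A typedef is only another name for int, so both selections
   pick the int branch.  */
static int
typedef_is_same_type (void)
{
  my_int32 value = 0;
  my_int32 *ptr = &value;
  return IS_INT (value) && IS_INT_PTR (ptr);
}
```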

Unless the return type of a function is void, it needs a place to store its return value. In the tree representation this is declared with a special tree node with code RESULT_DECL. It is created at the end of start_function():

  /* Set the result decl source location to the location of the typespec.  */
  result_loc = (declspecs->locations[cdw_typespec] == UNKNOWN_LOCATION
		? loc : declspecs->locations[cdw_typespec]);
  restype = TREE_TYPE (TREE_TYPE (current_function_decl));
  resdecl = build_decl (result_loc, RESULT_DECL, NULL_TREE, restype);
  DECL_ARTIFICIAL (resdecl) = 1;
  DECL_IGNORED_P (resdecl) = 1;
  DECL_RESULT (current_function_decl) = resdecl;

The DECL_ARTIFICIAL flag means this entity is compiler-generated. The DECL_IGNORED_P flag means no debug symbol should be generated for this entity.

After the modifications made by start_function(), the function declaration tree now looks like:

(gdb) p debug_tree(current_function_decl)
 <function_decl 0x776f206d0000 func
    type <function_type 0x776f2082c888
        type <integer_type 0x776f2081c5e8 int public SI
            size <integer_cst 0x776f2081f150 constant 32>
            unit-size <integer_cst 0x776f2081f168 constant 4>
            align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x776f2081c5e8 precision:32 min <integer_cst 0x776f2081f108 -2147483648> max <integer_cst 0x776f2081f120 2147483647>
            pointer_to_this <pointer_type 0x776f20824b28>>
        SI size <integer_cst 0x776f2081f150 32> unit-size <integer_cst 0x776f2081f168 4>
        align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x776f2082c888
        arg-types <tree_list 0x776f2080c988 value <void_type 0x776f2081cf18 void>>>
    public static DI ptest.c:1:5 align:32 warn_if_not_align:0 initial <error_mark 0x776f20802ee8>
    result <result_decl 0x776f206d1000 D.3508 type <integer_type 0x776f2081c5e8 int>
        ignored SI ptest.c:1:1 size <integer_cst 0x776f2081f150 32> unit-size <integer_cst 0x776f2081f168 4>
        align:32 warn_if_not_align:0>>
$1 = void

The static flag of function_decl indicates that the function has static storage, like global variables. It is not the static specifier of the C language. The initial field temporarily contains error_mark. It will later be replaced by a tree with code BLOCK, representing a tree of variable scopes. The D.3508 part of result_decl is a unique identifier for declarations. See the make_node() and the allocate_decl_uid() functions in gcc/tree.cc.

Next c_parser_declaration_or_fndef() calls store_parm_decls(). The comments say the purpose of this function is:

Store the parameter declarations into the current function declaration.

This is different from the arg-types field of the FUNCTION_TYPE node. The arg-types field only specifies the type of each argument. The store_parm_decls() function fills in the DECL_ARGUMENTS field of the FUNCTION_DECL node. This is a linked list of PARM_DECL nodes that specify the names of the arguments.
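
The same split between types and names is visible at the source level. In the sketch below (all names are made up), the prototype and the function pointer type mention only parameter types, while the definition is what supplies the parameter names.

```c
#include <assert.h>

/* A prototype carries only the parameter types; names are optional.  */
int toy_scale (int, int);

/* A pointer-to-function type likewise mentions types only.  */
typedef int (*toy_binary_fn) (int, int);

/* The definition supplies the names of the parameters.  */
int
toy_scale (int value, int factor)
{
  return value * factor;
}

/* Call FN through its type alone; the parameter names play no role.  */
static int
toy_apply (toy_binary_fn fn, int a, int b)
{
  assert (fn != 0);
  return fn (a, b);
}
```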

Another purpose of store_parm_decls() is to set up the struct-function field of the FUNCTION_DECL node. This is a structure of type struct function that records many properties about the function. It is initialized by calling allocate_struct_function() in gcc/function.cc. After calling allocate_struct_function(), the structure can be accessed via either DECL_STRUCT_FUNCTION(current_function_decl) or cfun (a global variable declared in gcc/function.h and defined in gcc/function.cc). Most fields of struct function are for the middle-end, and the front-end should not touch them. However, there is a language field which the front-end can use to store language-specific data. After finishing parsing the function, the front-end should call free_after_parsing() in gcc/function.cc which will set language to NULL. There are also two fields called function_start_locus and function_end_locus which should be set to the beginning and end locations of the function.

After processing the function declarator, the parser calls c_parser_compound_statement() to process the function body. The body of a function is a tree node with code BIND_EXPR. A BIND_EXPR node has the following fields:

  • BIND_EXPR_VARS: A chain of VAR_DECL nodes for the variables declared at this scope. If the body contains nested scopes (e.g. the body of a loop), then each nested scope has its own BIND_EXPR node. The variables declared in those nested scopes should not appear here.
  • BIND_EXPR_BODY: A node with code STATEMENT_LIST, which is a linked list of statements. Each variable in BIND_EXPR_VARS should appear as a DECL_EXPR node here. Also, each jump label in the source code should appear as a LABEL_EXPR node here.
  • BIND_EXPR_BLOCK: A node with code BLOCK. Each scope in a function body has a corresponding BLOCK node. It is used for generating debug information. The BLOCK node has the following fields:
    • BLOCK_VARS: Identical to the BIND_EXPR_VARS field.
    • BLOCK_SUPERCONTEXT: A pointer to the parent block. For the body of a function, the parent block is the FUNCTION_DECL node. For nested scopes, the parent block is the outer scope.
    • BLOCK_CHAIN: A pointer to the next block at the same level.
    • BLOCK_ABSTRACT_ORIGIN: Appears to be only used by the middle-end for processing function inlining. Front-ends can leave this field as NULL.
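
The links between BLOCK nodes can be modelled with a few pointers. The struct below is a toy with hypothetical names. A NULL supercontext stands in for the FUNCTION_DECL, and the subblocks member is modelled on GCC's BLOCK_SUBBLOCKS, which the list above does not cover.

```c
#include <assert.h>
#include <stddef.h>

struct toy_block {
  struct toy_block *supercontext; /* enclosing block; NULL stands for the function */
  struct toy_block *subblocks;    /* first nested block */
  struct toy_block *chain;        /* next block at the same level */
};

/* Attach CHILD as a nested scope of PARENT (added at the front).  */
static void
toy_add_subblock (struct toy_block *parent, struct toy_block *child)
{
  assert (parent != NULL && child != NULL);
  child->supercontext = parent;
  child->chain = parent->subblocks;
  parent->subblocks = child;
}

/* Nesting depth; the outermost block of a function has depth 0.  */
static unsigned
toy_block_depth (const struct toy_block *block)
{
  unsigned depth = 0;
  assert (block != NULL);
  while (block->supercontext != NULL)
    {
      block = block->supercontext;
      depth++;
    }
  return depth;
}
```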

The BIND_EXPR also has a TREE_TYPE field. This field is usually void_type_node. If it is not, then this BIND_EXPR represents a statement expression (https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html).
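
For readers who have not met this GNU extension, here is a small example with a made-up function name. The braces inside the parentheses form a scope with its own local variable, and the value of the last statement becomes the value of the whole expression, which is why such a scope needs a non-void type. It requires a compiler that accepts GNU C.

```c
#include <assert.h>

/* Clamp V to the range 0..255 using a GNU statement expression.  */
static int
clamp_to_byte (int v)
{
  int r = ({ int t = v;
             if (t < 0) t = 0;
             if (t > 255) t = 255;
             t; });   /* the last statement gives the value */
  return r;
}
```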

In our input file, the only statement in the function body is a return statement. This statement is represented with a tree node with code RETURN_EXPR. The documentation for RETURN_EXPR in gcc/tree.def is:

/* RETURN.  Evaluates operand 0, then returns from the current function.
   Presumably that operand is an assignment that stores into the
   RESULT_DECL that hold the value to be returned.
   The operand may be null.
   The type should be void and the value should be ignored.  */

If the function does not return any value, then operand 0 is simply NULL.
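
To see how these pieces fit together, here is a toy model of a return statement as a small tree. The node layout and all names are hypothetical and far simpler than GCC's. It only shows the shape described above: a return node whose operand 0 is an assignment storing a constant into the result declaration.

```c
#include <assert.h>
#include <stddef.h>

enum toy_code {
  TOY_INTEGER_CST, TOY_RESULT_DECL, TOY_MODIFY_EXPR, TOY_RETURN_EXPR
};

struct toy_tree {
  enum toy_code code;
  long value;                  /* constant, or storage for the result */
  struct toy_tree *op0, *op1;  /* operands, if any */
};

/* Evaluate a return node: perform the assignment in operand 0 (if any)
   and report the value now held by the result declaration.  A return
   without operand yields 0 in this toy.  */
static long
toy_eval_return (struct toy_tree *ret)
{
  struct toy_tree *assign;
  assert (ret != NULL && ret->code == TOY_RETURN_EXPR);
  assign = ret->op0;
  if (assign == NULL)
    return 0;                  /* plain "return;" */
  assert (assign->code == TOY_MODIFY_EXPR);
  assert (assign->op0 != NULL && assign->op0->code == TOY_RESULT_DECL);
  assert (assign->op1 != NULL && assign->op1->code == TOY_INTEGER_CST);
  assign->op0->value = assign->op1->value;   /* the side effect */
  return assign->op0->value;
}
```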

After parsing the whole function, the BIND_EXPR node representing the body of a function is written into the DECL_SAVED_TREE field of the FUNCTION_DECL node. The corresponding BLOCK node is written into the DECL_INITIAL field of the FUNCTION_DECL node. Also, every variable (including the parameters and the RESULT_DECL) and label declared in the function will have its DECL_CONTEXT field set to the FUNCTION_DECL node. These operations are done in the pop_scope() and finish_function() functions in gcc/c/c-decl.cc.

The comments for finish_function() say:

/* Finish up a function declaration and compile that function
   all the way to assembler language output.  Then free the storage
   for the function definition.

   This is called after parsing the body of the function definition.  */

This is no longer accurate: GCC does not "compile a function all the way to assembly" before processing the next function. It used to do so in the very early days (see https://ftp.tsukuba.wide.ad.jp/software/gcc/summit/2003/GENERIC%20and%20GIMPLE.pdf), but now finish_function() does the following:

  • Call c_genericize() on the FUNCTION_DECL node. This function will replace some C-specific tree codes in the declaration with GENERIC codes. It does not completely convert the tree to GENERIC though. Later the middle-end will invoke a callback function (c_gimplify_expr() in gcc/c-family/c-gimplify.cc) to convert remaining C-specific codes to GIMPLE.
  • Call cgraph_node::finalize_function() on the FUNCTION_DECL node. By the time this function is called, the FUNCTION_DECL node contains the following data:
(gdb) p debug_tree(decl)
 <function_decl 0x7aa299cd0000 func
    type <function_type 0x7aa299e2c888
        type <integer_type 0x7aa299e1c5e8 int public SI
            size <integer_cst 0x7aa299e1f150 constant 32>
            unit-size <integer_cst 0x7aa299e1f168 constant 4>
            align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7aa299e1c5e8 precision:32 min <integer_cst 0x7aa299e1f108 -2147483648> max <integer_cst 0x7aa299e1f120 2147483647>
            pointer_to_this <pointer_type 0x7aa299e24b28>>
        SI size <integer_cst 0x7aa299e1f150 32> unit-size <integer_cst 0x7aa299e1f168 4>
        align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7aa299e2c888
        arg-types <tree_list 0x7aa299e0c988 value <void_type 0x7aa299e1cf18 void>>>
    public static DI ptest.c:1:5 align:32 warn_if_not_align:0 initial <block 0x7aa299cd3000>
    result <result_decl 0x7aa299cd1000 D.3508 type <integer_type 0x7aa299e1c5e8 int>
        ignored SI ptest.c:1:1 size <integer_cst 0x7aa299e1f150 32> unit-size <integer_cst 0x7aa299e1f168 4>
        align:32 warn_if_not_align:0 context <function_decl 0x7aa299cd0000 func>>
    struct-function 0x7aa299cd2000>
$1 = void

(gdb) p debug_tree(((struct tree_function_decl *)decl)->saved_tree)
 <bind_expr 0x7aa299ccf0f0
    type <void_type 0x7aa299e1cf18 void VOID
        align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7aa299e1cf18
        pointer_to_this <pointer_type 0x7aa299e24000>>
    side-effects
    body <return_expr 0x7aa299ca41c0 type <void_type 0x7aa299e1cf18 void>
        side-effects
        arg:0 <modify_expr 0x7aa299ca2a00 type <integer_type 0x7aa299e1c5e8 int>
            side-effects arg:0 <result_decl 0x7aa299cd1000 D.3508>
            arg:1 <integer_cst 0x7aa299f0ee40 constant 42>
            ptest.c:1:26 start: ptest.c:1:26 finish: ptest.c:1:27>
        ptest.c:1:26 start: ptest.c:1:26 finish: ptest.c:1:27>
    block <block 0x7aa299cd3000 used
        supercontext <function_decl 0x7aa299cd0000 func type <function_type 0x7aa299e2c888>
            public static DI ptest.c:1:5 align:32 warn_if_not_align:0 initial <block 0x7aa299cd3000> result <result_decl 0x7aa299cd1000 D.3508>
            struct-function 0x7aa299cd2000>>
    ptest.c:1:17 start: ptest.c:1:17 finish: ptest.c:1:17>
$4 = void

The side-effects flag indicates that evaluating an expression has side effects. The side effect here, of course, is that the result variable will be modified, and the function call will return. The used flag on the block node indicates that the block is tied to a scope in the function. In most cases it should be set.

The dumped "raw tree" is an alternative presentation of the tree structure above. It begins from the BIND_EXPR node, not the FUNCTION_DECL node. It also ignores the BLOCK node.

;; Function func (null)
;; enabled by -tree-original

@1      bind_expr        type: @2       body: @3      
@2      void_type        name: @4       algn: 8       
@3      return_expr      type: @2       expr: @5      
@4      type_decl        name: @6       type: @2      
@5      modify_expr      type: @7       op 0: @8       op 1: @9      
@6      identifier_node  strg: void     lngt: 4       
@7      integer_type     name: @10      size: @11      algn: 32      
                         prec: 32       sign: signed   min : @12     
                         max : @13     
@8      result_decl      type: @7       scpe: @14      srcp: ptest.c:1      
                         note: artificial              size: @11     
                         algn: 32      
@9      integer_cst      type: @7      int: 42
@10     type_decl        name: @15      type: @7      
@11     integer_cst      type: @16     int: 32
@12     integer_cst      type: @7      int: -2147483648
@13     integer_cst      type: @7      int: 2147483647
@14     function_decl    name: @17      type: @18      srcp: ptest.c:1      
                         link: extern  
@15     identifier_node  strg: int      lngt: 3       
@16     integer_type     name: @19      size: @20      algn: 128     
                         prec: 128      sign: unsigned min : @21     
                         max : @22     
@17     identifier_node  strg: func     lngt: 4       
@18     function_type    size: @11      algn: 32       retn: @7      
                         prms: @23     
@19     identifier_node  strg: bitsizetype             lngt: 11      
@20     integer_cst      type: @16     int: 128
@21     integer_cst      type: @16     int: 0
@22     integer_cst      type: @16     int: -1
@23     tree_list        valu: @2

More Complex Inputs

For more complex inputs, our strategy will be as follows:

  1. We set breakpoints on the functions cgraph_node::finalize_function(), rest_of_decl_compilation(), and rest_of_type_compilation().
  2. At each breakpoint, we call debug_tree() to inspect the tree the front-end has built.
  3. If there are fields whose purpose we don't recognize, trace the front-end code to see where that field was written.

We shall look at the following string-processing function:

typedef unsigned int uint32_t;

uint32_t remove_slash (const char * input, uint32_t input_buf_len, char * output, uint32_t output_buf_len) {
  if (output_buf_len == 0) return 0;
  if (output_buf_len == 1) { output[0] = 0; return 0; }

  uint32_t output_len = 0;
  uint32_t prev_char_is_slash = 0;

  for (uint32_t i = 0; i < input_buf_len; i++) {

    if (input[i] == 0) {
      if (prev_char_is_slash) output[output_len++] = '\\';
      output[output_len] = 0; return output_len;
    }

    if (prev_char_is_slash) {

      if (input[i] == '\\') {

	output[output_len++] = '\\';
	if (output_len == output_buf_len - 1) { output[output_len] = 0; return output_len; }

      } else if (input[i] == 'n') {

	output[output_len++] = '\n';
	if (output_len == output_buf_len - 1) { output[output_len] = 0; return output_len; }
	
      } else {

	output[output_len++] = '\\';
	if (output_len == output_buf_len - 1) { output[output_len] = 0; return output_len; }

	output[output_len++] = input[i];
	if (output_len == output_buf_len - 1) { output[output_len] = 0; return output_len; }

      }

      prev_char_is_slash = 0;

    } else {

      if (input[i] == '\\') {

	prev_char_is_slash = 1;

      } else {

	output[output_len++] = input[i];
	if (output_len == output_buf_len - 1) { output[output_len] = 0; return output_len; }

      }

    }

  }

  output[output_len] = 0;
  return output_len;
}

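This function copies a NUL-terminated string from input to output while processing backslash escape sequences: \\ becomes a single backslash, \n becomes a newline, and any other escaped character is copied together with its backslash. The output is truncated if necessary so that it always fits in output_buf_len bytes including the terminating NUL, and the function returns the length of the output string.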

The first declaration to be handled is the typedef, typedef unsigned int uint32_t. After c_parser_declspecs() parses the declaration specifiers (typedef unsigned int) and c_parser_declarator() parses the declarator (uint32_t), c_parser_declaration_or_fndef() calls start_decl(). As before, this function calls grokdeclarator(). The tree returned by grokdeclarator() is:

 <type_decl 0x72988ed1ad10 uint32_t                                           
    type <integer_type 0x72988ec1c690 unsigned int public unsigned SI         
        size <integer_cst 0x72988ec1f150 constant 32>                         
        unit-size <integer_cst 0x72988ec1f168 constant 4>                     
        align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x72988ec1c690 precision:32 min <integer_cst 0x72988ec1f180 0> max <integer_cst 0x72988ec1f138 4294967295>                                                      
        pointer_to_this <pointer_type 0x72988ec2c000>>                        
    VOID ptest2.c:1:22                                                        
    align:1 warn_if_not_align:0>                                              

The VOID part is the machine mode of this declaration. For TYPE_DECL, the machine mode is always VOID. A comment in gcc/machmode.def says:

/* VOIDmode is used when no mode needs to be specified,
   as for example on CONST_INT RTL expressions.  */

Then start_decl() calls pushdecl() to add the type declaration into the global scope. The pushdecl() function makes two interesting calls. First it calls set_underlying_type() in gcc/c-family/c-common.cc. The comments explain the purpose of this function as follows:

/* Setup a TYPE_DECL node as a typedef representation.

   X is a TYPE_DECL for a typedef statement.  Create a brand new
   ..._TYPE node (which will be just a variant of the existing
   ..._TYPE node with identical properties) and then install X
   as the TYPE_NAME of this brand new (duplicate) ..._TYPE node.

   The whole point here is to end up with a situation where each
   and every ..._TYPE node the compiler creates will be uniquely
   associated with AT MOST one node representing a typedef name.
   This way, even though the compiler substitutes corresponding
   ..._TYPE nodes for TYPE_DECL (i.e. "typedef name") nodes very
   early on, later parts of the compiler can always do the reverse
   translation and get back the corresponding typedef name.  For
   example, given:

	typedef struct S MY_TYPE;
	MY_TYPE object;

   Later parts of the compiler might only know that `object' was of
   type `struct S' if it were not for code just below.  With this
   code however, later parts of the compiler see something like:

	struct S' == struct S
	typedef struct S' MY_TYPE;
	struct S' object;

    And they can then deduce (from the node for type struct S') that
    the original object declaration was:

		MY_TYPE object;

    Being able to do this is important for proper support of protoize,
    and also for generating precise symbolic debugging information
    which takes full account of the programmer's (typedef) vocabulary.

    Obviously, we don't want to generate a duplicate ..._TYPE node if
    the TYPE_DECL node that we are now processing really represents a
    standard built-in type.  */

In other words, each time typedef is used to introduce a new type name, a variant of the original type is created by calling build_variant_type_copy() in gcc/tree.cc, and the TYPE_DECL becomes the name of that variant. The original type is stored into the DECL_ORIGINAL_TYPE field of the TYPE_DECL, which occupies the same slot as the DECL_RESULT field. (The DECL_RESULT accessor is used only for FUNCTION_DECL nodes.)
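The relationship between the three nodes can be pictured with a toy model. The following is a minimal sketch in plain C, not GCC code: the names toy_type, toy_type_decl, toy_set_underlying_type and toy_type_spelling are hypothetical, built-in types are identified by a plain string for simplicity, and a single result slot stands in for the slot that DECL_ORIGINAL_TYPE and DECL_RESULT share.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for GCC's tree nodes.  All names are hypothetical.  */
struct toy_type_decl;

struct toy_type {
  const char *builtin_name;     /* e.g. "unsigned int" */
  struct toy_type_decl *name;   /* plays the role of TYPE_NAME */
  struct toy_type *canonical;   /* plays the role of TYPE_CANONICAL */
  unsigned precision;
};

struct toy_type_decl {
  const char *name;             /* the typedef name */
  struct toy_type *type;        /* plays the role of TREE_TYPE */
  void *result;                 /* slot shared by two accessors in GCC */
};

/* Models DECL_ORIGINAL_TYPE: it reads the shared result slot.  */
#define TOY_DECL_ORIGINAL_TYPE(d) ((struct toy_type *) (d)->result)

/* Model of `typedef <orig> <name>;`: copy the type, attach the typedef
   name to the copy, and remember the original on the declaration.  */
static struct toy_type_decl *
toy_set_underlying_type (const char *name, struct toy_type *orig)
{
  struct toy_type *variant = malloc (sizeof *variant);
  struct toy_type_decl *decl = malloc (sizeof *decl);
  assert (variant != NULL && decl != NULL);

  *variant = *orig;             /* identical properties, same canonical type */
  variant->name = decl;         /* but named by the typedef */

  decl->name = name;
  decl->type = variant;
  decl->result = orig;
  return decl;
}

/* Reverse translation: from a type back to the name the programmer used.  */
static const char *
toy_type_spelling (const struct toy_type *t)
{
  return t->name != NULL ? t->name->name : t->builtin_name;
}
```

In this model, a variable declared with the typedef name points at the variant type, so the spelling uint32_t can be recovered from the type alone, while the canonical pointer still identifies it with unsigned int.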

The second call is record_locally_defined_typedef() in gcc/c-family/c-warn.cc. The purpose is simply to warn about unused typedefs.

Eventually, c_parser_declaration_or_fndef() calls finish_decl() which in turn calls rest_of_decl_compilation(). At this point, the tree node corresponding to the declaration looks like:

 <type_decl 0x70eac111ad10 uint32_t                                           
    type <integer_type 0x70eac0e9cdc8 uint32_t unsigned SI                    
        size <integer_cst 0x70eac101f150 constant 32>                         
        unit-size <integer_cst 0x70eac101f168 constant 4>                     
        align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x70eac101c690 precision:32 min <integer_cst 0x70eac101f180 0> max <integer_cst 0x70eac101f138 4294967295>>                                                     
    VOID ptest2.c:1:22                                                        
    align:1 warn_if_not_align:0                                               
    result <integer_type 0x70eac101c690 unsigned int public unsigned SI size <integer_cst 0x70eac101f150 32> unit-size <integer_cst 0x70eac101f168 4>       
        align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x70eac101c690 precision:32 min <integer_cst 0x70eac101f180 0> max <integer_cst 0x70eac101f138 4294967295>                                                      
        pointer_to_this <pointer_type 0x70eac102c000>>>                       

The result field is in fact DECL_ORIGINAL_TYPE, as explained above.

The definition of the remove_slash() function is parsed into the following tree:

 <function_decl 0x7c3666aa3800 remove_slash
    type <function_type 0x7c3666a9cb28
        type <integer_type 0x7c3666a9cdc8 uint32_t public unsigned SI
            size <integer_cst 0x7c3666c1f150 constant 32>
            unit-size <integer_cst 0x7c3666c1f168 constant 4>
            align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7c3666c1c690 precision:32 min <integer_cst 0x7c3666c1f180 0> max <integer_cst 0x7c3666c1f138 4294967295>>
        SI size <integer_cst 0x7c3666c1f150 32> unit-size <integer_cst 0x7c3666c1f168 4>
        align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7c3666a9cbd0
        arg-types <tree_list 0x7c3666a9dd48 value <pointer_type 0x7c3666c2a1f8>
            chain <tree_list 0x7c3666a9dd20 value <integer_type 0x7c3666a9cdc8 uint32_t>
                chain <tree_list 0x7c3666a9dcf8 value <pointer_type 0x7c3666c2a0a8>
                    chain <tree_list 0x7c3666a9dcd0 value <integer_type 0x7c3666a9cdc8 uint32_t>
                        chain <tree_list 0x7c3666c0c988 value <void_type 0x7c3666c1cf18 void>>>>>>>
    public static DI ptest2.c:3:10 align:32 warn_if_not_align:0 initial <block 0x7c3666ad3060>
    result <result_decl 0x7c3666acd000 D.3485 type <integer_type 0x7c3666a9cdc8 uint32_t>
        unsigned ignored SI ptest2.c:3:10 size <integer_cst 0x7c3666c1f150 32> unit-size <integer_cst 0x7c3666c1f168 4>
        align:32 warn_if_not_align:0 context <function_decl 0x7c3666aa3800 remove_slash>>
    arguments <parm_decl 0x7c3667333080 input
        type <pointer_type 0x7c3666c2a1f8 type <integer_type 0x7c3666c2a150 char>
            unsigned DI
            size <integer_cst 0x7c3666c02f00 constant 64>
            unit-size <integer_cst 0x7c3666c02f18 constant 8>
            align:64 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7c3666c2a1f8>
        used unsigned read DI ptest2.c:3:37 size <integer_cst 0x7c3666c02f00 64> unit-size <integer_cst 0x7c3666c02f18 8>
        align:64 warn_if_not_align:0 context <function_decl 0x7c3666aa3800 remove_slash> arg-type <pointer_type 0x7c3666c2a1f8>
        chain <parm_decl 0x7c3667333100 input_buf_len type <integer_type 0x7c3666a9cdc8 uint32_t>
            used unsigned read SI ptest2.c:3:53 size <integer_cst 0x7c3666c1f150 32> unit-size <integer_cst 0x7c3666c1f168 4>
            align:32 warn_if_not_align:0 context <function_decl 0x7c3666aa3800 remove_slash> arg-type <integer_type 0x7c3666a9cdc8 uint32_t> chain <parm_decl 0x7c3667333180 output>>>
    struct-function 0x7c3666ace000>

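The gdb dumps that follow were taken in a separate run, so their addresses start with 0x7f44 instead of 0x7c36, but the low-order digits are unchanged. As mentioned before, the "canonical type" of the function type is the function type with the result type and each argument type replaced by their canonical versions: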

(gdb) p debug_tree((tree)0x7f44bbe9cbd0)
 <function_type 0x7f44bbe9cbd0
    type <integer_type 0x7f44bc01c690 unsigned int public unsigned SI
        size <integer_cst 0x7f44bc01f150 constant 32>
        unit-size <integer_cst 0x7f44bc01f168 constant 4>
        align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7f44bc01c690 precision:32 min <integer_cst 0x7f44bc01f180 0> max <integer_cst 0x7f44bc01f138 4294967295>
        pointer_to_this <pointer_type 0x7f44bc02c000>>
    SI size <integer_cst 0x7f44bc01f150 32> unit-size <integer_cst 0x7f44bc01f168 4>
    align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7f44bbe9cbd0
    arg-types <tree_list 0x7f44bbe9dd70
        value <pointer_type 0x7f44bc02a1f8 type <integer_type 0x7f44bc02a150 char>
            unsigned DI
            size <integer_cst 0x7f44bc002f00 constant 64>
            unit-size <integer_cst 0x7f44bc002f18 constant 8>
            align:64 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7f44bc02a1f8>
        chain <tree_list 0x7f44bbe9dd98 value <integer_type 0x7f44bc01c690 unsigned int>
            chain <tree_list 0x7f44bbe9ddc0 value <pointer_type 0x7f44bc02a0a8>
                chain <tree_list 0x7f44bbe9dde8 value <integer_type 0x7f44bc01c690 unsigned int>
                    chain <tree_list 0x7f44bc00c988 value <void_type 0x7f44bc01cf18 void>>>>>>>

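Notice that the arg-types list of the function type ends with a void entry. This terminator marks a prototype with a fixed number of parameters; the list for a variadic function does not end in void.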
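The trailing void entry is what separates a fixed-parameter prototype from a variadic one, and GCC's stdarg_p() in gcc/tree.cc tests for it. The following is a minimal sketch of that test on a toy linked list, assuming nothing beyond the shape of the dump above; the names toy_kind, toy_arg_list, toy_is_variadic and toy_named_arg_count are hypothetical, and an empty list is simply treated as not variadic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a tree_list of argument types.  All names are hypothetical.  */
enum toy_kind { TOY_VOID, TOY_INTEGER, TOY_POINTER };

struct toy_arg_list {
  enum toy_kind value;                /* plays the role of TREE_VALUE */
  const struct toy_arg_list *chain;   /* plays the role of TREE_CHAIN */
};

/* A prototype is variadic when its list does not end in a void entry.  */
static bool
toy_is_variadic (const struct toy_arg_list *args)
{
  const struct toy_arg_list *last = NULL;
  for (; args != NULL; args = args->chain)
    last = args;
  return last != NULL && last->value != TOY_VOID;
}

/* Count the named parameters, skipping a trailing void terminator.  */
static size_t
toy_named_arg_count (const struct toy_arg_list *args)
{
  size_t n = 0;
  for (; args != NULL; args = args->chain)
    if (!(args->chain == NULL && args->value == TOY_VOID))
      n++;
  return n;
}
```

For remove_slash() the list is pointer, integer, pointer, integer, void, so the sketch reports four named parameters and no variable arguments.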

The initial field of function_decl points to a block that represents the top-level variable scope:

(gdb) p debug_tree((tree)0x7f44bbed3060)
 <block 0x7f44bbed3060 used
    vars <var_decl 0x7f44bbed0000 output_len
        type <integer_type 0x7f44bbe9cdc8 uint32_t public unsigned SI
            size <integer_cst 0x7f44bc01f150 constant 32>
            unit-size <integer_cst 0x7f44bc01f168 constant 4>
            align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7f44bc01c690 precision:32 min <integer_cst 0x7f44bc01f180 0> max <integer_cst 0x7f44bc01f138 4294967295>>
        used unsigned read SI ptest2.c:7:12 size <integer_cst 0x7f44bc01f150 32> unit-size <integer_cst 0x7f44bc01f168 4>
        align:32 warn_if_not_align:0 context <function_decl 0x7f44bbea3800 remove_slash> initial <integer_cst 0x7f44bc10bc60 0>
        chain <var_decl 0x7f44bbed0090 prev_char_is_slash type <integer_type 0x7f44bbe9cdc8 uint32_t>
            used unsigned read SI ptest2.c:8:12 size <integer_cst 0x7f44bc01f150 32> unit-size <integer_cst 0x7f44bc01f168 4>
            align:32 warn_if_not_align:0 context <function_decl 0x7f44bbea3800 remove_slash> initial <integer_cst 0x7f44bc10bc60 0>>>
    supercontext <function_decl 0x7f44bbea3800 remove_slash
        type <function_type 0x7f44bbe9cb28 type <integer_type 0x7f44bbe9cdc8 uint32_t>
            SI size <integer_cst 0x7f44bc01f150 32> unit-size <integer_cst 0x7f44bc01f168 4>
            align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7f44bbe9cbd0
            arg-types <tree_list 0x7f44bbe9dd48 value <pointer_type 0x7f44bc02a1f8>
                chain <tree_list 0x7f44bbe9dd20 value <integer_type 0x7f44bbe9cdc8 uint32_t>
                    chain <tree_list 0x7f44bbe9dcf8 value <pointer_type 0x7f44bc02a0a8>
                        chain <tree_list 0x7f44bbe9dcd0 value <integer_type 0x7f44bbe9cdc8 uint32_t> chain <tree_list 0x7f44bc00c988>>>>>>
        public static DI ptest2.c:3:10 align:32 warn_if_not_align:0 initial <block 0x7f44bbed3060>
        result <result_decl 0x7f44bbecd000 D.3485 type <integer_type 0x7f44bbe9cdc8 uint32_t>
            unsigned ignored SI ptest2.c:3:10 size <integer_cst 0x7f44bc01f150 32> unit-size <integer_cst 0x7f44bc01f168 4>
            align:32 warn_if_not_align:0 context <function_decl 0x7f44bbea3800 remove_slash>>
        arguments <parm_decl 0x7f44bc804080 input type <pointer_type 0x7f44bc02a1f8>
            used unsigned read DI ptest2.c:3:37
            size <integer_cst 0x7f44bc002f00 constant 64>
            unit-size <integer_cst 0x7f44bc002f18 constant 8>
            align:64 warn_if_not_align:0 context <function_decl 0x7f44bbea3800 remove_slash> arg-type <pointer_type 0x7f44bc02a1f8> chain <parm_decl 0x7f44bc804100 input_buf_len>>
        struct-function 0x7f44bbece000>
    subblocks <block 0x7f44bbed3000 used
        vars <var_decl 0x7f44bbed0120 i type <integer_type 0x7f44bbe9cdc8 uint32_t>
            used unsigned read SI ptest2.c:10:17 size <integer_cst 0x7f44bc01f150 32> unit-size <integer_cst 0x7f44bc01f168 4>
            align:32 warn_if_not_align:0 context <function_decl 0x7f44bbea3800 remove_slash> initial <integer_cst 0x7f44bc10bc60 0>> supercontext <block 0x7f44bbed3060>>>

There are two variables defined in the top-level scope, called output_len and prev_char_is_slash. The for loop introduces a subblock, and this subblock contains another variable called i.
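The shape of this scope tree can be mimicked with a small data structure. The following is a rough sketch, not GCC code: toy_var, toy_block and toy_depth_of are hypothetical names, and the fields only mirror the vars, subblocks, chain and supercontext links printed by debug_tree(). In GCC the supercontext of the outermost block is the function_decl; the toy leaves it as NULL.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of a BLOCK tree.  All names are hypothetical.  */
struct toy_var {
  const char *name;
  struct toy_var *chain;           /* next variable in the same block */
};

struct toy_block {
  struct toy_var *vars;            /* variables declared in this scope */
  struct toy_block *subblocks;     /* first nested scope */
  struct toy_block *chain;         /* next sibling scope */
  struct toy_block *supercontext;  /* enclosing scope, NULL at the top */
};

/* Return the nesting depth at which NAME is declared, or -1 if it is
   not declared in B, its siblings, or any scope nested inside them.  */
static int
toy_depth_of (const struct toy_block *b, const char *name, int depth)
{
  for (; b != NULL; b = b->chain)
    {
      for (const struct toy_var *v = b->vars; v != NULL; v = v->chain)
        if (strcmp (v->name, name) == 0)
          return depth;

      int d = toy_depth_of (b->subblocks, name, depth + 1);
      if (d >= 0)
        return d;
    }
  return -1;
}
```

With the tree from the dump, output_len and prev_char_is_slash are found at depth 0 and i at depth 1. The parameters are not found at all, because they hang off the arguments field of the function_decl rather than off any block.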

The body of the function is represented as follows:

(gdb) p debug_tree(((struct tree_function_decl *)decl)->saved_tree)
 <bind_expr 0x7f44bbecade0
    type <void_type 0x7f44bc01cf18 void VOID
        align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7f44bc01cf18
        pointer_to_this <pointer_type 0x7f44bc024000>>
    side-effects
    vars <var_decl 0x7f44bbed0000 output_len
        type <integer_type 0x7f44bbe9cdc8 uint32_t public unsigned SI
            size <integer_cst 0x7f44bc01f150 constant 32>
            unit-size <integer_cst 0x7f44bc01f168 constant 4>
            align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7f44bc01c690 precision:32 min <integer_cst 0x7f44bc01f180 0> max <integer_cst 0x7f44bc01f138 4294967295>>
        used unsigned read SI ptest2.c:7:12 size <integer_cst 0x7f44bc01f150 32> unit-size <integer_cst 0x7f44bc01f168 4>
        align:32 warn_if_not_align:0 context <function_decl 0x7f44bbea3800 remove_slash> initial <integer_cst 0x7f44bc10bc60 0>
        chain <var_decl 0x7f44bbed0090 prev_char_is_slash type <integer_type 0x7f44bbe9cdc8 uint32_t>
            used unsigned read SI ptest2.c:8:12 size <integer_cst 0x7f44bc01f150 32> unit-size <integer_cst 0x7f44bc01f168 4>
            align:32 warn_if_not_align:0 context <function_decl 0x7f44bbea3800 remove_slash> initial <integer_cst 0x7f44bc10bc60 0>>>
    body <statement_list 0x7f44bbe9e1a0 type <void_type 0x7f44bc01cf18 void>
        side-effects head 0x7f44bc10bca8 tail 0x7f44bbed2348 stmts 0x7f44bbeca870 0x7f44bbeca8d0 0x7f44bbe9e2a0 0x7f44bbe9e2c0 0x7f44bbecadb0 0x7f44bbed4000 0x7f44bbe9ef00

        stmt <cond_expr 0x7f44bbeca870 type <void_type 0x7f44bc01cf18 void>
            side-effects
            arg:0 <eq_expr 0x7f44bbe9ded8 type <integer_type 0x7f44bc01c5e8 int>
                arg:0 <parm_decl 0x7f44bc804200 output_buf_len>
                arg:1 <integer_cst 0x7f44bc10bc60 constant 0>
                ptest2.c:4:22 start: ptest2.c:4:7 finish: ptest2.c:4:25>
            arg:1 <return_expr 0x7f44bbe9e200 type <void_type 0x7f44bc01cf18 void>
                side-effects
                arg:0 <modify_expr 0x7f44bbe9df00 type <integer_type 0x7f44bbe9cdc8 uint32_t>
                    side-effects arg:0 <result_decl 0x7f44bbecd000 D.3485> arg:1 <integer_cst 0x7f44bc10bc60 0>
                    ptest2.c:4:35 start: ptest2.c:4:35 finish: ptest2.c:4:35>
                ptest2.c:4:35 start: ptest2.c:4:35 finish: ptest2.c:4:35>
            ptest2.c:4:6 start: ptest2.c:4:6 finish: ptest2.c:4:6>
        stmt <cond_expr 0x7f44bbeca8d0 type <void_type 0x7f44bc01cf18 void>
            side-effects
            arg:0 <eq_expr 0x7f44bbe9dfa0 type <integer_type 0x7f44bc01c5e8 int>
                arg:0 <parm_decl 0x7f44bc804200 output_buf_len>
                arg:1 <integer_cst 0x7f44bc10bcc0 constant 1>
                ptest2.c:5:22 start: ptest2.c:5:7 finish: ptest2.c:5:25>
            arg:1 <statement_list 0x7f44bbe9e1e0 type <void_type 0x7f44bc01cf18 void>
                side-effects head 0x7f44bc10bcf0 tail 0x7f44bc10bd08 stmts 0x7f44bbe9dfc8 0x7f44bbe9e280

                stmt <modify_expr 0x7f44bbe9dfc8 type <integer_type 0x7f44bc01c3f0 char>
                    side-effects
                    arg:0 <indirect_ref 0x7f44bbe9e260 type <integer_type 0x7f44bc01c3f0 char>
                        arg:0 <parm_decl 0x7f44bc804180 output>
                        ptest2.c:5:36 start: ptest2.c:5:30 finish: ptest2.c:5:38>
                    arg:1 <integer_cst 0x7f44bc01f060 constant 0>
                    ptest2.c:5:40 start: ptest2.c:5:30 finish: ptest2.c:5:42>
                stmt <return_expr 0x7f44bbe9e280 type <void_type 0x7f44bc01cf18 void>
                    side-effects
                    arg:0 <modify_expr 0x7f44bbe9de60 type <integer_type 0x7f44bbe9cdc8 uint32_t>
                        side-effects arg:0 <result_decl 0x7f44bbecd000 D.3485> arg:1 <integer_cst 0x7f44bc10bc60 0>
                        ptest2.c:5:52 start: ptest2.c:5:52 finish: ptest2.c:5:52>
                    ptest2.c:5:52 start: ptest2.c:5:52 finish: ptest2.c:5:52>>
            ptest2.c:5:6 start: ptest2.c:5:6 finish: ptest2.c:5:6>
        stmt <decl_expr 0x7f44bbe9e2a0 type <void_type 0x7f44bc01cf18 void>
            side-effects arg:0 <var_decl 0x7f44bbed0000 output_len>
            ptest2.c:7:12 start: ptest2.c:7:12 finish: ptest2.c:7:21>
        stmt <decl_expr 0x7f44bbe9e2c0 type <void_type 0x7f44bc01cf18 void>
            side-effects arg:0 <var_decl 0x7f44bbed0090 prev_char_is_slash>
            ptest2.c:8:12 start: ptest2.c:8:12 finish: ptest2.c:8:29>
        stmt <bind_expr 0x7f44bbecadb0 type <void_type 0x7f44bc01cf18 void>
            side-effects vars <var_decl 0x7f44bbed0120 i>
            body <statement_list 0x7f44bbe9e1c0 type <void_type 0x7f44bc01cf18 void>
                side-effects head 0x7f44bc10bd98 tail 0x7f44bbed2300 stmts 0x7f44bbe9e2e0 0x7f44bbe9e180

                stmt <decl_expr 0x7f44bbe9e2e0 type <void_type 0x7f44bc01cf18 void>
                    side-effects arg:0 <var_decl 0x7f44bbed0120 i>
                    ptest2.c:10:17 start: ptest2.c:10:17 finish: ptest2.c:10:17>
                stmt <statement_list 0x7f44bbe9e180 type <void_type 0x7f44bc01cf18 void>
                    side-effects head 0x7f44bbed2378 tail 0x7f44bbed23f0 stmts 0x7f44bbe9ef80 0x7f44bbe9ef20 0x7f44bbecaab0 0x7f44bbecad80 0x7f44bbed10a0 0x7f44bbe9ef60 0x7f44bbecae10 0x7f44bbe9efc0

                    stmt <goto_expr 0x7f44bbe9ef80 type <void_type 0x7f44bc01cf18 void>
                        side-effects arg:0 <label_decl 0x7f44bc804400 D.3492>
                        ptest2.c:10:3 start: ptest2.c:10:3 finish: ptest2.c:10:5>
                    stmt <label_expr 0x7f44bbe9ef20 type <void_type 0x7f44bc01cf18 void>
                        side-effects arg:0 <label_decl 0x7f44bc804380 D.3491>>
                    stmt <cond_expr 0x7f44bbecaab0 type <void_type 0x7f44bc01cf18 void>
                        side-effects
                        arg:0 <eq_expr 0x7f44bbed1140 type <integer_type 0x7f44bc01c5e8 int>
                            readonly arg:0 <indirect_ref 0x7f44bbe9e380> arg:1 <integer_cst 0x7f44bc10bdc8 0>
                            ptest2.c:12:18 start: ptest2.c:12:9 finish: ptest2.c:12:21>
                        arg:1 <statement_list 0x7f44bbe9e400 type <void_type 0x7f44bc01cf18 void>
                            side-effects head 0x7f44bc10be40 tail 0x7f44bc10be70 stmts 0x7f44bbecaa80 0x7f44bbed1280 0x7f44bbe9e540
 stmt <cond_expr 0x7f44bbecaa80> stmt <modify_expr 0x7f44bbed1280> stmt <return_expr 0x7f44bbe9e540>>
                        ptest2.c:12:8 start: ptest2.c:12:8 finish: ptest2.c:12:8>
                    stmt <cond_expr 0x7f44bbecad80 type <void_type 0x7f44bc01cf18 void>
                        side-effects
                        arg:0 <ne_expr 0x7f44bbed1320 type <integer_type 0x7f44bc01c5e8 int>
                            arg:0 <var_decl 0x7f44bbed0090 prev_char_is_slash> arg:1 <integer_cst 0x7f44bc10bc60 0>
                            ptest2.c:17:9 start: ptest2.c:17:9 finish: ptest2.c:17:26>
                        arg:1 <statement_list 0x7f44bbe9e420 type <void_type 0x7f44bc01cf18 void>
                            side-effects head 0x7f44bbed21b0 tail 0x7f44bbed21c8 stmts 0x7f44bbecacf0 0x7f44bbed1cd0
 stmt <cond_expr 0x7f44bbecacf0> stmt <modify_expr 0x7f44bbed1cd0>>
                        arg:2 <cond_expr 0x7f44bbecad50 type <void_type 0x7f44bc01cf18 void>
                            side-effects arg:0 <eq_expr 0x7f44bbed1d70> arg:1 <modify_expr 0x7f44bbed1d98> arg:2 <statement_list 0x7f44bbe9e6c0>
                            ptest2.c:43:10 start: ptest2.c:43:10 finish: ptest2.c:43:10>
                        ptest2.c:17:8 start: ptest2.c:17:8 finish: ptest2.c:17:8>
                    stmt <postincrement_expr 0x7f44bbed10a0 type <integer_type 0x7f44bbe9cdc8 uint32_t>
                        side-effects arg:0 <var_decl 0x7f44bbed0120 i> arg:1 <integer_cst 0x7f44bc10bcc0 1>
                        ptest2.c:10:44 start: ptest2.c:10:43 finish: ptest2.c:10:45>
                    stmt <label_expr 0x7f44bbe9ef60 type <void_type 0x7f44bc01cf18 void>
                        side-effects arg:0 <label_decl 0x7f44bc804400 D.3492>>
                    stmt <cond_expr 0x7f44bbecae10 type <void_type 0x7f44bc01cf18 void>
                        side-effects
                        arg:0 <lt_expr 0x7f44bbed1078 type <integer_type 0x7f44bc01c5e8 int>
                            arg:0 <var_decl 0x7f44bbed0120 i> arg:1 <parm_decl 0x7f44bc804100 input_buf_len>
                            ptest2.c:10:26 start: ptest2.c:10:24 finish: ptest2.c:10:40>
                        arg:1 <goto_expr 0x7f44bbe9ef40 type <void_type 0x7f44bc01cf18 void>
                            side-effects arg:0 <label_decl 0x7f44bc804380 D.3491>>
                        arg:2 <goto_expr 0x7f44bbe9efa0 type <void_type 0x7f44bc01cf18 void>
                            side-effects arg:0 <label_decl 0x7f44bc804280 D.3489>>
                        ptest2.c:10:26 start: ptest2.c:10:24 finish: ptest2.c:10:40>
                    stmt <label_expr 0x7f44bbe9efc0 type <void_type 0x7f44bc01cf18 void>
                        side-effects arg:0 <label_decl 0x7f44bc804280 D.3489>>>>
            block <block 0x7f44bbed3000 used vars <var_decl 0x7f44bbed0120 i>
                supercontext <block 0x7f44bbed3060 used vars <var_decl 0x7f44bbed0000 output_len> supercontext <function_decl 0x7f44bbea3800 remove_slash> subblocks <block 0x7f44bbed3000>>>
            ptest2.c:10:3 start: ptest2.c:10:3 finish: ptest2.c:10:5>
        stmt <modify_expr 0x7f44bbed4000 type <integer_type 0x7f44bc01c3f0 char>
            side-effects
            arg:0 <indirect_ref 0x7f44bbe9eee0 type <integer_type 0x7f44bc01c3f0 char>
               
                arg:0 <pointer_plus_expr 0x7f44bbe9df28 type <pointer_type 0x7f44bc02a0a8>
                    arg:0 <parm_decl 0x7f44bc804180 output>
                    arg:1 <nop_expr 0x7f44bbe9eec0 type <integer_type 0x7f44bc01c000 sizetype>
                        arg:0 <var_decl 0x7f44bbed0000 output_len>>
                    ptest2.c:58:9 start: ptest2.c:58:9 finish: ptest2.c:58:9>
                ptest2.c:58:9 start: ptest2.c:58:3 finish: ptest2.c:58:20> arg:1 <integer_cst 0x7f44bc01f060 0>
            ptest2.c:58:22 start: ptest2.c:58:3 finish: ptest2.c:58:24>
        stmt <return_expr 0x7f44bbe9ef00 type <void_type 0x7f44bc01cf18 void>
            side-effects
            arg:0 <modify_expr 0x7f44bbed4028 type <integer_type 0x7f44bbe9cdc8 uint32_t>
                side-effects arg:0 <result_decl 0x7f44bbecd000 D.3485> arg:1 <var_decl 0x7f44bbed0000 output_len>
                ptest2.c:59:10 start: ptest2.c:59:10 finish: ptest2.c:59:19>
            ptest2.c:59:10 start: ptest2.c:59:10 finish: ptest2.c:59:19>> block <block 0x7f44bbed3060>
    ptest2.c:3:108 start: ptest2.c:3:108 finish: ptest2.c:3:108>

The function body is a bind_expr as usual. The block field on the second-to-last line belongs to the bind_expr, not to the statement_list. Each if statement is represented by a cond_expr, where arg:0 is the condition, arg:1 is the action taken if the condition is true, and the optional arg:2 is the action taken if it is false. The for loop is represented by an inner bind_expr whose body contains three labels: D.3491 marks the beginning of the loop body, D.3492 marks the condition check that follows the increment i++, and D.3489 marks the end of the loop. When the loop begins, we first jump to D.3492 to check the condition i < input_buf_len. If it is true, we jump to D.3491 and execute the loop body; otherwise, we jump to D.3489, which ends the loop.
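To make the control flow concrete, here is the same loop shape written back as C with explicit gotos. This is a sketch for illustration only: the two functions and the loop body (counting non-zero bytes) are made up, and since C labels cannot contain a dot, the labels body, cond and end stand for D.3491, D.3492 and D.3489 respectively.

```c
#include <stdint.h>

/* An ordinary for loop, as it would be written in the source.  */
static uint32_t
count_nonzero_for (const char *input, uint32_t input_buf_len)
{
  uint32_t n = 0;
  for (uint32_t i = 0; i < input_buf_len; i++)
    if (input[i] != 0)
      n++;
  return n;
}

/* The same loop in the shape of the inner statement_list.  */
static uint32_t
count_nonzero_goto (const char *input, uint32_t input_buf_len)
{
  uint32_t n = 0;
  {                       /* inner bind_expr */
    uint32_t i = 0;       /* decl_expr for i */
    goto cond;            /* goto_expr to D.3492 */
  body:                   /* label_expr D.3491 */
    if (input[i] != 0)
      n++;
    i++;                  /* postincrement_expr */
  cond:                   /* label_expr D.3492 */
    if (i < input_buf_len)
      goto body;          /* cond_expr arg:1 */
    else
      goto end;           /* cond_expr arg:2 */
  end:;                   /* label_expr D.3489 */
  }
  return n;
}
```

Placing the condition at the bottom means that each iteration executes one conditional jump, at the cost of a single unconditional jump on entry.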

The nop_expr node represents a cast (from uint32_t to sizetype) that is expected to be a no-op (see gcc/tree.def).
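The final statement output[output_len] = 0 shows how array indexing appears in the tree: an indirect_ref applied to a pointer_plus_expr whose second operand is the index converted to sizetype. In C terms this is roughly the second function below. It is a sketch only: the function names are hypothetical and size_t is used as a stand-in for GCC's internal sizetype. No scaling is visible because the element type is char; for larger element types the offset passed to pointer_plus_expr is a byte offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Indexing as written in the source.  */
static void
store_indexed (char *output, uint32_t output_len, char c)
{
  output[output_len] = c;
}

/* The same store spelled the way the tree represents it:
   indirect_ref (pointer_plus_expr (output, nop_expr (output_len))).  */
static void
store_lowered (char *output, uint32_t output_len, char c)
{
  *(output + (size_t) output_len) = c;
}
```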
