Skip to content

Instantly share code, notes, and snippets.

@alogic0
Created April 20, 2026 08:46
Show Gist options
  • Select an option

  • Save alogic0/5bfd20820de084cd81778285fa7fbd45 to your computer and use it in GitHub Desktop.

Select an option

Save alogic0/5bfd20820de084cd81778285fa7fbd45 to your computer and use it in GitHub Desktop.
Lesson 2 in Lexer Generator

To move from a flat stream of tokens to a Syntax Tree, we need to define how tokens relate to each other hierarchically.

In Tree-sitter, the goal is to create a Concrete Syntax Tree (CST). Unlike an Abstract Syntax Tree (AST), which throws away "useless" characters like parentheses or semicolons, a CST keeps everything so that the code can be reconstructed exactly as it was written.

1. The Node Structure

In Zig, we want this to be memory-efficient. Instead of using pointers for every child (which causes cache misses), we can use an index-based approach.

pub const NodeId = u32;

pub const Node = struct {
    tag: NodeTag,
    token_range: struct { start: u32, end: u32 },
    first_child: ?NodeId = null,
    next_sibling: ?NodeId = null,
};

pub const NodeTag = enum {
    source_file,
    function_definition,
    parameter_list,
    block,
    // ...
};

By using first_child and next_sibling, we represent a tree using a linked-list structure within an array. This is exactly how high-performance parsers handle large files.


2. The Parsing Algorithm: Shift-Reduce

Tree-sitter uses a bottom-up approach. Imagine a conveyor belt of tokens coming from your Lexer. The parser has two main actions:

  1. Shift: Take the next token from the Lexer and push it onto a Stack.
  2. Reduce: Look at the top of the stack. If the tokens match a rule (e.g., identifier + l_paren + r_paren), pop them and push a new Parent Node (e.g., function_call) back onto the stack.

3. Implementation Sketch in Zig

Here is how you might represent the stack-based reduction in Zig:

pub const Parser = struct {
    lexer: Lexer,
    stack: std.ArrayList(NodeId),
    nodes: std.ArrayList(Node),
    allocator: std.mem.Allocator,

    pub fn parse(self: *Parser) !NodeId {
        while (true) {
            const token = self.lexer.next();
            if (token.tag == .eof) break;

            // 1. SHIFT: Create a leaf node for the token
            const leaf_id = try self.pushNode(.{
                .tag = .terminal, // Simplified
                .token_range = .{ .start = @intCast(token.loc.start), .end = @intCast(token.loc.end) },
            });
            try self.stack.append(leaf_id);

            // 2. REDUCE: Check if we can form a higher-level construct
            // (In a real Tree-sitter, this is guided by a State Table)
            try self.checkReductions();
        }
        return self.stack.pop(); // The root node
    }
};

4. The "Incremental" Twist

The reason Tree-sitter is special is how it handles changes. If you have a tree and you change one character:

  1. It finds the smallest node that contains that character.
  2. It "breaks" that node and its parents.
  3. It re-runs the Shift-Reduce loop only on the broken parts, attempting to "re-use" the existing subtrees on the left and right that weren't touched.

Your Next Step in Zig

To truly understand this, try to implement a manual "Reduction" rule.

  • Exercise: Create a function reduceFunction() that looks at the stack. If it sees keyword_fn, an identifier, and a block, it should group them into a single function_definition node.

How comfortable are you with Zig's std.ArrayList and memory management? We'll need to be precise with allocations once we start building the full tree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment