Lesson 2 in Lexer Generator

To move from a flat stream of tokens to a Syntax Tree, we need to define how tokens relate to each other hierarchically.

In Tree-sitter, the goal is to create a Concrete Syntax Tree (CST). Unlike an Abstract Syntax Tree (AST), which throws away "useless" characters like parentheses or semicolons, a CST keeps everything so that the code can be reconstructed exactly as it was written.

1. The Node Structure

In Zig, we want this to be memory-efficient. Instead of using pointers for every child (which causes cache misses), we can use an index-based approach.

pub const NodeId = u32;

pub const Node = struct {
    tag: NodeTag,
    token_range: struct { start: u32, end: u32 },
    first_child: ?NodeId = null,
    next_sibling: ?NodeId = null,
};

pub const NodeTag = enum {
    source_file,
    function_definition,
    parameter_list,
    block,
    // ...
};

By using first_child and next_sibling, we represent a tree using a linked-list structure within an array. This is exactly how high-performance parsers handle large files.

2. The Parsing Algorithm: Shift-Reduce

Tree-sitter uses a bottom-up approach. Imagine a conveyor belt of tokens coming from your Lexer. The parser has two main actions:

Shift: Take the next token from the Lexer and push it onto a Stack.
Reduce: Look at the top of the stack. If the tokens match a rule (e.g., identifier + l_paren + r_paren), pop them and push a new Parent Node (e.g., function_call) back onto the stack.

3. Implementation Sketch in Zig

Here is how you might represent the stack-based reduction in Zig:

pub const Parser = struct {
    lexer: Lexer,
    stack: std.ArrayList(NodeId),
    nodes: std.ArrayList(Node),
    allocator: std.mem.Allocator,

    pub fn parse(self: *Parser) !NodeId {
        while (true) {
            const token = self.lexer.next();
            if (token.tag == .eof) break;

            // 1. SHIFT: Create a leaf node for the token
            const leaf_id = try self.pushNode(.{
                .tag = .terminal, // Simplified
                .token_range = .{ .start = @intCast(token.loc.start), .end = @intCast(token.loc.end) },
            });
            try self.stack.append(leaf_id);

            // 2. REDUCE: Check if we can form a higher-level construct
            // (In a real Tree-sitter, this is guided by a State Table)
            try self.checkReductions();
        }
        return self.stack.pop(); // The root node
    }
};

4. The "Incremental" Twist

The reason Tree-sitter is special is how it handles changes. If you have a tree and you change one character:

It finds the smallest node that contains that character.
It "breaks" that node and its parents.
It re-runs the Shift-Reduce loop only on the broken parts, attempting to "re-use" the existing subtrees on the left and right that weren't touched.

Your Next Step in Zig

To truly understand this, try to implement a manual "Reduction" rule.

Exercise: Create a function reduceFunction() that looks at the stack. If it sees keyword_fn, an identifier, and a block, it should group them into a single function_definition node.

How comfortable are you with Zig's std.ArrayList and memory management? We'll need to be precise with allocations once we start building the full tree.

alogic0/lexer_2.md

Select an option

No results found