To move from a flat stream of tokens to a Syntax Tree, we need to define how tokens relate to each other hierarchically.
In Tree-sitter, the goal is to create a Concrete Syntax Tree (CST). Unlike an Abstract Syntax Tree (AST), which throws away "useless" characters like parentheses or semicolons, a CST keeps everything so that the code can be reconstructed exactly as it was written.
In Zig, we want this to be memory-efficient. Instead of using pointers for every child (which causes cache misses), we can use an index-based approach.
pub const NodeId = u32;
pub const Node = struct {
tag: NodeTag,
token_range: struct { start: u32, end: u32 },
first_child: ?NodeId = null,
next_sibling: ?NodeId = null,
};
pub const NodeTag = enum {
source_file,
function_definition,
parameter_list,
block,
// ...
};By using first_child and next_sibling, we represent a tree using a linked-list structure within an array. This is exactly how high-performance parsers handle large files.
Tree-sitter uses a bottom-up approach. Imagine a conveyor belt of tokens coming from your Lexer. The parser has two main actions:
- Shift: Take the next token from the Lexer and push it onto a Stack.
- Reduce: Look at the top of the stack. If the tokens match a rule (e.g.,
identifier+l_paren+r_paren), pop them and push a new Parent Node (e.g.,function_call) back onto the stack.
Here is how you might represent the stack-based reduction in Zig:
pub const Parser = struct {
lexer: Lexer,
stack: std.ArrayList(NodeId),
nodes: std.ArrayList(Node),
allocator: std.mem.Allocator,
pub fn parse(self: *Parser) !NodeId {
while (true) {
const token = self.lexer.next();
if (token.tag == .eof) break;
// 1. SHIFT: Create a leaf node for the token
const leaf_id = try self.pushNode(.{
.tag = .terminal, // Simplified
.token_range = .{ .start = @intCast(token.loc.start), .end = @intCast(token.loc.end) },
});
try self.stack.append(leaf_id);
// 2. REDUCE: Check if we can form a higher-level construct
// (In a real Tree-sitter, this is guided by a State Table)
try self.checkReductions();
}
return self.stack.pop(); // The root node
}
};The reason Tree-sitter is special is how it handles changes. If you have a tree and you change one character:
- It finds the smallest node that contains that character.
- It "breaks" that node and its parents.
- It re-runs the Shift-Reduce loop only on the broken parts, attempting to "re-use" the existing subtrees on the left and right that weren't touched.
To truly understand this, try to implement a manual "Reduction" rule.
- Exercise: Create a function
reduceFunction()that looks at the stack. If it seeskeyword_fn, anidentifier, and ablock, it should group them into a singlefunction_definitionnode.
How comfortable are you with Zig's std.ArrayList and memory management? We'll need to be precise with allocations once we start building the full tree.