To move from "manual reductions" to a real Tree-sitter-like algorithm, we have to introduce the LR Stack. In a bottom-up parser, the stack doesn't just hold nodes; it holds States.
In Zig, we can represent this by combining our NodeId with a StateId. This tells the parser: "I am currently in the middle of a function definition, and I just saw an identifier. What do I expect next?"
The "State" is an integer that refers to a row in a giant transition table (the Parse Table). In Tree-sitter, this table is generated from your grammar.js.
const StateId = u32;
const StackEntry = struct {
state: StateId,
node: NodeId,
};
pub const Parser = struct {
stack: std.ArrayList(StackEntry),
// ... other fields
};The heart of the algorithm is a loop that looks at the Current State (top of the stack) and the Current Token (from the lexer). It then looks up an action in a table:
- Shift to State X: Push the current token and move to state X.
-
Reduce by Rule Y: Pop
$N$ nodes from the stack, wrap them in a new parent node, and push that parent back. - Accept: The parsing is complete.
- Error: The syntax is invalid.
Tree-sitter is a Generalized LR (GLR) parser. This is what makes it powerful enough for complex languages like C++.
If the table says "I could Shift OR I could Reduce" (a Conflict), Tree-sitter doesn't crash. It splits the stack.
In Zig, this means instead of one std.ArrayList(StackEntry), you have a list of "Heads":
const Parser = struct {
heads: std.ArrayList(*StackHead),
// ...
};
const StackHead = struct {
last_entry: StackEntry,
parent: ?*StackHead, // Links back to form a Graph-Structured Stack (GSS)
};If two versions of the "truth" end up in the same state later, Tree-sitter merges the stacks back together. This is the Graph-Structured Stack (GSS).
For your implementation in Zig, try to visualize the stack not as a simple array, but as a Singly Linked List where nodes can have multiple "children" but only one "parent" (pointing back toward the start of the file).
Why do it this way?
- Memory: When you split a stack, you don't copy the whole history; you just create two new "Head" pointers pointing to the same parent.
- Concurrency: You can process multiple interpretations of the code simultaneously.
Update your Parser to use a basic state machine.
- Create a simple
enum State { start, saw_fn, saw_name, complete }. - Write a function that takes
(current_state, token_tag)and returns thenext_state. - Try to parse
fn mainby transitioning fromstart$\rightarrow$ saw_fn$\rightarrow$ saw_name.
This is the bridge to Incrementalism. If we know exactly what state we were in at byte 500, and the user changes byte 600, we can potentially jump straight back into that state without re-parsing the first 500 bytes.
How does the concept of "splitting the stack" feel to you—does the linked-list approach make sense for managing those parallel paths?