Phase 4: Incrementalism and the "Dirty" Range

Now we reach the defining feature of Tree-sitter: Incremental Parsing. In a traditional parser, if a user has a 10,000-line file and types a single }, you re-parse the whole file. In Tree-sitter, you only re-parse what changed.

To do this in Zig, we need to introduce Edit Mapping and Node Reuse.

1. The Edit Structure

When a user types, we receive an "edit" event. This isn't just the new string; it's the range of bytes that were replaced.

pub const Edit = struct {
    start_byte: usize,
    old_end_byte: usize,
    new_end_byte: usize,
};

When an edit occurs, we don't throw away the old tree. Instead, we "walk" the tree and mark every node that intersects with this byte range as dirty.

2. Subtree Reuse (The "Gold" of the Algorithm)

During the next parse, the parser maintains a "cursor" on the old tree and a "current position" in the new buffer.

Before the Lexer even runs, the Parser asks: "Is there a node in the old tree at this exact position that is NOT dirty and matches what I need right now?"

Check Position: Does the old node's start_byte match my current new_pos?
Check State: Was the parser in the same StateId (from Phase 3) when it originally created this node?
Action: If yes, Shift the entire subtree onto the new stack and skip its byte length. You just saved the work of parsing 1,000 lines.

3. Implementation in Zig: The Reusable Node Lookup

You can implement a simple version of this by keeping your nodes in a sorted list or a spatial tree (like a Gap Buffer or a simple array of ranges).

pub fn findReusableNode(self: *Parser, current_pos: usize, current_state: StateId) ?NodeId {
    // 1. Look in the old tree for a node at current_pos
    const node_id = self.old_tree.findNodeAt(current_pos) orelse return null;
    const node = self.old_tree.nodes.get(node_id);

    // 2. If the node is marked dirty (overlaps with an Edit), we can't use it
    if (node.is_dirty) return null;

    // 3. Critically, the Parse Table must agree that this 'Symbol' 
    // is valid in the current StateId.
    if (!self.parse_table.isValidTransition(current_state, node.tag)) return null;

    return node_id;
}

4. The "Lookahead" Problem

There is one trap: The Lookahead. If a node was originally parsed because the lexer looked at the next character to decide where to stop, and that character changed, the node might now be invalid even if the edit was "outside" the node.

Tree-sitter solves this by storing the max_lookahead_extent for every node. When you edit, you don't just mark nodes at the edit site as dirty; you mark nodes whose "lookahead" reached into the edit site.

Your Exercise: The "No-Op" Parser

Try to simulate an edit in Zig:

Take a successfully parsed tree of fn main() {}.
"Edit" the space between fn and main.
Write a function that iterates through your std.ArrayList(Node) and marks any node whose token_range overlaps the edited index as "invalid".
Run your parser again and see if it can "skip" re-parsing the main() {} block by simply copying the old NodeIds.

Why Zig is Perfect for This

Because Tree-sitter relies on heavy "pointer-chasing" through trees, Zig’s ability to use Data-Oriented Design (storing nodes in contiguous arrays/slices) makes the "reuse" check incredibly fast compared to managed languages.

Does the logic of "Reusing a subtree only if the StateId matches" make sense? This is the bridge between the State Machine we built in the last chapter and the efficiency of a real-world editor.

alogic0/lexer_4.md

Select an option

No results found

Select an option

No results found