Skip to content

Instantly share code, notes, and snippets.

@alogic0
Last active April 18, 2026 06:32
Show Gist options
  • Select an option

  • Save alogic0/86655697c124a387fe42bab84eab2e99 to your computer and use it in GitHub Desktop.

Select an option

Save alogic0/86655697c124a387fe42bab84eab2e99 to your computer and use it in GitHub Desktop.
Parser generator

Tree-sitter is a high-speed, incremental parsing system that has revolutionized how editors like Neovim and VS Code handle syntax highlighting and code navigation. To understand its algorithms, you need to bridge the gap between traditional context-free grammars and modern incremental state management.

Here is your structured learning path to mastering the mechanics behind Tree-sitter.


Phase 1: The Theoretical Foundation

Before diving into Tree-sitter’s source code, you must understand the "Classical" way of parsing. Tree-sitter is based on LR(1) parsing, but with specific enhancements.

  • Context-Free Grammars (CFG): Understand Backus-Naur Form (BNF).
  • LR(1) Parsing: Learn how Shift-Reduce parsers work using a stack and a state machine.
  • The GLR Algorithm: Tree-sitter uses Generalized LR (GLR) to handle ambiguous grammars (like C++ or JavaScript) by splitting the parser into multiple "heads" when an ambiguity is encountered.

Phase 2: Incremental Parsing

This is the "secret sauce." Most parsers re-scan the entire file when you type a single character. Tree-sitter does not.

  • Subtree Reuse: Learn how the algorithm identifies which parts of the old Concrete Syntax Tree (CST) are unaffected by an edit.
  • The Versioned Tree: Study how nodes are marked as "dirty" and how the parser resumes from the closest valid state.
  • Complexity Goals: The goal is $O(\log n)$ for most edits, where $n$ is the number of tokens.

Phase 3: The Tree-sitter Implementation

Now, look at how the library actually handles the heavy lifting.

  • The Lexer vs. The Parser: Tree-sitter uses a generated C-based state machine.
  • External Scanners: Some languages (like Python with its indentation) aren't context-free. Learn how Tree-sitter uses "external scanners" to handle these edge cases.
  • Conflict Resolution: Study how prec (precedence) and left/right associativity are baked into the generated parse table.

Phase 4: Hands-on Exploration

The best way to understand the algorithm is to see it fail and succeed in real-time.

  1. The Playground: Use the Tree-sitter Playground to visualize how a tree changes as you type.
  2. Grammar Development: Write a simple grammar for a calculator or a "mini-JSON" using the grammar.js DSL.
  3. Inspect the C: Run tree-sitter generate and look at the parser.c file. It’s a massive array of integers representing the state machine—try to trace a single "Shift" action.

Recommended Reading & Resources

  • "Incremental Analysis in High-Level Editors": The seminal paper by Tim A. Wagner (the theoretical basis for these algorithms).
  • Tree-sitter Documentation: Specifically the "Creating Grammars" section.
  • The Source: Explore the tree-sitter/tree-sitter repository on GitHub, focusing on lib/src/parser.c.

Note: Don't get discouraged by the math in the GLR papers. Tree-sitter’s primary innovation isn't just the theory, but the highly optimized C implementation that makes it fast enough to run on every keystroke.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment