Guide to your first Tree-sitter grammar

NOTE: The Tree-sitter API and documentation have changed and improved since this guide was created. I can't guarantee this is up to date.

About

Tree-sitter is the new way Atom is providing language recognition features, such as syntax highlighting, code folding, autocomplete, and more. In contrast to TextMate grammars, which work by regex matching, Tree-sitter generates an entire syntax tree. More on that can be found in its own docs.

Here, we look at making one from scratch.

Requirements

Building a Tree-sitter parser will require the following tools:

  • Node.js:
    • I recommend using nvm to install this. Doing so allows you to have multiple versions of node installed, which helps if you also have other JS projects.
    • The version we will use is the one Atom is currently using. As of writing this, it's 8.9.3 (determined by the Electron version Atom is using).
  • Build tools:
    • node-gyp. I believe the only build requirements are those needed to install node-gyp itself; it has instructions for Linux, macOS, and Windows. Once those dependencies are set up, install node-gyp with npm install -g node-gyp
    • For the dependencies on Windows, I ran the command npm install --global --production windows-build-tools using an administrator command prompt. It also required changing the owner of the generated directory to your user, as described here
  • IMPORTANT: The python command must be pointing to Python 2. node-gyp (used to build the parser) uses this to run its Python scripts, and they are incompatible with Python 3 (a quick way to check your setup is shown after this list).
  • A familiarity with JavaScript will help, but is not essential. Likewise for regular expressions. Tree-sitter allows regular expressions in its rules, but only as a shorthand for explicit character combinations. It will not run the actual regular expression when parsing.
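
As a quick sanity check of these requirements (a sketch, assuming node, python, and npm are already on your PATH), you can run:

node --version     # should match the version Atom's Electron provides (8.9.3 at the time of writing)
python --version   # must report Python 2.x; node-gyp's scripts are incompatible with Python 3
npm install -g node-gyp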

Setup

Quick check

  • Clone this grammar repo
    git clone https://github.com/Aerijo/tree-sitter-quick_check
    
  • Navigate to the root of this project in a terminal / command prompt
  • Run npm install
  • Add ./node_modules/.bin to your PATH
    • If you don't add this, prefix all tree-sitter commands with ./node_modules/.bin/; e.g., tree-sitter generate becomes ./node_modules/.bin/tree-sitter generate
    • Windows users need to replace forward slashes with backslashes.
  • Run tree-sitter generate
  • Run node-gyp configure
  • Run node-gyp build
  • Run tree-sitter test

Detailed guide

Once you have installed all the required programs, we can start on setting up the package.

You should make a new file called package.json in the project root directory. Here's an example of what to put in it:

{
  "name": "tree-sitter-dummy",
  "version": "0.0.0",
  "description": null,
  "main": "index.js",
  "repository": null,
  "author": null,
  "license": null,
  "devDependencies": {
    "tree-sitter-cli": "^0.13.5"
  },
  "scripts": {
    "build": "tree-sitter generate && node-gyp build"
  }
}

Some notes:

  • By convention, the name should be tree-sitter-languageName. E.g., tree-sitter-javascript. Only word characters should be used in the name part (so [a-zA-Z0-9_]).
  • The file specified in main is automatically generated later and not a concern right now.
  • Don't forget to fill out the other fields when making your own.
  • If building for Atom, the version of tree-sitter-cli must be one Atom accepts.
    • IMPORTANT NOTE: Currently, the versions for stable and beta Atom are incompatible. Stable is ^0.12.5, beta is ^0.13.5. The official recommendation is to build for beta; eventually the versions will converge as Tree-sitter stabilises.
  • The scripts field simply defines a shortcut for the build process. Just an FYI.

If you haven't already, open a new terminal / command prompt with the project root as the working directory. When you have the package.json file in place, run npm install. This will install all dependencies for generating the parser.

  • It may also make a package-lock.json file. I don't know the best way to deal with it, but deleting it works fine (for this tutorial at least).

Additionally, make a file called grammar.js and a directory called corpus. The corpus directory is where you will place the test files (explained later). For now, the project structure should look like this:

tree-sitter-dummy
|- corpus
|  ` (test files)
|- grammar.js
|- node_modules
|  ` (dependencies)
`- package.json

At this point, I recommend checking to see if everything is working. We will do this by cloning a project I prepared earlier.

Once you have a grammar ready to compile, there are two ways to run the Tree-sitter CLI:

  • Add ./node_modules/.bin to your PATH. This lets you run tree-sitter generate directly when working from the project root directory.
  • Call ./node_modules/.bin/tree-sitter generate explicitly. This is more cumbersome, but does not require any PATH changes.

Pick one and run tree-sitter generate now.

When you execute this, it should generate several new files important for the next step. When it finishes, run node-gyp configure. The output should be a long list of info statements. If it throws an error, make sure you installed the requirements specified at the beginning. If it still doesn't work, comment below, or make a post on the discussion forum (preferred) and I'll try to help as best I can.

Finally, run node-gyp build. This will build the final parser, and (fingers crossed) that's it. To test the result, run the command tree-sitter test. If you copied over some tests into the corpus, it will run them and print the results.

Assuming everything worked, you can delete the clone and return to the project we set up earlier.

Walkthrough

Writing a grammar takes time and patience. It will be slow, it will throw errors, it will do unexpected things. Hopefully, this walkthrough can minimise the issues and mistakes you make.

Housekeeping:

  • First of all, the official docs have a good overview of the various tools you have at your disposal. They also have a short guide on writing a parser, and explain everything I'm about to go into below. If you haven't read them yet, stop here and take a moment to go over them. You may find them sufficient to get started on your own.

  • The grammar is written completely* in JavaScript. So you should be at least somewhat familiar with JS syntax. If not, you can probably still pick it up enough to use from the steps and existing examples.

* It is possible to write custom behaviour in C++, but we don't need that (for this guide at least).

With that out of the way, open up the (currently blank) grammar.js file. Fill it out with the following:

module.exports = grammar({
  name: "biber",

  rules: {

    program: $ => repeat(/\w/)

  }

})

Here, module.exports = grammar({...}) can be considered "magic"; understanding it is not required to write the grammar (though if you do understand it, great).

  • The name property should be your language's name. In this guide, I'll be building a grammar for Biber.
    • Only word characters should be used (so [a-zA-Z0-9_]).
    • Non-word characters (such as -) will break the generated parser files.
  • The rules property contains the definition of the grammar. The first entry in here is considered the start of the entire grammar. In this example, the "grammar" is any number of word characters.

How about we test this now? First, we can compile the grammar like we did in the setup section. The first time, run tree-sitter generate, then node-gyp configure, then node-gyp build. Subsequent times (if you set up the scripts field right in the package.json file) you can just run npm run build to have it do all the required steps with one command, as shown below.
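
For reference, assuming ./node_modules/.bin is on your PATH, the full sequence is:

# first build
tree-sitter generate
node-gyp configure
node-gyp build

# subsequent builds (uses the "build" script from package.json)
npm run build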

Now we need a test file. In corpus, make a new file called test.txtt. Any name is fine, and the extension doesn't matter. I use txtt because I have a syntax highlighting package for these test files that will target that filetype. In it, we will write tests of the following structure:

=====
Test name
=====

test body

---

(program)
  • The test starts with a heading wrapped in three or more equals signs.
  • Three or more dashes mark the end of the test body. Note that whitespace is preserved, but this shouldn't matter if your grammar ignores whitespace. We can leave it as is for now.
  • Below this divider, the expected parse tree is declared. For our simple grammar, it's just the root program node. Syntax will be explained later on.

Once you've written the test, run tree-sitter test. The output should be something like this:

  ♢ The biber language

  test
    ✓ Test name

✓ OK » 1 honored (0.007s)

Now that we have a working skeleton, let's add some rules.

With Biber, entries and commands start with @. Anything before this is considered a comment, or "junk". Let's add this in now.

module.exports = grammar({
  name: "biber",

  rules: {

    program: $ => repeat(choice(
      $._command_or_entry,
      $.junk,
      $.comment
    )),

    junk: $ => /[^%@\s\n\t][^%@]*/,

    comment: $ => token(seq('%', /.*/)),

    _command_or_entry: $ => "@" // TODO: handle command

  }

});

There's a lot to unpack here:

  • $ => is a normal JS arrow function with one parameter. The convention is to use $ as the parameter name, and you can see in program that we use this parameter to refer to other rules.
  • repeat means the argument can be matched zero or more times. In this case, the argument is a choice.
  • choice means any of the arguments can be matched. There is no order to this, so any conflicts will have to be resolved a different way (we will get to this later).
  • The junk rule is a terminal value, represented using a regex. Remember that Tree-sitter does not use the regex directly, so this is just shorthand.
    • You can use normal escapes and operators, such as \n, ., *, +, etc.
    • You cannot use anchors or lookaround, so ^this$ does not work, and neither does (?=this)
    • In this example, we say anything that is not % or @ is junk, but I disallowed whitespace at the start (otherwise the matches are less useful).
  • The comment rule is made of
    • A seq function to combine rules sequentially.
    • A token function to make the parser treat the contents like a single "block". Apparently this can help performance.
  • Note that I've left the _command_or_entry rule unfinished. It's recommended to write and test the grammar "breadth first" and fill in the details later (a sketch of how this rule might eventually grow follows this list). Recompile and test your changes often as you add new rules, and be sure to write actual tests for them.
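
For a rough idea of what "filling in the details" might eventually look like, here is a hedged sketch of one way the placeholder could grow inside rules: { ... }. The rule names entry, entry_type, and key are illustrative only, not part of the guide's grammar:

    _command_or_entry: $ => choice(
      $.entry
      // TODO: commands such as @comment, @string, @preamble
    ),

    // e.g. @article{someKey, ... }
    entry: $ => seq(
      "@",
      $.entry_type,   // article, book, etc.
      "{",
      $.key,          // the citation key
      // TODO: fields
      "}"
    ),

    entry_type: $ => /\w+/,

    key: $ => /[^,\s{}]+/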

Once you've updated your grammar to the above, compile with npm run build (if you set up that shortcut; else run tree-sitter generate followed by node-gyp build). Run tree-sitter test again, and you should see something like the following (but in colour)

  ♢ The biber language

  test
    ✗ Test name
        »
        actual expected

        (program (junk))
         // macros.js:14

✗ Broken » 1 broken (0.008s)

It's now picking up the test body as junk, and complaining that we didn't declare this in the expected parse tree. There are two ways of fixing this: the first is to manually update the expected parse tree. This is fine, and probably safer than the second option, but can be tedious when there are many discrepancies.

The second is to delete the entire parse tree in your test file, and run tree-sitter test again. This will throw the error, but you can just copy and paste the parse tree it gives you.

  • Note: always read over the tree and confirm it's what you expected. You will get into trouble later on if you just copy and paste without realising the parser was doing something completely different from what you wanted.
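
For the example above, the corrected test file would simply declare the junk node in the expected tree:

=====
Test name
=====

test body

---

(program (junk))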

General tips

These are some tips I haven't put in the walkthrough part of the guide (yet)

  • Absolutely never use rules that allow a zero-width match. Tree-sitter will do its best to detect this and throw errors if it finds one, but it's not perfect. If you have such a rule, parsing may hang in an infinite loop of zero-width matches.

    • The only exception is the start rule; you can use repeat directly there.
    • If you want to have something like this (note that repeat is a potential zero width match)
    array: $ => seq('[', $.optional_entries, ']'),
    optional_entries: $ => repeat($.entry)

    try the following instead

    array: $ => seq('[', optional($.entries), ']'),
    entries: $ => repeat1($.entry)

    Note that repeat1 requires at least one match, so it is no longer zero-width. In the second example, we never have a zero-width rule.

  • Remember to always use seq when writing sequential rule components. It is easy to forget this when inside another rule, e.g. like repeat

    myRule1: $ => seq("[", repeat($.first, $.second, $.third), "]"), // NO
    myRule2: $ => seq("[", repeat(seq($.first, $.second, $.third)), "]") // YES

Testing

I talk about writing tests in the sections above; here I'll go into some finer points.

  • Everything I say here can be inferred from the test program itself
  • In addition to corpus, the name grammar_test can also be used for the tests folder. corpus will take precedence over grammar_test though.
  • Folders can be nested as much as you like.
  • All files will be interpreted as tests.
    • If you use the extension txtt, you can use the tests grammar I wrote to make tests easier to read. I plan to also accept tst (tree-sitter test), or you could associate your own extension.
  • Tests do not support comments; however, any text before the first test header will be ignored (see the example after this list).
  • Test names must be wrapped in three or more equals signs, with a single newline between the parts:
    ======
    Valid test name
    ======
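
For example, a test file can start with free-form notes; everything before the first header is skipped (a sketch reusing the earlier test body and tree):

These lines are notes about the tests below.
They are ignored because they come before the first header.

=====
Leading notes
=====

test body

---

(program (junk))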
    

Integrating with Atom (WIP)

Tree-sitter is editor agnostic, but chances are you're reading this for Atom. Here are some steps to get it linked to a grammar package. The following assumes you've already got an existing package you want to add Tree-sitter support to, but I'll add steps to do it from scratch eventually.

  • Publish the Tree-sitter parser to npm.
    • This step is not required for local use, but will be if you want to publish. For local use (e.g., developing the Atom package simultaneously with the grammar) you can use npm link
  • In the Atom package's package.json file, register this module as a dependency
  • In the Atom package's grammars folder, make a new grammar file (it's standard to call it tree-sitter-name.cson)
  • Fill it out with the properties here (a rough sketch of such a file follows this list)
    • name is what will appear in the grammar selector and status bar

    • scopeName is applied as the root scope of everything

    • type declares it as a Tree-sitter grammar (it's important to get this one right)

    • parser declares the name of the npm module that provides the parser (the one you built and linked / added to npm)

    • fileTypes is an array of extensions to apply the grammar to.

      • If two grammars get the same score for a file, the Tree-sitter one will be chosen (assuming the other was a TextMate grammar)
    • injectionRegExp is a regular expression to allow this grammar to be inserted inside of other grammars.

      • I believe it only works for inserting into other Tree-sitter grammars, at least for now.
    • folds and comments are mostly self-explanatory, though I'm not sure about the exact syntax of folds yet

      • The end can be omitted from a comment. Whatever it is, it will be used for the comment line shortcut.
    • scopes translates syntax nodes into CSS selectors. The selectors are then used for syntax highlighting. See the JS grammar for examples on the syntax here.

      • Mainly, the node name on the left maps to the CSS selector on the right. However, you can also match literal characters ("{"), require certain chains (parent > child), and even use regex to get finer detail than the parser provides.
    • [If testing your own installation] Run npm install.

    • NOTE: Look into using prebuild to automate a build and distribution process when new releases are published. Users without build tools installed will be unable to build the Tree-sitter parser, so setting up CI to upload binaries targeting all common platforms to GitHub will allow prebuild to grab these instead. This means users without access to build tools can still use the package easily, and installation will be faster.

      A guide on this is in progress.
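
To tie the pieces above together, here is a rough sketch of what such a .cson file might contain for the Biber example. All values are illustrative, and folds and injectionRegExp are omitted (the guide itself is unsure of the folds syntax):

name: 'Biber'
scopeName: 'source.biber'
type: 'tree-sitter'
parser: 'tree-sitter-biber'

fileTypes: [
  'bib'
]

comments:
  start: '% '

scopes:
  'program': 'source.biber'
  'comment': 'comment.line'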

External scanners (WIP)

  • External scanners let you write C++ directly to parse sections that the declarative grammar rules can't express (a grammar.js sketch showing how external rules are declared follows this list)

  • Make the file src/scanner.cc to use one

  • Familiarity with C / C++ pretty much required (or you're going to get familiar very quickly)

  • See the HTML grammar for an example of how it uses this to match tag names

  • This file must declare 5 functions:

    • void *tree_sitter_name_create()
    • void tree_sitter_name_destroy(void *payload)
    • bool tree_sitter_name_scan(void *payload, TSLexer *lexer, bool *valid_symbols)
    • unsigned tree_sitter_name_serialize(void *payload, char *buffer)
    • unsigned tree_sitter_name_deserialize(void *payload, char *buffer, unsigned length)
  • tree_sitter_name_create makes the external scanner and returns a pointer to it

  • tree_sitter_name_destroy handles cleaning up the scanner, and freeing all memory it's holding. payload is a pointer to the scanner you made (the one returned by create).

  • tree_sitter_name_scan is the main function used to work out what the node is. It returns true when the match is successful, and false when it fails. Note that it is the one function for all external scans.

    • valid_symbols is an array of bools representing the rules declared as external. These should be declared in an enum in the scanner file. This is how you know what context your scanner is being used in; note that multiple tags may be possible, so be sure to account for all possibilities as needed.
    • lexer provides information about the "cursor" location and various other properties. Your scanner will interact with this to move along and mark the end of states. In particular, lexer->lookahead tells you what the next character is, lexer->advance moves to the next one, and lexer->mark_end, coupled with lexer->result_symbol, marks node boundaries and categorises the contents.
  • tree_sitter_name_serialize is used to save the current state at any point in time. What this state consists of is up to the scanner's implementation, but it should be quick to produce and detailed enough for full retrieval. For this, you're provided a buffer of TREE_SITTER_SERIALIZATION_BUFFER_SIZE bytes. The function should return the number of bytes of the buffer it used.

  • tree_sitter_name_deserialize takes the scanner, the buffer passed to the serialising function, and the number returned by it (which should represent the amount of buffer used). This should alter the scanner to be equivalent to how it was before the serialisation.

  • TIPS:

    • You don't get much room for serialisation (1024 bytes). For something like matching HTML tags, you want to define a list of common tags. This way, deeper serialisation will be more likely to be possible.
    • I don't have a tip for handling running out of room right now.
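
On the grammar.js side, rules handled by the external scanner are listed in an externals field and can then be referenced like any other rule. A minimal sketch (the _raw_text name is illustrative):

module.exports = grammar({
  name: "biber",

  externals: $ => [
    $._raw_text // matched by the scan function in src/scanner.cc; the order here corresponds to valid_symbols
  ],

  rules: {
    program: $ => repeat($._raw_text)
  }
});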