- Atom is transitioning to an entirely new way of defining grammars using tree-sitter, which will be enabled by default quite soon. It is theoretically faster and more powerful than regex-based grammars (the kind described in this guide), but has a steeper learning curve. My understanding is that regex-based grammars will still be supported (at least until version 2), so this guide can still be useful. To enable tree-sitter yourself, go to Settings -> Core and check Use Tree Sitter Parsers.
Links for tree-sitter help:
- tree-sitter: the main repo
- tree-sitter-cli: converts a JavaScript grammar to the required C/C++ files
- node-tree-sitter: module to use Tree-sitter parsers in NodeJS
- My guide on starting a Tree-sitter grammar
In Atom, syntax highlighting is a two part job: the language package gives a scope to every character in the file, while the user's syntax theme tells the editor which colour each scope should be.
Themes are not the topic of this gist. To learn how to write a theme, I suggest starting at the flight manual.
Instead, this guide will be on how to write a language grammar. Specifically, a TextMate type grammar. It is intended for complete novices, who might have the crazy idea that something like this could be fun and/or easy, and those who want to remind themselves of what they can do. If you're reading this and you notice I've missed something, or I get something wrong, please don't hesitate to leave a comment. The more people sharing their knowledge and experience, the better.
Right now, I don't feel like the guide is finished. Rather, I felt I needed to get what I had written uploaded before something terribly wrong and unpredictable happens to the file I'm writing on.
Here I've compiled a list of sites I used when writing my first language grammar. Some of these may not be intended for beginners, so think of them as a "second" step to look at when you don't get something here, or want to change things up.
- This amazing guide: could not have finished my own package without this. It's worth reading, trust me.
- TextMate Section 12: what the spec for Atom's rules is based on. Uses JSON instead of CSON, but the structure should be the same.
- DamnedScholar's gist: a template with the accepted keys, and a short comment on their function.
- Flight manual grammars entry: The official docs.
- Any of the existing language packages for major languages. Python, JavaScript, HTML, and more.
- regex101: a tool to test regex patterns. You need to convert between regular expressions defined here and ones used in regex101, as there are twice as many backslashes in the grammar rules. Also, the exact regex engine Atom uses is not available. Any of the options should do for most general cases, but there are differences in ability and syntax of the different engines.
- oniguruma: the regex engine Atom uses. Use this to learn the specific syntax available to you.
- first-mate: the package Atom uses to tokenize each line. Not necessary for writing a grammar, but a good technical reference if you want to know what's happening behind the scenes.
You might like a basic understanding of the CSON data format. Knowing about JSON might help too. However, knowledge of either is not required to get started. Hopefully though, as you start to use it more, you will come to understand the formats if you don't already. I use the terms object, array, and string frequently, so you should understand what they are at a conceptual level at least.
A quick summary:
- object: the fundamental data structure in JavaScript and JSON (JavaScript Object Notation). It is a set of key-value pairs, where accessing the object's key returns the corresponding value. In CSON (CoffeeScript Object Notation), objects are represented as follows

    key: 'value'
    name: 'your name'
    age: 8
    pets: [ # an array of pets
      'cat'
      'dog'
      'bird'
    ]
    nestedObject:
      nestedKey: 'nestedValue'
      otherKey: 'more data'

- array: seen in the above example, an array is an ordered list of values. They are denoted by square brackets, and must be comma separated if the values are on the same line. Objects in an array must be separated by using {} brackets, as will be seen later on.
- string: represents a set of characters. Denoted by quotation marks (single or double) surrounding some text. Most, if not all, end values will be strings (end, as in when the value is not itself an object or array).
Never heard of regular expressions? Me neither. Turns out, they're pretty useful, and essential to writing the grammar rules. (They can also be used with Atom's finder if the Use Regex button is active.)
I'll give a quick rundown here, but you really need to use the provided links to better familiarise yourself with what they are and how to write and test them.
- https://www.regular-expressions.info/quickstart.html
- https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
- https://www.icewarp.com/support/online_help/203030104.htm
- https://regex101.com/ (use this to test them)
First, the concept: A regular expression (regex) is a group of characters that represents a "pattern" of text. It can be used to search a larger body of text for matches, and (when programming) each match can be passed to functions and handled as desired. In our case, we use regex to search for matches that are then passed to Atom's internals, to be tokenized and processed for the syntax theme to apply colours to.
A basic regex (using JavaScript syntax) might look like the following:
/hello/
Later on, we'll see that we actually use strings to define ours, so it'll look more like
"hello"
For now though, let's examine what search patterns this rule matches.
A general rule of thumb is that all letters are exact matches. Therefore, our above rule will find all instances of the letters h, e, l, l, o appearing consecutively in a body of text.
Here's a question: where are the matches in the following body of text?
Hello to you, Othello, and hello to you too, Iago!
Note that (by default) regex are case sensitive (so no match for Hello) and do not respect word boundaries (so a match in Othello).
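If you want to check your answer, here is a minimal sketch in JavaScript. It uses JavaScript's built-in regex engine, which is not the one Atom uses, but for a pattern this simple the result should be the same. The helper name indexesOf is made up for this example.

```javascript
const text = 'Hello to you, Othello, and hello to you too, Iago!';

// Collect the start index of every match of a (global) pattern.
const indexesOf = (pattern) => [...text.matchAll(pattern)].map((m) => m.index);

// Case sensitive, no word boundaries: matches inside "Othello" and the lone "hello".
const plain = indexesOf(/hello/g);          // [16, 27]

// The i flag ignores case, so the leading "Hello" is found too.
const ignoreCase = indexesOf(/hello/gi);    // [0, 16, 27]

// \b requires a word boundary on each side, which rules out "Othello".
const wholeWord = indexesOf(/\bhello\b/g);  // [27]

console.log(plain, ignoreCase, wholeWord);
```

The last two patterns show how the defaults can be changed when you need to.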
Now, for what makes regex so useful: special characters. There are many of these in regex. A few are as follows, but a proper regex guide should be used to learn them.
- . (a period) matches any character (except a newline, by default)
- * (a star) matches zero or more of the preceding token
- ? (a question mark) matches 0 or 1 of the preceding token
- \ (backslash) changes the behaviour of the following character. Used with punctuation, it forms a literal punctuation mark. Used with a letter, it normally gives the letter a special meaning.
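A few quick checks of those special characters, again as a JavaScript sketch rather than anything Atom runs itself. The sample words are arbitrary, and every entry in the object is expected to be true.

```javascript
const checks = {
  // . stands in for any single character
  dotMatchesAnyCharacter: /h.llo/.test('hallo'),
  // * allows the preceding l to appear zero times or many times
  starAllowsZero: /hel*o/.test('heo'),
  starAllowsMany: /hel*o/.test('hellllo'),
  // ? makes the preceding u optional
  questionAllowsZero: /colou?r/.test('color'),
  questionAllowsOne: /colou?r/.test('colour'),
  // \. is a literal period, so it no longer matches an arbitrary character
  escapedDotIsLiteral: /a\.b/.test('a.b') && !/a\.b/.test('axb'),
  // \d means "a digit", whereas a plain d is just the letter d
  escapedLetterIsSpecial: /\d/.test('7') && !/d/.test('7'),
};

console.log(checks);
```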
Using these special characters, more advanced search patterns can be created. For example:
/((\\)(?:\w*[rR]ef\*?))(\{.*?\})/
Hopefully, you're now comfortable with reading and writing regular expressions. If not, don't worry too much. You can always go to regex101 and test something you don't understand.
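To take some of the mystery out of the longer pattern above, here is a sketch that runs it in JavaScript against a made-up line of LaTeX-like text (the sample text and variable names are my own, not from any grammar). It matches a backslash, any word ending in ref or Ref with an optional star, and then a pair of braces with their contents.

```javascript
const refCommand = /((\\)(?:\w*[rR]ef\*?))(\{.*?\})/;

// In a JavaScript string literal, '\\' is a single backslash character.
const sample = 'see \\cref*{fig:one} and \\ref{eq:two}';
const m = refCommand.exec(sample);

console.log(m[0]); // \cref*{fig:one}   (the whole match)
console.log(m[1]); // \cref*            (group 1: the command, backslash included)
console.log(m[2]); // \                 (group 2: just the backslash)
console.log(m[3]); // {fig:one}         (group 3: braces and contents, as short as possible)

// Without the leading backslash there is no match at all.
const noBackslash = refCommand.test('a reference {x}');
console.log(noBackslash); // false
```

The (?: ... ) part groups without capturing, which is why there is no separate group for the command name.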
If you completely don't understand regular expressions, or how they are useful, this will be a major hurdle. It is not a stretch to say that regular expressions are the backbone of a language grammar.
You can mostly just follow the flight manual's creating a grammar section for this. The rest of the tutorial will be for creating a grammar for the (fictional) example language.
- Note: following the Atom guide will link the package to the dev package directory. This means your package will only be loaded when in development mode. If you wish to make it active in a normal window, navigate to the package directory in the command line and run the command apm link
You should have a package folder, which contains the following directory structure (but with example replaced by your language's name):
language-example
|-- grammars
|   `-- example.cson
`-- package.json
And inside package.json:
{
  "name": "language-example",
  "version": "0.0.0",
  "description": "An example language grammar package",
  "repository": "https://github.com/user/package-name",
  "keywords": [
    "syntax",
    "highlighting",
    "grammar"
  ],
  "license": "MIT",
  "bugs": "https://github.com/user/package-name/issues",
  "engines": {
    "atom": ">=1.0.0 <2.0.0"
  }
}
example.cson should be blank at this point.
Are you writing this for a popular language that already has a grammar package? If so, it is likely there will be several other packages that rely on the scopes provided by the language package (spell check & autocomplete, to name a couple). These packages use the scopes for contextual information, allowing them to be smarter and more "aware" of the language. If you decide to use a nonstandard set of scopes, you risk breaking compatibility with these other packages. When deciding on new scope names, it is better to use the preexisting ones in an established grammar package rather than coming up with your own.
Additionally, these packages rely on the grammar package being active to hook their own activation. This means that you will need to sort out the package activation hooks on a case by case basis.
There are several similar terms to describe aspects of the grammar package.
This section walks you through setting up a basic grammar, with minimal rules. For more advanced features and rules, see the next section.
The top of your example.cson file should have the following entries
scopeName: 'source.example'
name: 'Example'
fileTypes: [ 'exp' ]
limitLineLength: false
- scopeName: this key determines the root scope for all characters in a document using this grammar. The convention is to use source.<language_identifier>, where the language identifier is a unique, short word. For example, the core packages use source.python and source.js for Python and JavaScript. However, there exists an additional convention where text based languages get the root scope text.<...>. This means HTML gets scoped to text.html.basic, and LaTeX (currently) to text.tex.latex. When in doubt, just use source.<...>.
- name: this is the entry that will appear in the language selection menu. It is purely aesthetic, but should simply be the language's name.
- fileTypes: an array of file extensions that are used to determine if a given file should use this grammar. This lets Atom automatically select the correct grammar when the user opens a file.
- limitLineLength: a Boolean value to tell the tokenizer whether or not to "give up" on long lines. If true, the tokenizer will only look at a maximum number of characters per line, and completely ignore the rest. This can lead to incorrect pattern matching, especially in text-like languages where paragraphs are present. Setting it to false effectively forces the tokenizer to look at the whole line, and apply the rules to everything.
There are more available properties, but they will be introduced in the intermediate section. For now, these properties will be sufficient.
Below the above entries, make a new key called patterns. Its value is an array of objects, each of which holds the information for a search pattern.
patterns: [
  {
    # rule #1
  }
  {
    # rule #2
  }
  {
    # rule #3
  }
  # etc.
]
Now, we'll look at making a specific rule.
The basic outline for a single line matching rule is as follows:
{
  comment: 'Use this to explain the function of the rule, if necessary'
  name: 'comment.line.example'
  match: '#.*$'
}
Some things to note:
- The scope name should follow one of the ones given in the TextMate manual. This is to maximise the chances that a syntax theme will have a corresponding rule to colour that scope. The final part of the scope should be the language name (the one set in scopeName at the top).
- The match key holds the regex that defines the search pattern. It is a string, which means all backslashes must be escaped with another backslash. Therefore, to match a literal \, the normal regex for which is \\, one must use \\\\ (sigh... did I mention my first package was a grammar one for LaTeX?). Also important to note is that a match will only work on one line. Even if you have a \\n inside the regex, it will not work.
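The doubled backslashes are easier to believe once you see them in action. The sketch below uses JavaScript, whose string literals treat backslashes the same way as the quoted strings in a CSON file; the variable names are just for illustration.

```javascript
// As a regex literal, matching one literal backslash takes two characters: \\
const fromLiteral = /\\/;

// As a string, each of those backslashes must itself be escaped: '\\\\'
const fromString = new RegExp('\\\\');

// The four typed characters form a two character string, which is the same regex.
console.log('\\\\'.length);                             // 2
console.log(fromString.source === fromLiteral.source);  // true
console.log(fromString.test('a\\b'));                   // true (the text is a\b)

// Forgetting to double up leaves the regex with a dangling backslash.
let failed = false;
try {
  new RegExp('\\'); // the string is a single backslash: an incomplete regex
} catch (error) {
  failed = error instanceof SyntaxError;
}
console.log(failed); // true
```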
If you're following along (which may be a good idea) you can start playing with this rule. Below, I'll give the current full contents of my example.cson file. I will not do this much, as you should be able to insert and maintain a list of rules yourself now.
# grammars/example.cson
scopeName: 'source.example'
name: 'Example'
fileTypes: [ 'exp' ]
limitLineLength: false
patterns: [
  {
    comment: 'Use this to explain the function of the rule, if necessary'
    name: 'comment.line.example'
    match: '#.*$'
  }
]
Open a new file, set its grammar to your new one, and paste in the following. You should see that the # and subsequent characters are scoped as a comment.
Normal text # comment
Now, onto a more complicated match. Try the following rule:
{
  match: '(\\*)(.*?)(\\*)'
  captures:
    0:
      name: 'meta.bold.example'
    1:
      name: 'punctuation.definition.bold.example'
    2:
      name: 'markup.bold.example'
    3:
      name: 'punctuation.definition.bold.example'
}
This introduces the captures key; its value is an object with keys corresponding to the capture groups of the match regex. Each of these keys then also has an object value, with the key name (which is like the name key in the rule above). What this does is allow different scopes to be applied to different parts of the same match. For this rule, it is applying meta.bold.example to everything (capture 0), but additionally applying punctuation.definition.bold.example to the * delimiters (captures 1 & 3) and markup.bold.example to the (arbitrary) contents of the second capture group. Note that captures: 0: is equivalent to using the name key in this case.
- If you don't know what I mean by capture group, remember the section on regex? Where I told you to learn regular expressions? I wasn't kidding.
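If capture groups are still hazy, this sketch shows which piece of text each numbered group of the bold rule picks up. It runs the same pattern string through JavaScript's regex engine (not Atom's tokenizer), and the scopes object is only a stand-in for the captures key.

```javascript
// The same string as the match key in the rule above.
const bold = new RegExp('(\\*)(.*?)(\\*)');

// Stand-in for the captures object: group number -> scope name.
const scopes = {
  0: 'meta.bold.example',
  1: 'punctuation.definition.bold.example',
  2: 'markup.bold.example',
  3: 'punctuation.definition.bold.example',
};

const m = bold.exec('some *bold text* here');

// Pair each capture group's text with the scope it would be given.
const scoped = Object.keys(scopes).map((n) => ({ text: m[n], scope: scopes[n] }));

console.log(scoped);
// text values, in order: '*bold text*', '*', 'bold text', '*'
```

Group 0 is always the entire match, which is why scoping it has the same effect as the name key.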
Before I continue, I'm going to show the "condensed" form of the same rule. I prefer it, as it wastes fewer lines on useless things like capture group numbers. A more detailed explanation is given in intermediate tips.
{
  match: '(\\*)(.*?)(\\*)'
  captures:
    0: name: 'meta.bold.example'
    1: name: 'punctuation.definition.bold.example'
    2: name: 'markup.bold.example'
    3: name: 'punctuation.definition.bold.example'
}
By now, you should be able to make some basic rules for your grammar. But what if you need to match across several lines? You want the begin and end keys.
{
  name: 'meta.section.example'
  contentName: 'markup.other.section.example'
  begin: '((\\\\)section)(\\{)'
  beginCaptures:
    1: name: 'support.function.section.example'
    2: name: 'punctuation.definition.function.example'
    3: name: 'punctuation.definition.begin.example'
  end: '\\}'
  endCaptures:
    0: name: 'punctuation.definition.end.example'
}
Some new keys:
- name: as with a match rule, the name key applies to the entire match, including the text captured by the begin and end patterns in this case.
- contentName: applies the scope to the text between, but not including, the begin and end captures.
- begin: the pattern that defines when the rule begins.
- beginCaptures: much like captures in a match rule, but only applies to the text captured by begin.
- end: the pattern that defines when the rule ends.
- endCaptures: like beginCaptures, but for the end text.
When you try this one, you might notice a distinct lack of colour. Maybe the \section part is coloured, but nothing else is (using one dark theme at least). Place the cursor at a spot you want to check and run the command Editor: Log Cursor Scope; it will show that the scopes have indeed been applied. This demonstrates the divide between grammar and theme perfectly; the scopes have all been applied, but they are not coloured because the theme ignores them. Bear in mind that scopes are not solely for themes though, and some themes may use these seemingly useless scopes. As the grammar author, it's your job to provide as much information as possible about the file, by scoping accurately.
Another thing you might have noticed is that our other rules don't work inside of the section rule (and if you were experimenting, you'd have found they don't work inside of the bold match rule either). Basically, everything from the first to last character captured by a given rule is independent from the other rules in the main patterns array. To apply rules to the captured text, we need to make a patterns array inside the current rule. This patterns array behaves much like the outside one, except the rules it contains are only applied to the text between the begin and end captures of the rule it's in.
{
  name: 'meta.section.example'
  contentName: 'markup.other.section.example'
  begin: '((\\\\)section)(\\{)'
  beginCaptures:
    1: name: 'support.function.section.example'
    2: name: 'punctuation.definition.function.example'
    3: name: 'punctuation.definition.begin.example'
  end: '\\}'
  endCaptures:
    0: name: 'punctuation.definition.end.example'
  patterns: [{
    name: 'comment.line.example'
    match: '#.*$'
  }]
}
An important question now is: what happens if one of these inner pattern rules has not finished when the end pattern could be matched? Try the following to find out.
\section{ this is a section # }
is this still a section? }
How about now?
Here's a step by step overview of what happened:
- The text \section{ is matched as the beginning of the rule.
- The tokenizer started looking for the end pattern, or any matches to the rules in the local patterns array.
- The tokenizer matched the # } at the end of the first line with the comment rule in patterns, effectively hiding the first }.
- The tokenizer continued looking for patterns rule matches or end matches when the comment rule ended (the end of the line in this case).
- It sees the } on the second line and matches it with the end pattern.
- The rule is finished, so the final line has no special scopes.
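The steps above can be modelled with a toy scanner. To be clear, this is not how first-mate is implemented; it is a heavily simplified sketch with hypothetical names (scanLines and the log messages), hard coded to the one section rule and its one comment sub rule. The only idea it captures is that, inside the rule, whichever of the end pattern or the child pattern matches earliest on the line wins.

```javascript
const begin = /\\section\{/;
const end = /\}/;
const comment = /#.*$/; // the single rule in the local patterns array

function scanLines(lines) {
  const log = [];
  let inside = false; // are we between begin and end?
  lines.forEach((line, i) => {
    let pos = 0;
    while (pos < line.length) {
      const rest = line.slice(pos);
      if (!inside) {
        const b = begin.exec(rest);
        if (!b) break; // nothing more to do on this line
        log.push(`line ${i + 1}: begin`);
        inside = true;
        pos += b.index + b[0].length;
        continue;
      }
      const e = end.exec(rest);
      const c = comment.exec(rest);
      if (!e && !c) break;
      // The earliest match wins; on a tie this toy model prefers the end pattern.
      if (c && (!e || c.index < e.index)) {
        log.push(`line ${i + 1}: comment`);
        pos += c.index + c[0].length; // the comment swallows the rest of the line
      } else {
        log.push(`line ${i + 1}: end`);
        inside = false;
        pos += e.index + e[0].length;
      }
    }
  });
  return log;
}

const log = scanLines([
  '\\section{ this is a section # }',
  'is this still a section? }',
  'How about now?',
]);
console.log(log);
// [ 'line 1: begin', 'line 1: comment', 'line 2: end' ]
```

The } on the first line never gets a chance to end the rule, because the comment starts before it and consumes it.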
But what if we actually wanted the rules in the main patterns array to be active inside a begin/end rule? For this, there is the include key. It takes the name of a rule defined in the repository (explained later), and pretends that rule was actually there. In this case, where we want it to match the main patterns array, we would use one of two values:
- $self: (note this is not a regex) this value refers to the current grammar. That is, the context it's used in will have the rules in the main patterns array applied to it.
- $base: similar to $self, but with some differences when embedded in another grammar. Not important right now, but just remember that $base is not the same as $self when your grammar is embedded in another. $self points to the grammar $self appears in (points to itself), whereas $base points to the base language of the file, which could be anything. If you don't know what I mean by embedded, don't use $base.
Right now, your example.cson file should look something like this:
scopeName: 'source.example'
name: 'Example'
fileTypes: [ 'exp' ]
limitLineLength: false
patterns: [
  {
    comment: 'Use this to explain the function of the rule, if necessary'
    name: 'comment.line.example'
    match: '#.*$'
  }
  {
    match: '(\\*)(.*?)(\\*)'
    captures:
      0: name: 'meta.bold.example'
      1: name: 'punctuation.definition.bold.example'
      2: name: 'markup.bold.example'
      3: name: 'punctuation.definition.bold.example'
  }
  {
    name: 'meta.section.example'
    contentName: 'markup.other.section.example'
    begin: '((\\\\)section)(\\{)'
    beginCaptures:
      1: name: 'support.function.section.example'
      2: name: 'punctuation.definition.function.example'
      3: name: 'punctuation.definition.begin.example'
    end: '\\}'
    endCaptures:
      0: name: 'punctuation.definition.end.example'
    patterns: [{ include: '$self' }]
  }
]
Try it out on the following text:
Normal text # comment
* bold # text * <- not commented
\section{
text
# comment
* bo-#-ld * <- still not commented
text
}
text
Can you see what needs to be done to get comments working in a bold match? Did it work? Why not?
Remember, the match pattern will only ever work on a single line (the tokenizer only looks at one line at a time; it literally doesn't see anything else). To get comments working in the bold rule, and get the bold rule to work across multiple lines, it needs to be converted to a begin/end rule as follows:
{
  name: 'meta.bold.example'
  contentName: 'markup.bold.example'
  begin: '\\*'
  beginCaptures:
    0: name: 'punctuation.definition.bold.example'
  end: '\\*'
  endCaptures:
    0: name: 'punctuation.definition.bold.example'
  patterns: [{ include: '$self' }]
}
And so concludes the beginners section of the guide. With the tools above, you should be able to produce a grammar of reasonable complexity. What follows are some tips for intermediate authors, for additional features and best practices.
A feature mentioned above, but not explained, is the repository. For a grammar of any reasonable size, the repository is vital to help organise your rules.
To make it, add the repository key after the main patterns array. Its value is an object, so do not add brackets after it. For example:
scopeName: 'source.example'
name: 'Example'
fileTypes: [ 'exp' ]
limitLineLength: false
patterns: [{ include: '#lineComment' }]
repository:
  lineComment: {
    comment: 'This is a rule object, with the same abilities as any other'
    name: 'comment.line.example'
    match: '#.*$'
  }
  secondRule: {
    ...
  }
  thirdRule: {
    ...
  }
In the above example, a rule with the name lineComment has been added to the repository. Note that rules in the repository are not automatically applied. They must be include'd inside the main patterns array, or into another rule's child patterns array. To properly refer to this rule, the include key must have the value '#lineComment' as it does in the example. A rule may also include itself, in which case it will recursively apply itself as much as possible. This recursion also occurs with $self, but $self refers to the entire grammar, not just that specific rule.
I'm of the opinion that all rules should be added to the repository. Then, they can be activated as desired by including them in the patterns array. I also like to group sets of rules into "meta" patterns, that are made up almost entirely of other include'd rules. This allows you to form customised sets of rules that can be applied consistently, without repetition. For an example of this, see my end result from when I tried writing one. It's not perfect, and I could probably do with following my own advice in some parts. Overall though, I'm reasonably happy with it. You'll notice I use comments to help organise the repository; this is another thing you should do to help yourself and others when trying to understand what you've done.
Another feature alluded to above, when talking about $self vs $base, is embedding grammars. Basically, other grammars can embed your grammar into their rule set, and vice versa. When embedding another grammar, you need to use include: 'source.language' (where source.language is the root scope of the target language; watch out for the text. versions). For example
{
  begin: '```'
  end: '```'
  patterns: [{ include: 'source.js' }]
}
will scope the contents between a pair of triple back-ticks as JavaScript.
$base is important for embedded grammars, as it points to the file's root grammar. This means that if your grammar is embedded into another, e.g., a markdown grammar, $base will point to the markdown grammar, not yours. Sometimes this behaviour is desirable, and is used extensively in the C family of grammars.
One thing to be wary of is leakage: this occurs when a scope from the embedded grammar has not been closed, and it prevents your rule from seeing the end pattern. This is highly likely when the user will only be writing a portion of a code snippet, where there might be an opening brace but no closing one.
This can be seen using the rule above. In a file, leave an unmatched { in the JavaScript section. Now, instead of picking up the back-ticks as matching the end pattern, it will instead be interpreted as a JavaScript string. From there, all scoping that follows will likely be broken.
Currently, there is no solution to this problem (that I am aware of). Hopefully a key will be added that makes the end pattern more important, so it will be checked first before all others.
As a grammar author, you need to consider this from the other side too, i.e., thinking about others embedding your grammar. For every begin rule you add, it should have an end. Sometimes, less is more. If a match rule works just as well, use the match.
The condensed form introduced at the beginning is a good thing to use, but here I'll go through some style tips I developed after trying to read some grammars (both my own and others) myself.
- Keep the patterns array clean. If there are a lot of rules building up in one, consider moving them to the repository and include-ing them. This keeps the active rule set clutter free, while succinctly expressing the function or intention of each rule.
- When there is only one entry in an object or array, keep it all on one line. For example, compare below. While the first may look more spaced out and easier to read, the second really helps when scrolling through a long list of rules. It's effectively halved the number of lines, while representing the exact same rule.
# Spread out
{
  begin: '((\\\\)texttt)\\s*(\\{)'
  beginCaptures:
    1:
      name: 'support.function.texttt.latex'
    2:
      name: 'punctuation.definition.function.latex'
    3:
      name: 'punctuation.definition.arguments.begin.latex'
  end: '\\}'
  endCaptures:
    0:
      name: 'punctuation.definition.arguments.end.latex'
  contentName: 'markup.raw.texttt.latex'
  patterns: [
    {
      include: '$self'
    }
  ]
}
# Condensed
{
  begin: '((\\\\)texttt)\\s*(\\{)'
  beginCaptures:
    1: name: 'support.function.texttt.latex'
    2: name: 'punctuation.definition.function.latex'
    3: name: 'punctuation.definition.arguments.begin.latex'
  end: '\\}'
  endCaptures:
    0: name: 'punctuation.definition.arguments.end.latex'
  contentName: 'markup.raw.texttt.latex'
  patterns: [{ include: '$self' }]
}
- When there are multiple entries, break them across new lines. The spacing in this case (of the curly brackets column) helps with recognising when a group is together, and when it is separate. For example
# With multiple entries...
patterns: [ # multiple pattern entries, so new line
  { # multiple entries, so it gets a new line
    comment: 'Handles all types of comments'
    include: '#commentMeta'
  }
  { include: '#escapedCurlyBracket' } # single entry, so no newline
  { include: '#metaOpenBrace' } # another single entry
]
# With one entry...
patterns: [{ # only one entry, so no new line between the [ and the {
  match: 'blah' # the rule object has multiple entries, so a new line is required after the {
  name: 'blah'
  ...
}]
- The braces are optional when the array has one entry. I like them, so I use them. Keeping them also makes it easier to add additional rules later, so think carefully before omitting them.
- Be consistent. Worse than using any one style is using an inconsistent mixture, and making the reader think about the format of what they are reading.
- Don't use quotation marks for the key names. I haven't used them in this guide, but you will likely see some packages that do. To the best of my knowledge, these quotation marks do not contribute anything. In fact, they actively detract from comprehension because syntax highlighting themes will make everything the same colour. By contrast, with the unquoted form using language-coffee-script and the one-dark theme, I see key names as red, numbers as orange, and strings as green.
In addition to the name, patterns, repository, etc. properties, there are some others that are recognised:
- firstLineMatch: a regex string that assists Atom's automatic language selector. The selector generates a score for each language based on the file's extension and contents (so it was also using fileTypes behind the scenes). To
It is possible to include specific rules from another grammar's repository: simply use the syntax
{ include: 'source.example#ruleName' }
When the # character is not first, the part before it is taken as the grammar scope name. The part after is then read as the repository name, like with internal include statements.
The actual function determining this behaviour is here.
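The different include forms seen in this guide can be told apart purely by where the # sits. The sketch below is my own illustration of that idea, not a copy of first-mate's code; parseInclude and the kind labels are hypothetical names.

```javascript
// Hypothetical helper: classify the value given to an include key.
function parseInclude(value) {
  if (value === '$self' || value === '$base') {
    return { kind: value.slice(1) }; // 'self' or 'base'
  }
  const hash = value.indexOf('#');
  if (hash === 0) {
    // '#ruleName': a rule in this grammar's own repository
    return { kind: 'local', rule: value.slice(1) };
  }
  if (hash === -1) {
    // 'source.js': a whole other grammar
    return { kind: 'grammar', scopeName: value };
  }
  // 'source.example#ruleName': one rule from another grammar's repository
  return {
    kind: 'external',
    scopeName: value.slice(0, hash),
    rule: value.slice(hash + 1),
  };
}

console.log(parseInclude('#lineComment'));
console.log(parseInclude('source.js'));
console.log(parseInclude('source.example#ruleName'));
console.log(parseInclude('$self'));
```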
Here I'll talk about valid scope names, and some good practices.
- Note: Dynamic scope names do not play nicely with some packages that depend on reading the scope. For example, linter-spell will only accept a list of absolute scopes to blacklist, making it incompatible with this style of scoping. Caching of values based on scope (e.g., for autocomplete) will also be negatively affected by dynamic scoping. This guide presents them in a "this is possible" sense, rather than "you should do this".
One feature I haven't mentioned at all yet is using the capture groups in scope names and other parts of the regex pattern. This is possible using $n and \\n (not newline!) notation, where n is the capture group number. For example, using $n in scope names
{
  name: 'support.function.section.latex'
  begin: '((\\\\)(section|paragraph|part|chapter)(\\*)?)(?=[^a-zA-Z@])'
  beginCaptures:
    #                             v
    1: name: 'entity.name.section.$3.latex'
    2: name: 'punctuation.definition.function.latex'
  end: '\\}'
  endCaptures:
    0: name: 'punctuation.definition.end.latex'
  patterns: [{ include: '$self' }]
}
And using \\n in a regex match
{
  name: 'string.function.verbatim'
  #                                   v
  match: '\\\\verb([^a-zA-Z])(.*?)(?:(\\1)|$)'
  captures:
    0: name: 'support.function.verbatim.latex'
    1: name: 'punctuation.latex'
    2: name: 'markup.raw.verbatim.latex'
    3: name: 'punctuation.latex'
}
Some rules I've observed when experimenting with backreferencing:
- Attempting to use \\n in a scope name results in the error invalid backref number/name thrown by first-mate. Only the $n can be used here.
- The opposite is also impossible; $ is an active character in regex, so $n will never match with the single line matching we are restricted to. Only \\n can be used here.
- If the $n capture group does not exist, it becomes a normal scope name. E.g., the scope would become literally support.function.$50.example, and not a reference to the 50th capture group. An empty match still counts as a match, and if this happens the scope would become support.function..example (note the double .).
- name and contentName can only use the capture groups in the begin regex. Attempting to use higher numbers does not result in overflowing to the end capture groups.
- beginCaptures and endCaptures will only use the capture groups in begin and end regular expressions respectively. There is no way to use a value captured in a begin group as a scope name in an end scope, and vice versa.
- \\n only refers to capture groups in the begin regex. It can be used in the end regex, but will not refer to the end capture groups. Nor will it overflow to start meaning end capture groups if the number is higher than the number of capture groups in the begin regex.
- match behaves as if it were a begin key, for the purposes of this numbering.
- \\n only works for up to the number of capture groups there are. If there are fewer than nine capture groups, and \\9 is used, an error will be thrown. For numbers higher than 9, no errors are emitted, but that rule will not work.
- oniguruma (the regex engine Atom uses) provides alternative syntax for \\n matches: \\k<n>, where n is any integer (e.g., use \\k<2> for the second capture group). For more on the syntax, see the oniguruma docs. This verbose syntax doesn't work in scope names either.
- \\0 refers to the entire begin match.
- Scopes probably shouldn't have punctuation in the sections, so make sure you don't just put arbitrary text in. E.g., use ([a-zA-Z\\d]*) as opposed to (.*). I ran into some bizarre errors when I had punctuation in the scope names, but they are difficult to reproduce (and I've forgotten the original cause).
If you have anything to add to this list, please leave a comment. I want to make this an exhaustive list of what's possible and impossible with backreferencing.
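To see both notations in action outside of Atom, here is a sketch using JavaScript's regex engine, which also understands \1 style backreferences. The pattern strings are the ones from the two rules above. The fillScopeName helper is hypothetical: it only imitates the $n substitution described in the list (leaving $n alone when the group does not exist), and treating an unmatched optional group as empty text is my assumption, not something I have checked against first-mate.

```javascript
// \\1 in the pattern: the closing delimiter must equal whatever group 1 captured.
const verb = new RegExp('\\\\verb([^a-zA-Z])(.*?)(?:(\\1)|$)');

const closed = verb.exec('\\verb|x + y| and more');
console.log(closed[1], closed[2], closed[3]); // | x + y |

const unclosed = verb.exec('\\verb+never closed');
console.log(unclosed[2], unclosed[3]); // never closed undefined

// Hypothetical imitation of $n in scope names.
function fillScopeName(scope, match) {
  return scope.replace(/\$(\d+)/g, (whole, n) => {
    const index = Number(n);
    if (index >= match.length) return whole; // no such group: keep the literal $n
    return match[index] || '';               // assumption: unmatched group becomes empty
  });
}

const section = new RegExp('((\\\\)(section|paragraph|part|chapter)(\\*)?)(?=[^a-zA-Z@])');
const found = section.exec('\\chapter{Introduction}');

const filled = fillScopeName('entity.name.section.$3.latex', found);
const untouched = fillScopeName('support.function.$50.example', found);
console.log(filled);    // entity.name.section.chapter.latex
console.log(untouched); // support.function.$50.example
```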
Yes, it's possible. Try the following.
{
  contentName: 'keyword.example'
  begin: '\\-\\s*(.*?)\\s*\\-'
  beginCaptures:
    1:
      name: 'markup.heading.example'
      patterns: [{
        name: 'constant.character.example'
        match: 'b(.*?)b'
        captures:
          1:
            name: 'markup.italic.example'
            patterns: [{
              match: 'c'
              name: 'support.function.example'
            }]
      }]
  end: '-'
}
Try it with this! Check the scopes too with Editor: Log Cursor Scope.
1. - a b c d b c - hello -
2. - b c d b c - hello -
3. - a c d b c - hello -
4. - a b d b c - hello -
5. - a b c b c - hello -
6. - a b c d c - hello -
7. - a b c d b - hello -
Notice that the main rule will always match if the begin regex works. It is also immune to the patterns applied to the capture groups; any matches in this way will be isolated to the captured group and will not leak into the rest of the main rule. Not even if another begin pattern is used in the capture group (which is allowed).
For this particular rule, the first capture group is given the scope `markup.heading.example`. This is done with `name`, which is how the capture group has always been scoped in previous examples. What's new is the `patterns` array that is also in the capture group object. Its entry, a single rule (multiple are allowed; it's just like any other `patterns` array), attempts to find a pair of `b` characters. If it succeeds, it will then attempt to match a `c` character between the `b`'s.
Applying this to the example text, we get:
1. A complete match:
   - `a b c d b c` is scoped as a heading
   - `b c d b` satisfies the 1st capture group pattern, and is further scoped as `constant.character.example`
   - The first `c` satisfies the 1st capture group pattern of this sub pattern, and is further scoped as `support.function.example`. The second `c` is not between the `b`'s, so it is ignored.
2. Another complete match. The initial `a` was never a required part.
3. Only matches the initial `begin` pattern, as the match `'b(.*?)b'` is not satisfied. Because of this, the `c` pattern is never even looked at.
4. The `b` pattern is matched, so the `c` pattern is looked for, but there are no `c`'s within the `b`'s, so it is ultimately ignored.
5. Another complete match, similar to the first and second. The missing `d` was not a required part of any pattern.
6. Similar to the third, as the `b` pattern cannot be completed (there is no second `b`).
7. Similar to the second, except it's the second `c` that is missing and not the `a`.
You should experiment yourself with nesting rules to see what happens. If you observe any quirky or unexpected effects not mentioned here, please leave a comment explaining how to reproduce and I'll add an explanation to this section.
Introducing a new property: `injections`. This one sits at the root level of your grammar file, much like `scopeName` and `patterns`. Its value is an object, whose keys are scope selectors. The value of each key is another object, whose sole property is a `patterns` array. This `patterns` array has the same form as any other shown in this guide, and can be considered functionally the same. All other properties are ignored.
The purpose of injections is to provide patterns based on scope rather than `include`ing or nesting them. I found the best way to explain them was by example: consider the PHP grammar provided by `language-php`. It actually provides two grammar files: one for PHP syntax (`php.cson`), and a wrapper one for HTML syntax (`html.cson`). The pure PHP grammar does not get applied to any files automatically, as it lacks the `fileTypes` and `firstLineMatch` keys. Instead, the provided HTML grammar is applied to various PHP related files. This grammar sets the root scope name to `text.html.php`, and provides two rules: a new comment, and the entire `text.html.basic` grammar (provided by `language-html` by default).
What is special though, is the injections property it contains. Whenever the scope contains text.html.php (which is the root scope), and none of the scopes start with meta.embedded or meta.tag, it will attempt to match the given patterns, in this case being the php-tag rule in the repository. If these rules match, only then will the pure PHP grammar be inserted via an include statement.
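The overall shape of such a wrapper grammar can be sketched as follows. This is not the actual `language-php` file: the `example` names, the tag regexes, and the `source.example` grammar it includes are all hypothetical, and the selector is a simplified version of the kind described above.

```cson
scopeName: 'text.html.example'
name: 'Example (HTML wrapper)'
fileTypes: ['example']
injections:
  # Key: a scope selector. Applies wherever the root scope is present,
  # except inside meta.embedded or meta.tag scopes.
  'text.html.example - (meta.embedded | meta.tag)':
    patterns: [{ include: '#example-tag' }]
# The wrapper itself only pulls in the plain HTML grammar
patterns: [{ include: 'text.html.basic' }]
repository:
  'example-tag':
    name: 'meta.embedded.block.example'
    begin: '<\\?example'
    end: '\\?>'
    # Only once the tag has matched is the (hypothetical) pure grammar included
    patterns: [{ include: 'source.example' }]
```

Because the tag rule applies a `meta.embedded` scope, the selector stops matching inside the tag, so the injected rule is not looked for again until the tag has ended.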
Some technical notes:
- Only the active grammar's injections are applied in a file. Injections in other grammars are not considered.
- Matches to injected patterns will be looked for last, after the active grammar and any `injectionSelector` added grammars. This can be influenced with scope prefixes though.
- IMPORTANT: The `injectionSelector` causes a bug where any grammar with one will not automatically apply itself to an opened file. For a workaround, use the following style in an independent CSON file:

  ```cson
  injectionSelector: 'source.embedded.latex' # when this scope is present in another grammar, inject this grammar
  scopeName: 'source.embedded.latex'
  patterns: [{ include: 'text.tex.latex' }]
  ```

- I will follow up on this after some testing, but I believe `injections` in a grammar that has been inserted via an `injectionSelector` should work.
To be added
To be added. For now, see my forum question.
This section looks at how the cson file is converted into a grammar by first-mate, and how the grammar is used to apply scopes to each character in the file. It will however focus more on what you as a package author can do, rather than the exact steps the first-mate package takes to apply scopes to everything.
To see the source code in its final form, you need to extract `.atom/.apm/first-mate/<version>/package.tgz`. This has been transpiled from CoffeeScript to JavaScript, and built using Grunt, so the files and directory structure will be different to the online source code. When writing this, I found it easiest to read the CoffeeScript version, while keeping in mind that the actual paths are those in the transpiled version.
- Note that JavaScript doesn't have classes, but I'll call them that because the objects are all defined using the class syntax.
A short summary of every recognised property of every construct you as a package author have access to.
In the root level of the grammar file:
- `name`
- `fileTypes`
- `scopeName`
- `foldingStopMarker` (not used)
- `maxTokensPerLine`
- `maxLineLength`
- `limitLineLength`
- `injections`
- `injectionSelector`
- `patterns`
- `repository`
- `firstLineMatch`
In a normal patterns array object:
- `name`
- `contentName`
- `match`
- `begin`
- `end`
- `patterns`
- `captures`
- `beginCaptures`
- `endCaptures`
- `applyEndPatternLast`
- `include`
- `popRule`
- `hasBackReferences`
- `disabled`
First of all, require("first-mate") provides three classes:
- `ScopeSelector`
- `GrammarRegistry`
- `Grammar`

Their behaviour is described in the following sections, as is that of the other classes they depend on.
I don't know much about this one. It appears to be responsible for the scopes, and the specs show it supports some interesting syntax. Much of it is generated from a PEG.js file though, so it will be difficult to understand without additional knowledge.
If someone can explain prefixes and injections, that would be appreciated.
Atom automatically creates and populates a `GrammarRegistry` instance. This instance is available as `atom.grammars`, so definitely look at it in dev tools. For a specific grammar (next section) use `atom.grammars.grammarForScopeName("<scope_name>")`.
Its job is to hold a group of grammars together, and provide helper functions. The steps taken to add grammars from packages are roughly as follows:
- At some time or another, the method `loadGrammars` of the `Package` class is run for each package. This looks inside the package for a `grammars` folder, and if present it will attempt to find any `.cson` or `.json` files in there.
- The method `readGrammar` uses the `season` package to read in the `.cson` and `.json` files in the `grammars` directory. Note that this means you could write a grammar in JSON format. It throws an error if the object is invalid, and then checks that the `scopeName` property is a non-empty string (throwing an error if it isn't).
- It calls the method `createGrammar`, with the arguments of the file path and the newly formed object. This method sets the `maxTokensPerLine` and `maxLineLength` properties of the object (if not already present; if they are, the existing values are used instead). Additionally, it will then check if the object has the `limitLineLength` property. If `false`, it will set `maxLineLength` to `Infinity`, regardless of earlier steps.
- It makes a `new Grammar` with the arguments of the global registry itself (`this`) and the (slightly modified) object. The section on grammar object creation is below.
- When the grammar object returns, the method `createGrammar` continues and sets the property `path` to the original file path, returning the grammar object.
- When this is returned, the `loadGrammars` method of the `Package` class continues. It sets the property `packageName` of the grammar object to the package's name, and the property `bundledPackage` to the package's `bundledPackage` property value. Finally, it pushes the grammar object to the array of grammars provided by that package. It also runs the function `grammar.activate()`, which pushes the grammar to the global registry.
This is where a single set of rules for a given language is defined. In the walk through above, it is called with the global grammar registry (which was passed in via the Package class) and a slightly modified version of the CSON file.
Immediately, the GrammarRegistry it is called with is added as the property registry to the grammar object. Additionally, some select properties are looked at in the CSON file object. The following are directly added as properties to the grammar object:
- `name`: explained in beginner guide. Used as a human friendly label in the language selection window.
- `fileTypes`: explained in beginner guide. Array of file extensions used to score grammars against a given file. If not provided, an empty array will be automatically created.
- `scopeName`: explained in beginner guide. The root scope applied to all characters, regardless of pattern matches.
- `foldingStopMarker`: will be covered in intermediate tips. Seems more or less useless for now.
- `maxTokensPerLine`: the maximum number of rules that will be applied per line. Potentially added by the `createGrammar` method in the above walk through if not already set.
- `maxLineLength`: the maximum number of characters that will be tokenized per line. When internally set to `Infinity`, it will have no limit. Potentially added or modified by the `createGrammar` method in the above walk through. Setting to `Infinity` directly in the grammar file results in an error and the grammar will not load.
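As a rough sketch of how the line limit properties could be set from a grammar file: the grammar name, scope, file type, and the value `200` below are all placeholders, and only the three limit-related keys come from the descriptions above.

```cson
scopeName: 'source.example'
name: 'Example'
fileTypes: ['example']
# With limitLineLength set to false, createGrammar sets maxLineLength to Infinity.
# Writing `maxLineLength: Infinity` here instead would stop the grammar from loading.
limitLineLength: false
# Used as is, because createGrammar only fills this in when it is missing
maxTokensPerLine: 200
patterns: []
```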
The following are also recognised properties of the file object, but are processed somewhat before being added to the grammar object:
- `injections`: the grammar property `injections` is set to the result of a `new Injections` called with the grammar object and the `injections` property of the file object. The `Injections` class is addressed in another section. Basically, it is a set of scopes, and rules that are applied when the scope is reached. They only work when the grammar is the active one. For when it's not active, `injectionSelector` must be used. For more on injections, see advanced tips.
- `injectionSelector`: scopes to insert the grammar into when they occur in another grammar. For example, `language-hyperlink` uses `'text - string.regexp, string - string.regexp, comment, source.gfm'` (the `-` sign means the following scope is not allowed; so it injects in `text`, but not when also in `string.regexp`). The grammar property is defined as the result of calling `new ScopeSelector` on the file object's `injectionSelector` property. If this is not defined, a value of `null` is used.
- `patterns`: the grammar object's `rawPatterns` property is set directly to the `patterns` property of the file object.
- `repository`: the grammar object's `rawRepository` property is set directly to the `repository` property of the file object.
- `firstLineMatch`: if defined, it is used to create a new (oniguruma) regex object, which is then made the value of the grammar object's `firstLineRegex` property.
Finally, there are some properties of the grammar object that are created directly in the construction of the new grammar:
- `emitter`: an emitter is a JavaScript object that can be used to trigger functions in response to the events it emits.
- `repository`: initialised to `null`, but the method `getRepository` (not called during construction) sets it to a rule set using information in the `rawRepository` property, which in turn was set by the file object's `repository`.
- `initialRule`: initialised to `null`, but the method `getInitialRule` (called during tokenization) sets it to `@createRule({@scopeName, patterns: @rawPatterns})` if it doesn't already have a value (that's not `null`). It is the first set of rules that will be checked.
- `includedGrammarScopes`: initialised to an empty array. When a separate grammar is included by this grammar, its scope name is added to this array.
Also bear in mind that some properties were added by the GrammarRegistry and Package classes when formed that way.
When the grammar object (from here: grammar) is constructed, it is not quite ready. At some point, the method tokenizeLines is called on the text of a file. This splits the text by line (\n character) and passes each one to the tokenizeLine method, each time passing in the ruleStack variable from the previous line.
In tokenizeLine, it does the following:
- First, it checks for a long line (as determined by the `maxLineLength` property), cutting it down if needed.
- It then converts the line into an `OnigString`, which is written in C and presumably makes the following steps quicker.
- Now it checks if the `ruleStack` is not `null`. When called by `tokenizeLines`, the first call will have this `null` value.
  - If not `null`: the `ruleStack` is copied (shallow) and the `scopeName` and `contentScopeName` properties of each object it contains are pushed to another array, if they exist.
  - If `null`: it executes the method `getInitialRule`, which sets the grammar's `initialRule` property to a `new Rule`, called with the grammar object (`this`), its `GrammarRegistry`, and the options object `{@scopeName, patterns: @rawPatterns}`.
  - It pulls the `scopeName` and `contentScopeName` properties from `initialRule`, and sets these values as the first object in a `ruleStack` array. The rule stack keeps track of all currently active rules, and will only try to match the latest rule to be added to the stack.
- Further behaviour can be determined by reading the source code directly.
This class handles the patterns that are applied based on scope, when this grammar is the active one.
Properties:
- `grammar`: the grammar class it was called with.
- `injections`: an array of objects, with the properties `selector` and `patterns`. `selector` is a `ScopeSelector` class instance, formed from the key of each property in the grammar file's `injections` value. `patterns` is the same as ever, with at least the top level `include` statements resolved. I'll need to test if nested `include` statements are also found properly.
- `scanners`: an object that is presumably to hold instances of the `Scanner` class.
A Rule is an object with some metadata properties and an array of Patterns. The Rule itself does not match any text. This is for the Patterns, and the special endPattern.
Perhaps the most relevant class to a package author, besides the grammar class. This is the one that holds the regex and other properties you add.
A hidden property: disabled is checked when creating a new Rule
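A sketch of how this might look in a grammar file, assuming a truthy `disabled` causes the pattern to be left out when the `Rule` is built (TextMate documents the key as a way to switch a pattern off; that first-mate behaves the same way is an assumption here). The rules and scope names are hypothetical.

```cson
patterns: [
  {
    # Assumption: a truthy `disabled` means this pattern is skipped
    disabled: true
    match: '\\bTODO\\b'
    name: 'keyword.other.todo.example'
  }
  {
    # No `disabled` key, so this pattern is used as normal
    match: '\\bFIXME\\b'
    name: 'keyword.other.fixme.example'
  }
]
```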
Immediately added properties:
- `grammar`: the grammar class it was called with
- `registry`: the grammar registry class it was called with
- `include`: a reference to a rule stored in the grammar repository
- `popRule`: whether a match with this pattern should remove the rule it is a part of from the rule stack.
- `hasBackReferences`: overrides automatic detection; if `null`, the class will instead check for back references using `/\\\d+/` on the `match`.
Grammar file properties that are processed:
- `name`: this is used to make `@scopeName`, which is then passed to `@grammar.createRule()`
- `contentName`: same as `name`, but the internal property is `contentScopeName`
- `match`: if `end` is a property or `popRule` is `true`, and it has backreferences (either set explicitly or detected by a quick regex), the grammar will set this value to the property `@match`. If not, it is set to the property `@regexSource`
- `begin`: (this is only looked at if `match` doesn't exist). `@regexSource` is set to the `begin` value.