- Atom is transitioning to an entirely new way of defining grammars using tree-sitter. This will be enabled by default quite soon now. It is theoretically faster and more powerful than regex-based grammars (the kind described in this guide), but has a steeper learning curve. My understanding is that regex-based grammars will still be supported, however (at least until version 2), so this guide can still be useful. To enable it yourself, go to Settings -> Core and check Use Tree Sitter Parsers.
Links for tree-sitter help:
- tree-sitter: the main repo
- tree-sitter-cli: converts a JavaScript grammar to the required C/C++ files
- node-tree-sitter: module to use Tree-sitter parsers in NodeJS
- My guide on starting a Tree-sitter grammar
In Atom, syntax highlighting is a two part job: the language package gives a scope to every character in the file, while the user's syntax theme tells the editor which colour each scope should be.
Themes are not the topic of this gist. To learn how to write a theme, I suggest starting at the flight manual.
Instead, this guide will be on how to write a language grammar. Specifically, a TextMate type grammar. It is intended for complete novices, who might have the crazy idea that something like this could be fun and/or easy, and those who want to remind themselves of what they can do. If you're reading this and you notice I've missed something, or I get something wrong, please don't hesitate to leave a comment. The more people sharing their knowledge and experience, the better.
Right now, I don't feel like the guide is finished. Rather, I felt I needed to get what I had written uploaded before something terribly wrong and unpredictable happens to the file I'm writing on.
Here I've compiled a list of sites I used when writing my first language grammar. Some of these may not be intended for beginners, so think of them as a "second" step to look at when you don't get something here, or want to change things up.
- This amazing guide: could not have finished my own package without this. It's worth reading, trust me.
- TextMate Section 12: what the spec for Atom's rules is based on. Uses JSON instead of CSON, but the structure should be the same.
- DamnedScholar's gist: a template with the accepted keys, and a short comment on their function.
- Flight manual grammars entry: The official docs.
- Any of the existing language packages for major languages. Python, JavaScript, HTML, and more.
- regex101: a tool to test regex patterns. You need to convert between regular expressions defined here and ones used in regex101, as there are twice as many backslashes in the grammar rules. Also, the exact regex engine Atom uses is not available. Any of the options should do for most general cases, but there are differences in ability and syntax of the different engines.
- oniguruma: the regex engine Atom uses. Use this to learn the specific syntax available to you.
- first-mate: the package Atom uses to tokenize each line. Not necessary for writing a grammar, but a good technical reference if you want to know what's happening behind the scenes.
You might like a basic understanding of the CSON data format. Knowing about JSON might help too. However, knowledge of either is not required to get started. Hopefully though, as you start to use it more, you will come to understand the formats if you don't already. I use the terms object, array, and string frequently, so you should understand what they are at a conceptual level at least.
A quick summary:
- object: the fundamental data structure in JavaScript and JSON (JavaScript Object Notation). It is a set of key-value pairs, where accessing the object's key returns the corresponding value. In CSON (CoffeeScript Object Notation), objects are represented as follows
key: 'value'
name: 'your name'
age: 8
pets: [ # an array of pets
  'cat'
  'dog'
  'bird'
]
nestedObject:
  nestedKey: 'nestedValue'
  otherKey: 'more data'
- array: seen in the above example, an array is an ordered list of values. They are denoted by square brackets, and must be comma separated if the values are on the same line. Objects in an array must be separated by using {} brackets, as will be seen later on.
- string: represents a set of characters. Denoted by quotation marks (single or double) surrounding some text. Most, if not all, end values will be strings (end, as in when the value is not itself an object or array).
Never heard of regular expressions? Me neither. Turns out, they're pretty useful. And essential to writing the grammar rules. (They can also be used with Atom's finder if the Use Regex button is active.)
I'll give out a quick rundown here, but you really need to use the provided links to better familiarise yourself with what they are and how to write and test them.
- https://www.regular-expressions.info/quickstart.html
- https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
- https://www.icewarp.com/support/online_help/203030104.htm
- https://regex101.com/ (use this to test them)
First, the concept: A regular expression (regex) is a group of characters that represents a "pattern" of text. It can be used to search a larger body of text for matches, and (when programming) each match can be passed to functions and handled as desired. In our case, we use regex to search for matches that are then passed to Atom's internals, to be tokenized and processed for the syntax theme to apply colours to.
A basic regex (using JavaScript syntax) might look like the following:
/hello/
Later on, we'll see that we actually use strings to define ours, so it'll look more like
"hello"
For now though, let's examine what search patterns this rule matches.
A general rule of thumb is that all letters are exact matches. Therefore, our above rule will find all instances of the letters h, e, l, l, o appearing consecutively in a body of text.
Here's a question: where are the matches in the following body of text?
Hello to you, Othello, and hello to you too, Iago!
Note that (by default) regex are case sensitive (so no match for Hello) and do not respect word boundaries (so a match in Othello).
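If you want to check your answer, here is a minimal sketch in JavaScript. It uses the same regex flavour as the /hello/ example above; Atom itself uses a different engine, so treat it as an illustration only. The variable names are my own.

```javascript
const text = 'Hello to you, Othello, and hello to you too, Iago!';

// Default behaviour: case sensitive, no word boundaries.
const plain = [...text.matchAll(/hello/g)].map((m) => m.index);

// Adding the i flag and \b word boundaries changes which parts match.
const strict = [...text.matchAll(/\bhello\b/gi)].map((m) => m.index);

console.log(plain);  // [ 16, 27 ] -> inside "Othello", and the lone "hello"
console.log(strict); // [ 0, 27 ]  -> "Hello" and "hello", but not "Othello"
```

The numbers are the positions in the text where each match starts.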
Now, for what makes regex so useful: special characters. There are many of these in regex. A few are as follows, but a proper regex guide should be used to learn them.
- . (a full stop) matches any character
- * (a star) matches any number of the preceding token, including none
- ? (a question mark) matches zero or one of the preceding token
- \ (backslash) changes the behaviour of the following character. Used with punctuation, it will form a literal punctuation mark. Used with a letter, it will normally give that letter a special meaning (e.g., \w matches a word character).
Using these special characters, more advanced search patterns can be created. For example:
/((\\)(?:\w*[rR]ef\*?))(\{.*?\})/
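To see what a pattern like that actually captures, it can help to run it. Below is a small sketch in JavaScript. The sample strings are inputs I made up for illustration, and JavaScript's regex engine is standing in for the one Atom uses.

```javascript
const pattern = /((\\)(?:\w*[rR]ef\*?))(\{.*?\})/;

// In a JavaScript string, '\\' is one literal backslash,
// so this is the text: see \cref{fig:one} for details
const first = pattern.exec('see \\cref{fig:one} for details');
console.log(first[1]); // \cref      (the whole command)
console.log(first[2]); // \          (just the backslash)
console.log(first[3]); // {fig:one}  (the braces and their contents)

// The lazy .*? stops at the first closing brace it can.
const second = pattern.exec('\\Ref*{a}{b}');
console.log(second[1]); // \Ref*
console.log(second[3]); // {a}
```

Group 0 (not printed here) is always the entire match; the numbered groups follow the order of their opening brackets, and the (?: ... ) group is skipped because it is non-capturing.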
Hopefully, you're now comfortable with reading and writing regular expressions. If not, don't worry too much. You can always go to regex101 and test something you don't understand.
If you completely don't understand regular expressions, or how they are useful, this will be a major hurdle. It is not a stretch to say that regular expressions are the backbone of a language grammar.
You can mostly just follow the flight manual's creating a grammar section for this. The rest of the tutorial will be for creating a grammar for the (fictional) example language.
- Note: following the Atom guide will link the package to the dev package directory. This means your package will only be loaded when in development mode. If you wish to make it active in a normal window, navigate to the package directory in the command line and run the command apm link
You should have a package folder, which contains the following directory structure (but with example replaced by your language's name):
language-example
|-- grammars
|   `-- example.cson
`-- package.json
And inside package.json:
{
  "name": "language-example",
  "version": "0.0.0",
  "description": "An example language grammar package",
  "repository": "https://github.com/user/package-name",
  "keywords": [
    "syntax",
    "highlighting",
    "grammar"
  ],
  "license": "MIT",
  "bugs": "https://github.com/user/package-name/issues",
  "engines": {
    "atom": ">=1.0.0 <2.0.0"
  }
}
example.cson should be blank at this point.
Are you writing this for a popular language that already has a grammar package? If so, it is likely there will be several other packages that rely on the scopes provided by the language package (spell check & autocomplete, to name a couple). These packages use the scopes for contextual information, allowing them to be smarter and more "aware" of the language. If you decide to use a nonstandard set of scopes, you risk breaking compatibility with these other packages. When deciding on new scope names, it is better to use the preexisting ones in an established grammar package rather than coming up with your own.
Additionally, these packages rely on the grammar package being active to hook their own activation. This means that you will need to sort out the package activation hooks on a case by case basis.
There are several similar terms to describe aspects of the grammar package.
This section walks you through setting up a basic grammar, with minimal rules. For more advanced features and rules, see the next section.
The top of your example.cson file should have the following entries
scopeName: 'source.example'
name: 'Example'
fileTypes: [ 'exp' ]
limitLineLength: false
- scopeName: this key determines the root scope for all characters in a document using this grammar. The convention is to use source.<language_identifier>, where the language identifier is a unique, short word. For example, the core packages use source.python and source.js for Python and JavaScript. However, there exists an additional convention where text based languages get the root scope text.<...>. This means HTML gets scoped to text.html.basic, and LaTeX (currently) to text.tex.latex. When in doubt, just use source.<...>.
- name: this is the entry that will appear in the language selection menu. It is purely aesthetic, but should simply be the language's name.
- fileTypes: an array of file extensions that are used to determine if a given file should use this grammar. This lets Atom automatically select the correct grammar when the user opens a file.
- limitLineLength: a Boolean value to tell the tokenizer whether or not to "give up" on long lines. If true, the tokenizer will only look at a maximum number of characters per line, and completely ignore the rest. This can lead to incorrect pattern matching, especially in text-like languages where paragraphs are present. Setting it to false effectively forces the tokenizer to look at the whole line, and apply the rules to everything.
There are more available properties, but they will be introduced in the intermediate section. For now, these properties will be sufficient.
Below the above entries, make a new key called patterns. Its value is an array of objects, which will each hold the information for a search pattern.
patterns: [
  {
    # rule #1
  }
  {
    # rule #2
  }
  {
    # rule #3
  }
  # etc.
]
Now, we'll look at making a specific rule.
The basic outline for a single line matching rule is as follows:
{
  comment: 'Use this to explain the function of the rule, if necessary'
  name: 'comment.line.example'
  match: '#.*$'
}
Some things to note:
- The scope name should follow one of the ones given in the TextMate manual. This is to maximise the chances that a syntax theme will have a corresponding rule to colour that scope. The final part of the scope should be the language name (the one set in scopeName at the top).
- The match key holds the regex that defines the search pattern. It is a string, which means all backslashes must be escaped with another backslash. Therefore, to match a literal \, the normal regex for which is \\, one must use \\\\ (sigh... did I mention my first package was a grammar one for LaTeX?). Also important to note is that a match will only work on one line. Even if you have a \\n inside the regex, it will not work.
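The backslash doubling is easier to believe once you see it happen. Here is a rough sketch in JavaScript, assuming CSON string escapes behave like JavaScript's for this purpose (I believe they do for backslashes, but treat that as an assumption). new RegExp plays the part of Atom's regex engine, and the sample text is made up.

```javascript
// Step 1: the string parser eats one level of backslashes.
const fromGrammar = '\\\\';              // four characters typed, two stored
console.log(fromGrammar.length);         // 2

// Step 2: the regex engine reads what is left (\\) as one literal backslash.
const pattern = new RegExp(fromGrammar);
const found = pattern.exec('a\\b');      // the three characters: a \ b
console.log(found.index);                // 1

// The same idea applied to a longer pattern written as a string.
const begin = new RegExp('((\\\\)section)(\\{)');
const groups = begin.exec('\\section{ title }');
console.log(groups[1]); // \section
console.log(groups[2]); // \
console.log(groups[3]); // {
```

So every backslash the regex needs costs two in the file, and a regex that needs two (to match one literal backslash) costs four.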
If you're following along (which may be a good idea) you can start playing with this rule. Below, I'll give the current full contents of my example.cson file. I will not do this much, as you should be able to insert and maintain a list of rules yourself now.
# grammars/example.cson
scopeName: 'source.example'
name: 'Example'
fileTypes: [ 'exp' ]
limitLineLength: false
patterns: [
  {
    comment: 'Use this to explain the function of the rule, if necessary'
    name: 'comment.line.example'
    match: '#.*$'
  }
]
Open a new file, set the grammar to your new one, and paste in the following. You should see that the # and all subsequent characters are scoped as a comment.
Normal text # comment
Now, onto a more complicated match. Try the following rule:
{
  match: '(\\*)(.*?)(\\*)'
  captures:
    0:
      name: 'meta.bold.example'
    1:
      name: 'punctuation.definition.bold.example'
    2:
      name: 'markup.bold.example'
    3:
      name: 'punctuation.definition.bold.example'
}
This introduces the captures key; its value is an object with keys corresponding to the capture groups of the match regex. Each of these keys then also has an object value, with the key name (which is like the name key in the rule above). What this does is allow different scopes to be applied to different parts of the same match. For this rule, it is applying meta.bold.example to everything (capture 0), but additionally applying punctuation.definition.bold.example to the * delimiters (captures 1 & 3) and markup.bold.example to the (arbitrary) contents of the second capture group. Note that captures: 0: is equivalent to using the name key in this case.
- If you don't know what I mean by capture group, remember the section on regex? Where I told you to learn regular expressions? I wasn't kidding.
Before I continue, I'm going to show the "condensed" form of the same rule. I prefer it, as it wastes fewer lines on useless things like capture group numbers. A more detailed explanation is given in intermediate tips.
{
  match: '(\\*)(.*?)(\\*)'
  captures:
    0: name: 'meta.bold.example'
    1: name: 'punctuation.definition.bold.example'
    2: name: 'markup.bold.example'
    3: name: 'punctuation.definition.bold.example'
}
By now, you should be able to make some basic rules for your grammar. But what if you need to match across several lines? You want the begin and end keys.
{
  name: 'meta.section.example'
  contentName: 'markup.other.section.example'
  begin: '((\\\\)section)(\\{)'
  beginCaptures:
    1: name: 'support.function.section.example'
    2: name: 'punctuation.definition.function.example'
    3: name: 'punctuation.definition.begin.example'
  end: '\\}'
  endCaptures:
    0: name: 'punctuation.definition.end.example'
}
Some new keys:
- name: as with a match rule, the name key applies to the entire match, including the text captured by the begin and end patterns in this case.
- contentName: applies the scope to the text between, but not including, the begin and end captures.
- begin: the pattern that defines when the rule begins.
- beginCaptures: much like captures in a match rule, but only applies to the text captured by begin.
- end: the pattern that defines when the rule ends.
- endCaptures: like beginCaptures, but for the end text.
When you try this one, you might notice a distinct lack of colour. Maybe the \section part is coloured, but nothing else is (using one dark theme at least). Lining up the cursor with a spot you want to check, running the command Editor: Log Cursor Scope will show the scopes have indeed been applied. This demonstrates the divide between grammar and theme perfectly; the scopes have all been applied, but they are not coloured because the theme ignores them. Bear in mind that scopes are not solely for themes though, and some themes may use these seemingly useless scopes. As the grammar author, it's your job to provide as much information as possible about the file, by scoping accurately.
Another thing you might have noticed is that our other rules don't work inside of the section rule (and if you were experimenting, you'd have found they don't work inside of the bold match rule either). Basically, everything from the first to last character captured by a given rule is independent from the other rules in the main patterns array. To apply rules to the captured text, we need to make a patterns array inside the current rule. This patterns array behaves much like the outside one, except the rules it contains are only applied to the text between the begin and end captures of the rule it's in.
{
  name: 'meta.section.example'
  contentName: 'markup.other.section.example'
  begin: '((\\\\)section)(\\{)'
  beginCaptures:
    1: name: 'support.function.section.example'
    2: name: 'punctuation.definition.function.example'
    3: name: 'punctuation.definition.begin.example'
  end: '\\}'
  endCaptures:
    0: name: 'punctuation.definition.end.example'
  patterns: [{
    name: 'comment.line.example'
    match: '#.*$'
  }]
}
An important behaviour to observe now: what happens if one of these inside pattern rules is not finished when the end pattern could be matched? Try the following to find out.
\section{ this is a section # }
is this still a section? }
How about now?
Here's a step by step overview of what happened:
- The text \section{ is matched as the beginning of the rule.
- The tokenizer started looking for the end pattern, or any matches to the rules in the local patterns array.
- The tokenizer matched the # } at the end of the first line with the comment rule in patterns, effectively hiding the first }.
- The tokenizer continued looking for patterns rule matches or end matches when the comment rule ended (the end of the line in this case).
- It sees the } on the second line and matches it with the end pattern.
- The rule is finished, so the final line has no special scopes.
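The steps above can be mimicked with a toy model. To be clear, this is not how first-mate is implemented; it is a sketch I wrote in JavaScript that only reproduces the order of checks described in the list, assuming that whichever of the end pattern or the child patterns matches earliest on the line wins. The function name tokenizeLines and the log format are made up.

```javascript
// Toy stand-in for the section rule and its nested comment rule.
const rule = {
  begin: /\\section\{/,
  end: /\}/,
  patterns: [{ name: 'comment.line.example', match: /#.*$/ }],
};

function tokenizeLines(lines, rule) {
  let inside = false; // the only state carried from one line to the next
  const log = [];
  lines.forEach((line, index) => {
    let pos = 0;
    while (pos < line.length) {
      const rest = line.slice(pos);
      if (!inside) {
        const begin = rule.begin.exec(rest);
        if (!begin) break;
        log.push(`line ${index + 1}: begin`);
        inside = true;
        pos += begin.index + begin[0].length;
        continue;
      }
      // Inside the rule: try `end` and every child pattern, earliest match wins.
      const candidates = [{ name: 'end', m: rule.end.exec(rest) }]
        .concat(rule.patterns.map((p) => ({ name: p.name, m: p.match.exec(rest) })))
        .filter((c) => c.m !== null)
        .sort((a, b) => a.m.index - b.m.index);
      if (candidates.length === 0) break;
      const winner = candidates[0];
      log.push(`line ${index + 1}: ${winner.name}`);
      if (winner.name === 'end') inside = false;
      pos += winner.m.index + winner.m[0].length;
    }
  });
  return log;
}

const log = tokenizeLines([
  '\\section{ this is a section # }',
  'is this still a section? }',
  'How about now?',
], rule);
console.log(log);
// [ 'line 1: begin', 'line 1: comment.line.example', 'line 2: end' ]
```

The comment match swallows the first } before the end pattern ever gets a chance at it, which is exactly the "hiding" effect described above.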
But what if we actually wanted the rules in the main patterns array to be active inside a begin/end rule? For this, there is the include key. It takes the name of a rule defined in the repository (explained later), and pretends that rule was actually there. In this case, where we want it to match the main patterns array, we would use one of two values:
- $self: (note this is not a regex) this value refers to the current grammar. That is, the context it's used in will have the rules in the main patterns array applied to it.
- $base: similar to $self, but with some differences when embedded in another grammar. Not important right now, but just remember that $base is not the same as $self when your grammar is embedded in another. $self points to the grammar $self appears in (points to itself), whereas $base points to the base language of the file, which could be anything. If you don't know what I mean by embedded, don't use $base.
Right now, your example.cson file should look something like this:
scopeName: 'source.example'
name: 'Example'
fileTypes: [ 'exp' ]
limitLineLength: false
patterns: [
  {
    comment: 'Use this to explain the function of the rule, if necessary'
    name: 'comment.line.example'
    match: '#.*$'
  }
  {
    match: '(\\*)(.*?)(\\*)'
    captures:
      0: name: 'meta.bold.example'
      1: name: 'punctuation.definition.bold.example'
      2: name: 'markup.bold.example'
      3: name: 'punctuation.definition.bold.example'
  }
  {
    name: 'meta.section.example'
    contentName: 'markup.other.section.example'
    begin: '((\\\\)section)(\\{)'
    beginCaptures:
      1: name: 'support.function.section.example'
      2: name: 'punctuation.definition.function.example'
      3: name: 'punctuation.definition.begin.example'
    end: '\\}'
    endCaptures:
      0: name: 'punctuation.definition.end.example'
    patterns: [{ include: '$self' }]
  }
]
Try it out on the following text:
Normal text # comment
* bold # text * <- not commented
\section{
text
# comment
* bo-#-ld * <- still not commented
text
}
text
Can you see what needs to be done to get comments working in a bold match? Did it work? Why not?
Remember, the match pattern will only ever work on a single line (the tokenizer only looks at one line at a time; it literally doesn't see anything else). To get comments working in the bold rule, and get the bold rule to work across multiple lines, it needs to be converted to a begin/end rule as follows:
{
  name: 'meta.bold.example'
  contentName: 'markup.bold.example'
  begin: '\\*'
  beginCaptures:
    0: name: 'punctuation.definition.bold.example'
  end: '\\*'
  endCaptures:
    0: name: 'punctuation.definition.bold.example'
  patterns: [{ include: '$self' }]
}
And so concludes the beginners section of the guide. With the tools above, you should be able to produce a grammar of reasonable complexity. What follows are some tips for intermediate authors, for additional features and best practices.
A feature mentioned above, but not explained, is the repository. For a grammar of any reasonable size, the repository is vital to help organise your rules.
To make it, add the repository key after the main patterns array. Its value is an object, so do not add brackets after it. For example:
scopeName: 'source.example'
name: 'Example'
fileTypes: [ 'exp' ]
limitLineLength: false
patterns: [{ include: '#lineComment' }]
repository:
  lineComment: {
    comment: 'This is a rule object, with the same abilities as any other'
    name: 'comment.line.example'
    match: '#.*$'
  }
  secondRule: {
    ...
  }
  thirdRule: {
    ...
  }
In the above example, a rule with the name lineComment has been added to the repository. Note that rules in the repository are not automatically applied. They must be include'd inside the main patterns array, or into another rule's child patterns array. To properly refer to this rule, the include key must have the value '#lineComment' as it does in the example. A rule may also include itself, in which case it will recursively apply itself as much as possible. This recursion also occurs with $self, but $self refers to the entire grammar, not just that specific rule.
I'm of the opinion that all rules should be added to the repository. Then, they can be activated as desired by including them in the patterns array. I also like to group sets of rules into "meta" patterns, that are made up almost entirely of other include'd rules. This allows you to form customised sets of rules that can be applied consistently, without repetition. For an example of this, see my end result from when I tried writing one. It's not perfect, and I could probably do with following my own advice in some parts. Overall though, I'm reasonably happy with it. You'll notice I use comments to help organise the repository; this is another thing you should do to help yourself and others when trying to understand what you've done.
Another feature alluded to above, when talking about $self vs $base, is embedding. Basically, other grammars can embed your grammar into their rule set, and vice versa. When embedding another grammar, you need to use include: 'source.language' (where source.language is the root scope of the target language; watch out for the text. versions). For example
{
  begin: '```'
  end: '```'
  patterns: [{ include: 'source.js' }]
}
will scope the contents of a three back-ticks pair to JavaScript.
$base is important for embedded grammars, as it points to the file's root grammar. This means that if your grammar is embedded into another, e.g., a markdown grammar, $base will point to the markdown grammar, not yours. Sometimes this behaviour is desirable, and is used extensively in the C family of grammars.
One thing to be wary of is leakage: this occurs when a scope from the embedded grammar has not been closed, and it prevents your rule from seeing the end pattern. This is highly likely when the user will only be writing a portion of a code snippet, where there might be an opening brace but no closing one.
This can be seen using the rule above. In a file, leave an unmatched { in the JavaScript section. Now, instead of picking up the back-ticks as matching the end pattern, it will instead be interpreted as a JavaScript string. From there, all scoping that follows will likely be broken.
Currently, there is no solution to this problem (that I am aware of). Hopefully a key will be added that makes the end pattern more important, so it will be checked first before all others.
As a grammar author, you need to consider this from the other side too, i.e., thinking about others embedding your grammar. For every begin rule you add, it should have an end. Sometimes, less is more. If a match rule works just as well, use the match.
The condensed form introduced at the beginning is a good thing to use, but here I'll go through some style tips I developed after trying to read some grammars (both my own and others) myself.
- Keep the patterns array clean. If there are a lot of rules building up in one, consider moving them to the repository and include'ing them. This keeps the active rule set clutter free, while succinctly expressing the function or intention of each rule.
- When there is only one entry in an object or array, keep it all on one line. For example, compare below. While the first may look more spaced out and easier to read, the second really helps when scrolling through a long list of rules. It's effectively halved the number of lines, while representing the exact same rule.
# Spread out
{
  begin: '((\\\\)texttt)\\s*(\\{)'
  beginCaptures:
    1:
      name: 'support.function.texttt.latex'
    2:
      name: 'punctuation.definition.function.latex'
    3:
      name: 'punctuation.definition.arguments.begin.latex'
  end: '\\}'
  endCaptures:
    0:
      name: 'punctuation.definition.arguments.end.latex'
  contentName: 'markup.raw.texttt.latex'
  patterns: [
    {
      include: '$self'
    }
  ]
}
# Condensed
{
  begin: '((\\\\)texttt)\\s*(\\{)'
  beginCaptures:
    1: name: 'support.function.texttt.latex'
    2: name: 'punctuation.definition.function.latex'
    3: name: 'punctuation.definition.arguments.begin.latex'
  end: '\\}'
  endCaptures:
    0: name: 'punctuation.definition.arguments.end.latex'
  contentName: 'markup.raw.texttt.latex'
  patterns: [{ include: '$self' }]
}
- When there are multiple entries, break them across new lines. The spacing in this case (of the curly brackets column) helps with recognising when a group is together, and when it is separate. For example
# With multiple entries...
patterns: [ # multiple pattern entries, so new line
  { # multiple entries, so it gets a new line
    comment: 'Handles all types of comments'
    include: '#commentMeta'
  }
  { include: '#escapedCurlyBracket' } # single entry, so no newline
  { include: '#metaOpenBrace' } # another single entry
]
# With one entry...
patterns: [{ # only one entry, so no new line between the [ and the {
  match: 'blah' # the rule object has multiple entries, so a new line is required after the {
  name: 'blah'
  ...
}]
- The braces are optional when the array has one entry. I like them, so I use them. It makes it easier to add additional rules though, so think carefully before omitting them.
- Be consistent. Worse than using any one style is using an inconsistent mixture, and making the reader think about the format of what they are reading.
- Don't use quotation marks for the key names. I haven't used them in this guide, but you will likely see some packages that do. To the best of my knowledge, these quotation marks do not contribute anything. In fact, they actively detract from comprehension because syntax highlighting themes will make everything the same colour. By contrast, with the unquoted form using language-coffee-script and the one-dark theme, I see key names as red, numbers as orange, and strings as green.
In addition to the name, patterns, repository, etc. properties, there are some others that are recognised:
- firstLineMatch: a regex string that assists Atom's automatic language selector. The selector generates a score for each language based on the file's extension and contents (so it was also using fileTypes behind the scenes). To
It is possible to include specific rules from another grammar's repository: simply use the syntax
{ include: 'source.example#ruleName' }
When the # character is not first, the part before it is taken as the grammar scope name. The part after is then read as the repository name, like with internal include statements.
The actual function determining this behaviour is here.
Here I'll talk about valid scope names, and some good practices.
- Note: Dynamic scope names do not play nicely with some packages that depend on reading the scope. For example, linter-spell will only accept a list of absolute scopes to blacklist, making it incompatible with this style of scoping. Caching of values based on scope (e.g., for autocomplete) will also be negatively affected by dynamic scoping. This guide presents them in a "this is possible" sense, rather than "you should do this".
One feature I haven't mentioned at all yet is using the capture groups in scope names and other parts of the regex pattern. This is possible using $n and \\n (not newline!) notation, where n is the capture group number. For example, using $n in scope names
{
  name: 'support.function.section.latex'
  begin: '((\\\\)(section|paragraph|part|chapter)(\\*)?)(?=[^a-zA-Z@])'
  beginCaptures:
    #                             v
    1: name: 'entity.name.section.$3.latex'
    2: name: 'punctuation.definition.function.latex'
  end: '\\}'
  endCaptures:
    0: name: 'punctuation.definition.end.latex'
  patterns: [{ include: '$self' }]
}
And using \\n in a regex match
{
  name: 'string.function.verbatim'
  #                                     v
  match: '\\\\verb([^a-zA-Z])(.*?)(?:(\\1)|$)'
  captures:
    0: name: 'support.function.verbatim.latex'
    1: name: 'punctuation.latex'
    2: name: 'markup.raw.verbatim.latex'
    3: name: 'punctuation.latex'
}
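If the \\1 part is hard to picture, here is a quick sketch of that same match pattern run in JavaScript. JavaScript is standing in for Atom's regex engine here, and the two input strings are ones I invented. The point is that whatever character the first capture group grabbed is the only thing that can close the match.

```javascript
// Same pattern as the match key above, written as a string so the
// backslash doubling is identical to the grammar file.
const verb = new RegExp('\\\\verb([^a-zA-Z])(.*?)(?:(\\1)|$)');

// Delimiter is |, so the match runs until the next |
const closed = verb.exec('\\verb|x+y| and more');
console.log(closed[0]); // \verb|x+y|
console.log(closed[1]); // |
console.log(closed[2]); // x+y
console.log(closed[3]); // |

// Delimiter is !, and it never appears again, so the $ branch ends the match
const open = verb.exec('\\verb!never closed');
console.log(open[2]); // never closed
console.log(open[3]); // undefined (the third group did not take part)
```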
Some rules I've observed when experimenting with backreferencing:
- Attempting to use \\n in a scope name results in the error invalid backref number/name thrown by first-mate. Only the $n can be used here.
- The opposite is also impossible; $ is an active character in regex, so $n will never match with the single line matching we are restricted to. Only \\n can be used here.
- If the $n capture group does not exist, it becomes a normal scope name. E.g., the scope would become literally support.function.$50.example, and not a reference to the 50th capture group. An empty match still counts as a match, and if this happens the scope would become support.function..example (note the double .).
- name and contentName can only use the capture groups in the begin regex. Attempting to use higher numbers does not result in overflowing to the end capture groups.
- beginCaptures and endCaptures will only use the capture groups in begin and end regular expressions respectively. There is no way to use a value captured in a begin group as a scope name in an end scope, and vice versa.
- \\n only refers to capture groups in the begin regex. It can be used in the end regex, but will not refer to the end capture groups. Nor will it overflow to start meaning end capture groups if the number is higher than the number of capture groups in the begin regex.
- match behaves as if it were a begin key, for the purposes of this numbering.
- \\n only works for up to the number of capture groups there are. If there are fewer than nine capture groups, and \\9 is used, an error will be thrown. For numbers higher than 9, no errors are emitted, but that rule will not work.
- oniguruma (the regex engine Atom uses) provides alternative syntax for \\n matches: \\k<n>, where n is any integer (e.g., use \\k<2> for the second capture group). For more on the syntax, see the oniguruma docs. This verbose syntax doesn't work in scope names either.
- \\0 refers to the entire begin match.
- Scopes probably shouldn't have punctuation in the sections, so make sure you don't just put arbitrary text in. E.g., use ([a-zA-Z\\d]*) as opposed to (.*). I ran into some bizarre errors when I had punctuation in the scope names, but they are difficult to reproduce (and I've forgotten the original cause).
If you have anything to add to this list, please leave a comment. I want to make this an exhaustive list of what's possible and impossible with backreferencing.
Yes, it's possible. Try the following.
{
  contentName: 'keyword.example'
  begin: '\\-\\s*(.*?)\\s*\\-'
  beginCaptures:
    1:
      name: 'markup.heading.example'
      patterns: [{
        name: 'constant.character.example'
        match: 'b(.*?)b'
        captures:
          1:
            name: 'markup.italic.example'
            patterns: [{
              match: 'c'
              name: 'support.function.example'
            }]
      }]
  end: '-'
}
Try it with this! Check the scopes too with Editor: Log Cursor Scope
1. - a b c d b c - hello -
2. - b c d b c - hello -
3. - a c d b c - hello -
4. - a b d b c - hello -
5. - a b c b c - hello -
6. - a b c d c - hello -
7. - a b c d b - hello -
Notice that the main rule will always match if the begin regex works. It is also immune to the patterns applied to the capture groups; any matches made this way are isolated to the captured group and will not leak into the rest of the main rule. Not even if another begin pattern is used in the capture group (which is allowed).
For this particular rule, the first capture group is given the scope markup.heading.example. This is done with name, which is how the capture group has always been scoped in previous examples. What's new is the patterns array that is also in the capture group object. Its entry, a single rule (multiple are allowed; it's just like any other patterns array), attempts to find a pair of b characters. If it succeeds, it will then attempt to match a c character between the b's.
Applying this to the example text, we get:
1. A complete match:
   - a b c d b c is scoped as a heading
   - b c d b satisfies the 1st capture group pattern, and is further scoped as constant.character.example
   - The first c satisfies the 1st capture group pattern of this sub pattern, and is further scoped as support.function.example. The second c is not between the b's, so it is ignored.
2. Another complete match. The initial a was never a required part.
3. Only matches the initial begin pattern, as the match 'b(.*?)b' is not satisfied. Because of this, the c pattern is never even looked at.
4. The b pattern is matched, so the c pattern is looked for, but there are no c's within the b's, so it is ultimately ignored.
5. Another complete match, similar to the first and second. The missing d was not a required part of any pattern.
6. Similar to the third, as the b pattern cannot be completed (there is no second b).
7. Similar to the second, except it's the second c that is missing and not the a.
You should experiment yourself with nesting rules to see what happens. If you observe any quirky or unexpected effects not mentioned here, please leave a comment explaining how to reproduce and I'll add an explanation to this section.
Introducing a new property: injections. This one sits at the root level of your grammar file, much like scopeName and patterns. Its value is an object, whose keys are scope selectors. The value of each key is another object, whose sole property is a patterns array. This patterns array has the same form as any other shown in this guide, and can be considered functionally the same. All other properties are ignored.
The purpose of injections is to provide patterns based on scope, rather than by using include or nesting them. I found the best way to explain them was by example: consider the PHP grammar provided by language-php. It actually provides two grammar files: one for PHP syntax (php.cson), and a wrapper one for HTML syntax (html.cson). The pure PHP grammar does not get applied to any files automatically, as it lacks the fileTypes and firstLineMatch keys. Instead, the provided HTML grammar is applied to various PHP related files. This grammar sets the root scope name to text.html.php, and provides two rules: a new comment, and the entire text.html.basic grammar (provided by language-html by default).
What is special, though, is the injections property it contains. Whenever the scope contains text.html.php (which is the root scope), and none of the scopes start with meta.embedded or meta.tag, it will attempt to match the given patterns, in this case the php-tag rule in the repository. Only if these rules match will the pure PHP grammar be inserted via an include statement.
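The condition described above can be modelled roughly as follows. This JavaScript sketch only mirrors the wording of the paragraph (the scope list contains text.html.php, and nothing in it starts with meta.embedded or meta.tag). It is not how first-mate's ScopeSelector is implemented, and the function names startsWithScope and injectionApplies are made up for illustration.

```javascript
// Rough model of "text.html.php, excluding meta.embedded and meta.tag".
// `scopes` is the list of scope names active at a position, outermost first.
function startsWithScope(scope, prefix) {
  // A scope "starts with" a prefix only on whole dot-separated segments.
  return scope === prefix || scope.startsWith(prefix + '.');
}

function injectionApplies(scopes) {
  const inPhpHtml = scopes.some((s) => startsWithScope(s, 'text.html.php'));
  const excluded = scopes.some(
    (s) => startsWithScope(s, 'meta.embedded') || startsWithScope(s, 'meta.tag')
  );
  return inPhpHtml && !excluded;
}

console.log(injectionApplies(['text.html.php']));                             // true
console.log(injectionApplies(['text.html.php', 'meta.tag.inline.any.html'])); // false
```

The second scope list is an invented example; check the real scopes at your cursor with Editor: Log Cursor Scope.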
Some technical notes:
- Only the active grammar's injections are applied in a file. Injections in other grammars are not considered.
- Matches to injected patterns will be looked for last, after the active grammar and any injectionSelector added grammars. This can be influenced with scope prefixes though.
- IMPORTANT: The injectionSelector causes a bug where any grammar with one will not automatically apply itself to an opened file. For a workaround, use the following style in an independent CSON file:
injectionSelector: 'source.embedded.latex' # when this scope is present in another grammar, inject this grammar
scopeName: 'source.embedded.latex'
patterns: [{ include: 'text.tex.latex' }]
- I will follow up on this after some testing, but I believe injections in a grammar that has been inserted via an injectionSelector should work.
To be added
To be added. For now, see my forum question.
This section looks at how the cson file is converted into a grammar by first-mate, and how the grammar is used to apply scopes to each character in the file. It will, however, focus more on what you as a package author can do, rather than the exact steps the first-mate package takes to apply scopes to everything.
To see the source code in its final form, you need to extract .atom/.apm/first-mate/<version>/package.tgz. This has been transpiled from CoffeeScript to JavaScript, and built using Grunt, so the files and directory structure will be different to the online source code. When writing this, I found it easiest to read the CoffeeScript version, while keeping in mind that the actual paths are those in the transpiled version.
- Note that JavaScript doesn't have classes, but I'll call them that because the objects are all defined using the class syntax.
A short summary of every recognised property of every construct that you, as a package author, have access to.
In the root level of the grammar file:
name
fileTypes
scopeName
foldingStopMarker (not used)
maxTokensPerLine
maxLineLength
limitLineLength
injections
injectionSelector
patterns
repository
firstLineMatch
In a normal patterns array object:
name
contentName
match
begin
end
patterns
captures
beginCaptures
endCaptures
applyEndPatternsLast
include
popRule
hasBackReferences
disabled
First of all, require("first-mate") provides three classes:
ScopeSelector
GrammarRegistry
Grammar
Their behaviour is described in the following sections, as well as that of the other classes they depend on.
I don't know much about this one. It appears to be responsible for the scopes, and the specs show it supports some interesting syntax. Much of it is generated from a PEG.js file though, so it will be difficult to understand without additional knowledge.
If someone can explain prefixes and injections, that would be appreciated.
Atom automatically creates and populates a GrammarRegistry instance. This instance is available as atom.grammars, so definitely look at it in dev tools. For a specific grammar (next section) use atom.grammars.getGrammarForScopeName("<scope_name>").
Its job is to hold a group of grammars together, and provide helper functions. The steps taken to add grammars from packages are roughly as follows:
- At some time or another, the method loadGrammars of the Package class is run for each package. This looks inside the package for a grammars folder, and if present it will attempt to find any .cson or .json files in there.
- The method readGrammar uses the season package to read in the .cson and .json files in the grammars directory. Note that this means you could write a grammar in JSON format. It throws an error if the object is invalid, and then checks that the scopeName property is a non-empty string (throwing an error if it isn't).
- It calls the method createGrammar, with the arguments of the file path and the newly formed object. This method sets the maxTokensPerLine and maxLineLength properties of the object (if not already present; if they are, the existing values are used instead). Additionally, it will then check if the object has the limitLineLength property. If false, it will set maxLineLength to Infinity, regardless of earlier steps.
- It makes a new Grammar with the arguments of the global registry itself (this) and the (slightly modified) object. The section on grammar object creation is below.
- When the grammar object returns, the method createGrammar continues and sets the property path to the original file path, returning the grammar object.
- When this is returned, the loadGrammars method of the Package class continues. It sets the property packageName of the grammar object to the package's name, and the property bundledPackage to the package's bundledPackage property value. Finally, it pushes the grammar object to the array of grammars provided by that package. It also runs the function grammar.activate(), which pushes the grammar to the global registry.
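The line-limit handling in the createGrammar step can be sketched as below. This is a minimal JavaScript model of the behaviour described in the list, not the real method. The function name applyLineLimits and the default values passed in are placeholders chosen for illustration.

```javascript
// Hypothetical helper modelling how createGrammar fills in line limits.
// `defaults` stands in for whatever values the registry supplies.
function applyLineLimits(rawGrammar, defaults) {
  const grammar = Object.assign({}, rawGrammar);
  // Only fill in limits the grammar file did not set itself.
  if (grammar.maxTokensPerLine == null) {
    grammar.maxTokensPerLine = defaults.maxTokensPerLine;
  }
  if (grammar.maxLineLength == null) {
    grammar.maxLineLength = defaults.maxLineLength;
  }
  // limitLineLength: false wins over everything set above.
  if (grammar.limitLineLength === false) {
    grammar.maxLineLength = Infinity;
  }
  return grammar;
}

// Placeholder defaults, used only for this example.
const placeholderDefaults = { maxTokensPerLine: 100, maxLineLength: 1000 };
console.log(applyLineLimits({ scopeName: 'source.example' }, placeholderDefaults));
console.log(applyLineLimits(
  { scopeName: 'source.example', maxLineLength: 50, limitLineLength: false },
  placeholderDefaults
));
```

The second call shows why limitLineLength: false is the way to remove the limit from a grammar file: the explicit maxLineLength of 50 is replaced by Infinity.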
This is where a single set of rules for a given language is defined. In the walk through above, it is called with the global grammar registry (which was passed in via the Package class) and a slightly modified version of the CSON file.
Immediately, the GrammarRegistry it is called with is added as the property registry to the grammar object. Additionally, some select properties are looked at in the CSON file object. The following are directly added as properties to the grammar object:
- name: explained in beginner guide. Used as a human friendly label in the language selection window.
- fileTypes: explained in beginner guide. Array of file extensions used to score grammars against a given file. If not provided, an empty array will be automatically created.
- scopeName: explained in beginner guide. The root scope applied to all characters, regardless of pattern matches.
- foldingStopMarker: will be covered in intermediate tips. Seems more or less useless for now.
- maxTokensPerLine: the maximum number of rules that will be applied per line. Potentially added by the createGrammar method in the above walk through if not already set.
- maxLineLength: the maximum number of characters that will be tokenized per line. When internally set to Infinity, it will have no limit. Potentially added or modified by the createGrammar method in the above walk through. Setting to Infinity directly in the grammar file results in an error and the grammar will not load.
The following are also recognised properties of the file object, but are processed somewhat before being added to the grammar object:
- injections: the grammar property injections is set to the result of a new Injections called with the grammar object and the injections property of the file object. The Injections class is addressed in another section. Basically, it is a set of scopes, and rules that are applied when the scope is reached. They only work when the grammar is the active one. For when it's not active, injectionSelector must be used. For more on injections, see advanced tips.
- injectionSelector: scopes to insert the grammar into when they occur in another grammar. For example, language-hyperlink uses 'text - string.regexp, string - string.regexp, comment, source.gfm' (the - sign means the following scope is not allowed; so it injects in text, but not when also in string.regexp). The grammar property is defined as the result of calling new ScopeSelector on the file object's injectionSelector property. If this is not defined, a value of null is used.
- patterns: the grammar object's rawPatterns property is directly set to the patterns property of the file object.
- repository: the grammar object's rawRepository property is directly set to the repository property of the file object.
- firstLineMatch: if defined, it is used to create a new (oniguruma) regex object, which is then made the value of the grammar object's firstLineRegex property.
Finally, there are some properties of the grammar object that are created directly in the construction of the new grammar:
- emitter: an emitter is a JavaScript object that can be used to time function execution based on events emitted by the emitter.
- repository: initialised to null, but the method getRepository (not called during construction) sets it to a rule set using information in the rawRepository property, which in turn was set by the file object's repository.
- initialRule: initialised to null, but the method getInitialRule (called during tokenization) sets it to @createRule({@scopeName, patterns: @rawPatterns}) if it doesn't already have a value (that's not null). It is the first set of rules that will be checked.
- includedGrammarScopes: initialised to an empty array. When a separate grammar is included by this grammar, its scope name is added to this array.
Also bear in mind that some properties were added by the GrammarRegistry and Package classes when formed that way.
When the grammar object (from here: grammar) is constructed, it is not quite ready. At some point, the method tokenizeLines is called on the text of a file. This splits the text by line (\n character) and passes each one to the tokenizeLine method, each time passing in the ruleStack variable from the previous line.
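The way tokenizeLines threads state from one line to the next can be sketched like this. The JavaScript below is a simplified model, not first-mate's code. The tokenizeLine argument is a stand-in supplied by the caller, and the shape of its return value ({tags, ruleStack}) is an assumption made for the example.

```javascript
// Simplified model of tokenizeLines: split on "\n" and pass each line's
// resulting rule stack into the tokenization of the next line.
function tokenizeLines(text, tokenizeLine) {
  let ruleStack = null; // the first line starts with no rule stack
  return text.split('\n').map((line) => {
    const result = tokenizeLine(line, ruleStack);
    ruleStack = result.ruleStack; // carried over to the following line
    return result.tags;
  });
}

// Stand-in tokenizer: records how deep the stack was when the line began,
// and pushes one entry for every "{" it sees.
function fakeTokenizeLine(line, ruleStack) {
  const stack = ruleStack === null ? ['initial'] : ruleStack.slice();
  const depthAtStart = stack.length;
  for (const ch of line) {
    if (ch === '{') stack.push('block');
  }
  return { tags: [line, depthAtStart], ruleStack: stack };
}

console.log(tokenizeLines('a {\nb {\nc', fakeTokenizeLine));
// [ [ 'a {', 1 ], [ 'b {', 2 ], [ 'c', 3 ] ]
```

Carrying the stack forward is what lets a begin/end rule opened on one line still be active on the next.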
In tokenizeLine, it does the following:
- First, it checks for a long line (as determined by the maxLineLength property), cutting it down if needed.
- It then converts the line into an OnigString, which is written in C and presumably makes the following steps quicker.
- Now it checks if the ruleStack is not null. When called by tokenizeLines, the first call will have this null value.
  - If not null: the ruleStack is copied (shallow) and the scopeName and contentScopeName properties of each object it contains are pushed to another array, if they exist.
  - If null: it executes the method getInitialRule, which sets the grammar's initialRule property to a new Rule, called with the grammar object (this), its GrammarRegistry, and the options object {@scopeName, patterns: @rawPatterns}.
  - It pulls the scopeName and contentScopeName properties from initialRule, and sets these values as the first object in a ruleStack array. The rule stack keeps track of all currently active rules, and will only try to match the latest rule to be added to the stack.
- Further behaviour can be determined by reading the source code directly.
This class handles the patterns that are applied based on scope, when this grammar is the active one.
Properties:
- grammar: the grammar class it was called with.
- injections: an array of objects, with the properties selector and patterns. selector is a ScopeSelector class instance, formed from the key of each property in the grammar file's injections value. patterns is the same as ever, with at least the top level include statements resolved. I'll need to test if nested include statements are also found properly.
- scanners: an object that is presumably to hold instances of the Scanner class.
A Rule is an object with some metadata properties and an array of Patterns. The Rule itself does not match any text. This is for the Patterns, and the special endPattern.
Perhaps the most relevant class to a package author, besides the grammar class. This is the one that holds the regex and other properties you add.
A hidden property: disabled is checked when creating a new Rule.
Immediately added properties:
- grammar: the grammar class it was called with
- registry: the grammar registry class it was called with
- include: a reference to a rule stored in the grammar repository
- popRule: whether a match with this pattern should remove the rule it is a part of from the rule stack.
- hasBackReferences: overrides automatic detection; if null, the class will instead check for back references using /\\\d+/ on the match.
Grammar file properties that are processed:
- name: this is used to make @scopeName, which is then passed to @grammar.createRule()
- contentName: same as name, but the internal property is contentScopeName
- match: if end is a property or popRule is true, and it has backreferences (either set explicitly or detected by a quick regex), the grammar will set this value to the property @match. If not, it is set to the property @regexSource
- begin: (this is only looked at if match doesn't exist). @regexSource is set to the begin value.