Blog 2020/6/18
Part of writing a parser generator is specifying the grammar used to describe grammars, i.e. the grammar-grammar or metagrammar.
Just as ABNF is defined using ABNF, so too will I use my metagrammar to describe my metagrammar.
Let's consider the grammar for a trivial lisp-like language.
Here is a classic recursive Fibonacci implementation in such a language:
(define fib
  (lambda (n)
    (if (lessthan n 2)
      n
      (sum
        (fib (sum n -1))
        (fib (sum n -2))))))
The token definitions for such a language might be:
SYMBOL
[a-z]+
NUMBER
-?[0-9]+
OPAREN
\(
CPAREN
\)
WSPACE
\s+
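To make these token definitions concrete, here is a minimal first-match-wins lexer sketch in Python (an illustration only, not the actual mkparser.py lexer):

```python
import re

# The token definitions from above, tried in order; first match wins.
TOKEN_DEFS = [
    ('SYMBOL', r'[a-z]+'),
    ('NUMBER', r'-?[0-9]+'),
    ('OPAREN', r'\('),
    ('CPAREN', r'\)'),
    ('WSPACE', r'\s+'),
]

def tokenize(text):
    """Split text into (token_type, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_DEFS:
            m = re.match(pattern, text[pos:])
            if m:
                tokens.append((name, m.group(0)))
                pos += m.end()
                break
        else:
            raise ValueError('no token matches at position %d' % pos)
    return tokens
```

For example, tokenize('(fib 5)') yields OPAREN, SYMBOL, WSPACE, NUMBER, CPAREN pairs.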
and the grammar used to define this language might be:
program = symbolicExpression+
symbolicExpression = list | atom
list = OPAREN symbolicExpression* CPAREN
atom = SYMBOL | NUMBER
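These four production rules translate almost mechanically into a recursive-descent parser. A minimal sketch, assuming a pre-lexed list of (token_type, text) pairs with WSPACE already discarded (not the code mkparser.py would generate):

```python
def parse_program(tokens):
    # program = symbolicExpression+
    pos = 0
    exprs = []
    while pos < len(tokens):
        expr, pos = parse_sexpr(tokens, pos)
        exprs.append(expr)
    return exprs

def parse_sexpr(tokens, pos):
    # symbolicExpression = list | atom
    if tokens[pos][0] == 'OPAREN':
        return parse_list(tokens, pos)
    return parse_atom(tokens, pos)

def parse_list(tokens, pos):
    # list = OPAREN symbolicExpression* CPAREN
    pos += 1  # consume OPAREN
    items = []
    while tokens[pos][0] != 'CPAREN':
        item, pos = parse_sexpr(tokens, pos)
        items.append(item)
    return items, pos + 1  # consume CPAREN

def parse_atom(tokens, pos):
    # atom = SYMBOL | NUMBER
    name, text = tokens[pos]
    assert name in ('SYMBOL', 'NUMBER')
    return text, pos + 1
```

Note how each production rule became one function, with choice becoming an if and repetition becoming a loop.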
# The mkparser.py meta-grammar.
# A grammar is at least one production rule.
grammar = productionRule+
# A production rule is a non-terminal, '=', and an expr, terminated by ';'.
productionRule = NONTERMINAL EQUALS expr SEMICOLON
# An expression is a sequence, a choice, or a unit.
expr = sequence | choice | unit
# A sequence is at least two (choice or unit).
sequence = (choice|unit) (choice|unit)+
# A choice is at least two units, delimited by '|'.
choice = unit ( BAR unit )+
# A unit is a group or atom, with an optional arity operator.
unit = zeroOrOne | zeroOrMore | oneOrMore | group | atom
zeroOrOne = (group|atom) QMARK
zeroOrMore = (group|atom) ASTERISK
oneOrMore = (group|atom) PLUS
# A group is an expr enclosed in parentheses.
group = OPAREN expr CPAREN
# An atom is the name of a production rule or the name of a token type.
atom = NONTERMINAL | TERMINAL
and the token definitions are:
#pragma discard WSPACE COMMENT
WSPACE
\s+
NONTERMINAL
[a-z][a-zA-Z0-9_]*
TERMINAL
[A-Z][A-Z0-9_]*
EQUALS
=
QMARK
\?
ASTERISK
\*
PLUS
\+
OPAREN
\(
CPAREN
\)
BAR
\|
COMMENT
#.*
At the lexical level, a grammar consists of:
- production rule names
- token type names
- operators
- whitespace and comments (both ignored)
Structurally, a grammar consists of one or more production rules.
Production rules are named using camelCase, with a lower-case leading character:
[a-z][a-zA-Z0-9_]*
This metagrammar uses = to define a production rule:
foo = bar
Definition symbols commonly used by other grammars include :, :=, and ::=.
Note that quoted literal characters are not allowed: all terminals must be tokens, produced by the lexer.
Token types are indicated in ALLCAPS:
[A-Z][A-Z0-9_]*
As with many other grammar notations, ( and ) are used to indicate grouping.
Sequence is indicated by simply listing several whitespace-delimited components:
ay = bee cee dee
Other grammars sometimes use , to indicate sequence.
Choice is indicated using |:
foo = bar | bin | baz
Note, however, that this operator indicates ordered choice, à la PEG. Thus, the resulting parser does not implement backtracking, and there is no possibility of ambiguity.
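Ordered choice is straightforward to sketch as a parser combinator: try each alternative in order and commit to the first one that matches. A toy illustration (not the mkparser.py internals), where tokens are assumed to be (token_type, text) pairs:

```python
def terminal(name):
    # Matcher for a single token of the given type.
    def parse(tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == name:
            return (tokens[pos][1], pos + 1)
        return None
    return parse

def ordered_choice(parsers):
    # Try each alternative in order; the first success wins,
    # and later alternatives are never considered.
    def parse(tokens, pos):
        for p in parsers:
            result = p(tokens, pos)
            if result is not None:
                return result
        return None
    return parse

# atom = SYMBOL | NUMBER
atom = ordered_choice([terminal('SYMBOL'), terminal('NUMBER')])
```

Because the first successful alternative wins outright, there is never more than one possible parse.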
Note also that | only applies to its nearest neighbors. That is, this:
foo = one two | three four
is equivalent to:
foo = one ( two | three ) four
and does not mean:
foo = ( one two ) | ( three four )
Note also that alternation does not "see" an arity operator as a component; rather, it "sees" the group consisting of the arity operator applied to its component.
That is to say, this:
foo = one+ | two* | three?
is equivalent to:
foo = ( one + ) | ( two * ) | ( three ? )
and does not mean:
foo = one ( + | two ) ( * | three ) ?
This metagrammar borrows the arity operators from regex notation and PEG:
? : zero or one (a.k.a. optional)
* : zero or more
+ : one or more
These can be used with terminals, non-terminals, and groups:
foo = NUMBER? expression* (SYMBOL NUMBER)+
Note that, like PEG, these operators are greedy.
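Greedy repetition can be sketched in the same combinator style: the repetition matches as many times as it can and never gives a match back. Again a toy illustration over (token_type, text) pairs, not the generated parser's actual code:

```python
def terminal(name):
    # Matcher for a single token of the given type.
    def parse(tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == name:
            return (tokens[pos][1], pos + 1)
        return None
    return parse

def zero_or_more(parser):
    # Greedy: keep matching until the sub-parser fails,
    # and never back off to try a shorter match.
    def parse(tokens, pos):
        results = []
        while True:
            r = parser(tokens, pos)
            if r is None:
                return (results, pos)
            value, pos = r
            results.append(value)
    return parse

numbers = zero_or_more(terminal('NUMBER'))
```

One consequence of greedy matching without backtracking: a sequence like NUMBER* NUMBER can never succeed, because NUMBER* consumes every NUMBER and never gives one back.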
Other grammars use [], {}, etc. to indicate repetition.
After tokenization, all whitespace tokens and comment tokens are discarded before the tokens of the grammar are read by the parser generator.
Thus, foo = bar+ and foo = bar + are equivalent.
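The effect of the #pragma discard line can be sketched as a simple post-lex filter (a hypothetical helper; the real mkparser.py plumbing may differ):

```python
# Token types discarded before parsing, per the "#pragma discard" line.
DISCARD = {'WSPACE', 'COMMENT'}

def discard_tokens(tokens):
    # Drop ignorable tokens after lexing, before the parser sees them.
    return [t for t in tokens if t[0] not in DISCARD]
```

After filtering, the token streams for "foo = bar+" and "foo = bar +" are identical, which is why the two spellings parse the same.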
You may have noticed that none of the grammars I've shown use semicolons to delimit each production rule, and yet production rules are defined as:
productionRule = NONTERMINAL EQUALS expr SEMICOLON
So what gives?
It turns out it is easy to perform automatic semicolon insertion for these grammars.
After lexing the grammar, I run the JSON tokens through this middleware script to automatically insert the SEMICOLON tokens:
#!/usr/bin/env python
# Automatic semicolon insertion for mkparser.py grammars.
# Transformation:
#   NONTERMINAL EQUALS -> SEMICOLON NONTERMINAL EQUALS
#   (except for the first production rule)
import sys
import json

format_obj = None
use_fast_format = None
nonterminal_toktype_index = None
equals_toktype_index = None
semicolon_toktype_index = None

def toktype_is_nonterminal(token):
    if use_fast_format:
        return token[0] == nonterminal_toktype_index
    else:
        return token['token_type'] == 'NONTERMINAL'

def toktype_is_equals(token):
    if use_fast_format:
        return token[0] == equals_toktype_index
    else:
        return token['token_type'] == 'EQUALS'

def semicolon_token():
    if use_fast_format:
        return [semicolon_toktype_index, ';']
    else:
        return {'type': 'token', 'token_type': 'SEMICOLON', 'text': ';'}

def insert_semicolons(tokens):
    new_tokens = [tokens[0]]
    previous_token = tokens[1]
    for token in tokens[2:]:
        if toktype_is_nonterminal(previous_token) and toktype_is_equals(token):
            new_tokens.append(semicolon_token())
            new_tokens.append(previous_token)
            previous_token = token
            continue
        new_tokens.append(previous_token)
        previous_token = token
    new_tokens.append(previous_token)
    new_tokens.append(semicolon_token())
    return new_tokens

if __name__ == '__main__':
    js_obj = json.loads(sys.stdin.read())
    format_obj = js_obj[0]
    use_fast_format = format_obj['format'] == 'fast'
    if use_fast_format:
        nonterminal_toktype_index = format_obj['token_types'].index('NONTERMINAL')
        equals_toktype_index = format_obj['token_types'].index('EQUALS')
        format_obj['token_types'].append('SEMICOLON')
        semicolon_toktype_index = format_obj['token_types'].index('SEMICOLON')
    tokens = js_obj[1]
    tokens = insert_semicolons(tokens)
    js_obj = [format_obj, tokens]
    output = json.dumps(js_obj)
    sys.stdout.write(output)
    if not output.endswith('\n'):
        sys.stdout.write('\n')
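To illustrate the transformation end-to-end, here is a self-contained sketch of the same rule, simplified to the dict token format (add_semicolons is a hypothetical helper mirroring what the script's insert_semicolons does):

```python
def add_semicolons(tokens):
    # Insert a SEMICOLON token before every NONTERMINAL EQUALS pair
    # except the first one, plus a final SEMICOLON at the end.
    out = []
    for i, tok in enumerate(tokens):
        if (out and i + 1 < len(tokens)
                and tok['token_type'] == 'NONTERMINAL'
                and tokens[i + 1]['token_type'] == 'EQUALS'):
            out.append({'token_type': 'SEMICOLON', 'text': ';'})
        out.append(tok)
    out.append({'token_type': 'SEMICOLON', 'text': ';'})
    return out
```

Given the token stream for "a = X b = Y", this produces the stream for "a = X ; b = Y ;", which is what the metagrammar's productionRule rule expects.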