Blog 2020/6/18
Part of writing a parser generator is specifying the grammar used to describe grammars, i.e. the grammar-grammar or metagrammar.
Just as ABNF is defined using ABNF, so too will I use my metagrammar to describe my metagrammar.
Let's consider the grammar for a trivial lisp-like language.
Here is a classic recursive Fibonacci implementation in such a language:
(define fib
  (lambda (n)
    (if (lessthan n 2)
      n
      (sum
        (fib (sum n -1))
        (fib (sum n -2))))))
The token definitions for such a language might be:
SYMBOL
[a-z]+
NUMBER
-?[0-9]+
OPAREN
\(
CPAREN
\)
WSPACE
\s+
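To make these token definitions concrete, here is a minimal first-match-wins lexer sketch in Python (an illustration only, not the actual mkparser.py lexer):

```python
import re

# The token definitions from above, tried in order; first match wins.
TOKEN_DEFS = [
    ('SYMBOL', r'[a-z]+'),
    ('NUMBER', r'-?[0-9]+'),
    ('OPAREN', r'\('),
    ('CPAREN', r'\)'),
    ('WSPACE', r'\s+'),
]

def tokenize(text):
    """Split text into (token_type, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_DEFS:
            m = re.match(pattern, text[pos:])
            if m:
                tokens.append((name, m.group(0)))
                pos += m.end()
                break
        else:
            raise ValueError('no token matches at position %d' % pos)
    return tokens
```

For example, tokenize('(fib 5)') yields OPAREN, SYMBOL, WSPACE, NUMBER, CPAREN pairs.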
and the grammar used to define this language might be:
program = symbolicExpression+
symbolicExpression = list | atom
list = OPAREN symbolicExpression* CPAREN
atom = SYMBOL | NUMBER
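These four production rules translate almost mechanically into a recursive-descent parser. A minimal sketch, assuming a pre-lexed list of (token_type, text) pairs with WSPACE already discarded (not the code mkparser.py would generate):

```python
def parse_program(tokens):
    # program = symbolicExpression+
    pos = 0
    exprs = []
    while pos < len(tokens):
        expr, pos = parse_sexpr(tokens, pos)
        exprs.append(expr)
    return exprs

def parse_sexpr(tokens, pos):
    # symbolicExpression = list | atom
    if tokens[pos][0] == 'OPAREN':
        return parse_list(tokens, pos)
    return parse_atom(tokens, pos)

def parse_list(tokens, pos):
    # list = OPAREN symbolicExpression* CPAREN
    pos += 1  # consume OPAREN
    items = []
    while tokens[pos][0] != 'CPAREN':
        item, pos = parse_sexpr(tokens, pos)
        items.append(item)
    return items, pos + 1  # consume CPAREN

def parse_atom(tokens, pos):
    # atom = SYMBOL | NUMBER
    name, text = tokens[pos]
    assert name in ('SYMBOL', 'NUMBER')
    return text, pos + 1
```

Note how each production rule became one function, with choice becoming an if and repetition becoming a loop.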
# The mkparser.py meta-grammar.
# A grammar is at least one production rule.
grammar = productionRule+
# A production rule is a non-terminal, '=', and an expr, terminated by ';'.
productionRule = NONTERMINAL EQUALS expr SEMICOLON
# An expression is a sequence, a choice, or a unit.
expr = sequence | choice | unit
# A sequence is at least two (choice or unit).
sequence = (choice|unit) (choice|unit)+
# A choice is at least two units, delimited by '|'.
choice = unit ( BAR unit )+
# A unit is a group or atom, with an optional arity operator.
unit = zeroOrOne | zeroOrMore | oneOrMore | group | atom
zeroOrOne = (group|atom) QMARK
zeroOrMore = (group|atom) ASTERISK
oneOrMore = (group|atom) PLUS
# A group is an expr enclosed in parentheses.
group = OPAREN expr CPAREN
# An atom is the name of a production rule or the name of a token type.
atom = NONTERMINAL | TERMINAL
and the token definitions are:
#pragma discard WSPACE COMMENT
WSPACE
\s+
NONTERMINAL
[a-z][a-zA-Z0-9_]*
TERMINAL
[A-Z][A-Z0-9_]*
EQUALS
=
QMARK
\?
ASTERISK
\*
PLUS
\+
OPAREN
\(
CPAREN
\)
BAR
\|
COMMENT
#.*
At the lexical level, a grammar consists of:
- production rule names
- token type names
- operators
- whitespace and comments (both ignored)
Structurally, a grammar consists of one or more production rules.
Production rules are named using camelCase, with a lower-case leading character:
[a-z][a-zA-Z0-9_]*
This metagrammar uses = to define a production rule:
foo = bar
Definition symbols commonly used by other grammars include :, :=, and ::=.
Note that quoted literal characters are not allowed: all terminals must be tokens, produced by the lexer.
Token types are indicated in ALLCAPS:
[A-Z][A-Z0-9_]*
As with many other grammar notations, ( and ) are used to indicate grouping.
Sequence is indicated by simply listing several whitespace-delimited components:
ay = bee cee dee
Other grammars sometimes use , to indicate sequence.
Choice is indicated using |:
foo = bar | bin | baz
Note, however, that this operator indicates ordered choice, à la PEG. Thus, the resulting parser does not implement backtracking, and there is no possibility of ambiguity.
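Ordered choice is straightforward to sketch as a parser combinator: try each alternative in order and commit to the first one that matches. A toy illustration (not the mkparser.py internals), where tokens are assumed to be (token_type, text) pairs:

```python
def terminal(name):
    # Matcher for a single token of the given type.
    def parse(tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == name:
            return (tokens[pos][1], pos + 1)
        return None
    return parse

def ordered_choice(parsers):
    # Try each alternative in order; the first success wins,
    # and later alternatives are never considered.
    def parse(tokens, pos):
        for p in parsers:
            result = p(tokens, pos)
            if result is not None:
                return result
        return None
    return parse

# atom = SYMBOL | NUMBER
atom = ordered_choice([terminal('SYMBOL'), terminal('NUMBER')])
```

Because the first successful alternative wins outright, there is never more than one possible parse.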
Note also that | only applies to its nearest neighbors. That is, this:
foo = one two | three four
is equivalent to:
foo = one ( two | three ) four
and does not mean:
foo = ( one two ) | ( three four )
Note also that alternation does not "see" an arity operator as a component; rather, it "sees" the group consisting of the arity operator applied to its component.
That is to say, this:
foo = one+ | two* | three?
is equivalent to:
foo = ( one + ) | ( two * ) | ( three ? )
and does not mean:
foo = one ( + | two ) ( * | three ) ?
This metagrammar borrows the arity operators from regex notation and PEG:
? : zero or one (a.k.a. optional)
* : zero or more
+ : one or more
These can be used with terminals, non-terminals, and groups:
foo = NUMBER? expression* (SYMBOL NUMBER)+
Note that, like PEG, these operators are greedy.
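Greedy repetition can be sketched in the same combinator style: the repetition matches as many times as it can and never gives a match back. Again a toy illustration over (token_type, text) pairs, not the generated parser's actual code:

```python
def terminal(name):
    # Matcher for a single token of the given type.
    def parse(tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == name:
            return (tokens[pos][1], pos + 1)
        return None
    return parse

def zero_or_more(parser):
    # Greedy: keep matching until the sub-parser fails,
    # and never back off to try a shorter match.
    def parse(tokens, pos):
        results = []
        while True:
            r = parser(tokens, pos)
            if r is None:
                return (results, pos)
            value, pos = r
            results.append(value)
    return parse

numbers = zero_or_more(terminal('NUMBER'))
```

One consequence of greedy matching without backtracking: a sequence like NUMBER* NUMBER can never succeed, because NUMBER* consumes every NUMBER and never gives one back.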
Other grammars use [], {}, etc. to indicate repetition.
After tokenization, all whitespace tokens and comment tokens are discarded before the tokens of the grammar are read by the parser generator.
Thus, foo = bar+ and foo = bar + are equivalent.
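The effect of the #pragma discard line can be sketched as a simple post-lex filter (a hypothetical helper; the real mkparser.py plumbing may differ):

```python
# Token types discarded before parsing, per the "#pragma discard" line.
DISCARD = {'WSPACE', 'COMMENT'}

def discard_tokens(tokens):
    # Drop ignorable tokens after lexing, before the parser sees them.
    return [t for t in tokens if t[0] not in DISCARD]
```

After filtering, the token streams for "foo = bar+" and "foo = bar +" are identical, which is why the two spellings parse the same.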
You may have noticed that none of the grammars I've shown use semicolons to delimit each production rule, and yet production rules are defined as:
productionRule = NONTERMINAL EQUALS expr SEMICOLON
So what gives?
It turns out it is easy to perform automatic semicolon insertion for these grammars.
After lexing the grammar, I run the JSON tokens through this middleware script to automatically insert the SEMICOLON tokens:
#!/usr/bin/env python
# Automatic semicolon insertion for mkparser.py grammars.
# Transformation:
#   NONTERMINAL EQUALS -> SEMICOLON NONTERMINAL EQUALS
#   (except for the first production rule)
import sys
import json

format_obj = None
use_fast_format = None
nonterminal_toktype_index = None
equals_toktype_index = None
semicolon_toktype_index = None

def toktype_is_nonterminal(token):
    if use_fast_format:
        return token[0] == nonterminal_toktype_index
    else:
        return token['token_type'] == 'NONTERMINAL'

def toktype_is_equals(token):
    if use_fast_format:
        return token[0] == equals_toktype_index
    else:
        return token['token_type'] == 'EQUALS'

def semicolon_token():
    if use_fast_format:
        return [semicolon_toktype_index, ';']
    else:
        return {'type': 'token', 'token_type': 'SEMICOLON', 'text': ';'}

def insert_semicolons(tokens):
    new_tokens = [tokens[0]]
    previous_token = tokens[1]
    for token in tokens[2:]:
        if toktype_is_nonterminal(previous_token) and toktype_is_equals(token):
            new_tokens.append(semicolon_token())
            new_tokens.append(previous_token)
            previous_token = token
            continue
        new_tokens.append(previous_token)
        previous_token = token
    new_tokens.append(previous_token)
    new_tokens.append(semicolon_token())
    return new_tokens

if __name__ == '__main__':
    js_obj = json.loads(sys.stdin.read())
    format_obj = js_obj[0]
    use_fast_format = format_obj['format'] == 'fast'
    if use_fast_format:
        nonterminal_toktype_index = format_obj['token_types'].index('NONTERMINAL')
        equals_toktype_index = format_obj['token_types'].index('EQUALS')
        format_obj['token_types'].append('SEMICOLON')
        semicolon_toktype_index = format_obj['token_types'].index('SEMICOLON')
    tokens = js_obj[1]
    tokens = insert_semicolons(tokens)
    js_obj = [format_obj, tokens]
    output = json.dumps(js_obj)
    sys.stdout.write(output)
    if not output.endswith('\n'):
        sys.stdout.write('\n')
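To illustrate the transformation end-to-end, here is a self-contained sketch of the same rule, simplified to the dict token format (add_semicolons is a hypothetical helper mirroring what the script's insert_semicolons does):

```python
def add_semicolons(tokens):
    # Insert a SEMICOLON token before every NONTERMINAL EQUALS pair
    # except the first one, plus a final SEMICOLON at the end.
    out = []
    for i, tok in enumerate(tokens):
        if (out and i + 1 < len(tokens)
                and tok['token_type'] == 'NONTERMINAL'
                and tokens[i + 1]['token_type'] == 'EQUALS'):
            out.append({'token_type': 'SEMICOLON', 'text': ';'})
        out.append(tok)
    out.append({'token_type': 'SEMICOLON', 'text': ';'})
    return out
```

Given the token stream for "a = X b = Y", this produces the stream for "a = X ; b = Y ;", which is what the metagrammar's productionRule rule expects.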