Blog 2020/6/12
In a previous post, I introduced a lexer generator.
In this post I'll describe a few changes:
- output formats: tokens, fast, tokens-lines and fast-lines
- pragmas: line-oriented, discard and eof
- lexical grammars now support empty lines and comments
I typed up a descriptive prose spec of mklexer.py, but I hated it. Instead I'll describe it by example.
If this is your token definitions file:
ASSIGN
=
NUMBER
-?[0-9](\.[0-9]+)?
SYMBOL
[a-z]+
and this is your input:
pi=3.14159
then the lexer JSON output (in the default tokens format) will be:
[
{"type": "format", "format": "tokens"},
[
{"type": "token", "token-type": "SYMBOL", "text": "pi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "3.14159"}
]
]
The top-level structure is still an array,
but now the first item is a format object
and the second item is the list of tokens.
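To make the shape concrete, here's a minimal sketch of a downstream consumer. It assumes the JSON has been saved to a file named tokens.json; that filename (and how you get the lexer's output into it) is a placeholder, not something mklexer.py dictates.

import json

# Load the lexer output: a two-item array of [format-object, token-list].
with open("tokens.json") as f:
    format_obj, tokens = json.load(f)

assert format_obj["format"] == "tokens"

for token in tokens:
    print(token["token-type"], repr(token["text"]))

# SYMBOL 'pi'
# ASSIGN '='
# NUMBER '3.14159'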
If the lexer is invoked with the --fast command-line option, the above example would have this output:
[
{"type": "format", "format": "fast", "token-types": ["ASSIGN", "NUMBER", "SYMBOL"]},
[
[2, "pi"],
[0, "="],
[1, "3.14159"]
]
]
The format object now contains a token-types list of token type names.
The tokens are now tuples,
where the first item is a numeric index into the token-types list,
and the second item is the token text.
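Reconstructing the readable form from the fast output is just an index lookup per token. Here's a sketch, again assuming the JSON sits in a file:

import json

with open("tokens.json") as f:
    format_obj, tokens = json.load(f)

# The token-types list turns each numeric index back into a type name.
token_types = format_obj["token-types"]

for index, text in tokens:
    print(token_types[index], repr(text))

# SYMBOL 'pi'
# ASSIGN '='
# NUMBER '3.14159'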
There are two additional formats: tokens-lines and fast-lines,
which break up the list of tokens into individual lines.
These formats are activated using the line-oriented pragma.
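Consuming line-oriented output just adds one level of nesting. A sketch, using the tokens-lines example shown below (the filename is again a placeholder):

import json

with open("tokens.json") as f:
    format_obj, lines = json.load(f)

assert format_obj["format"] == "tokens-lines"

# Each entry in 'lines' is itself a list of tokens for one input line.
for line_number, line in enumerate(lines, start=1):
    print(line_number, [token["text"] for token in line])

# 1 ['pi', '=', '3.14159']
# 2 ['phi', '=', '1.618']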
Lexical grammar:
#pragma line-oriented
ASSIGN
=
NUMBER
-?[0-9](\.[0-9]+)?
SYMBOL
[a-z]+
NEWLINE
\n
Input text:
pi=3.14159
phi=1.618
or, more specifically:
pi=3.14159\nphi=1.618
JSON output in tokens-lines format:
[
{"type": "format", "format": "tokens-lines"},
[
[
{"type": "token", "token-type": "SYMBOL", "text": "pi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "3.14159"},
],
[
{"type": "token", "token-type": "SYMBOL", "text": "phi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "1.618"}
]
]
]
and in fast-lines format:
[
{"type": "format", "format": "fast-lines", "token-types": ["ASSIGN", "NUMBER", "SYMBOL", "NEWLINE"]},
[
[
[2, "pi"],
[0, "="],
[1, "3.14159"],
],
[
[2, "phi"],
[0, "="],
[1, "1.618"]
]
]
]
The discard pragma lists token types which will be automatically discarded from the output.
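(On the consumer side, the effect is the same as filtering by type, roughly like the sketch below. mklexer.py does this filtering for you, so you never need to write it.)

def drop_discarded(tokens, discard=("WSPACE", "COMMENT")):
    # Keep only tokens whose type is not in the discard set.
    return [t for t in tokens if t["token-type"] not in discard]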
Lexical grammar:
ASSIGN
=
NUMBER
-?[0-9](\.[0-9]+)?
SYMBOL
[a-z]+
WSPACE
\s+
COMMENT
;.*
Input text:
phi = 1.618 ; the golden ratio
Output without the discard pragma:
[
{"type": "format", "format": "tokens"},
[
{"type": "token", "token-type": "SYMBOL", "text": "phi"},
{"type": "token", "token-type": "WSPACE", "text": " "},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "WSPACE", "text": " "},
{"type": "token", "token-type": "NUMBER", "text": "1.618"}
{"type": "token", "token-type": "WSPACE", "text": " "},
{"type": "token", "token-type": "COMMENT", "text": "; the golden ratio"},
]
]
Now, we use the discard pragma to get rid of whitespace and comments:
#pragma discard WSPACE COMMENT
ASSIGN
=
NUMBER
-?[0-9](\.[0-9]+)?
SYMBOL
[a-z]+
WSPACE
\s+
COMMENT
;.*
and our output becomes:
[
{"type": "format", "format": "tokens"},
[
{"type": "token", "token-type": "SYMBOL", "text": "phi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "1.618"}
]
]
The lexical grammar now supports comments and empty lines.
Here's the grammar from our previous example, but with some spacing and comments:
# A lexical grammar for trivial assignment statements.
# our parser doesn't care about whitespace and comments
#pragma discard WSPACE COMMENT
# the assignment operator
ASSIGN
=
# integer and floating-point numbers
NUMBER
-?[0-9](\.[0-9]+)?
SYMBOL
[a-z]+
WSPACE
\s+
# comments extend to the end of the current line
COMMENT
;.*
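If you wanted to do the same preprocessing yourself, a sketch might look like the following. This is my own code, not mklexer.py's; note that #pragma lines also start with #, so presumably they're recognized before ordinary comments are dropped.

def significant_lines(path):
    # Yield only the lines that define tokens or pragmas:
    # skip blank lines and '#' comments, but keep '#pragma' lines,
    # which also start with '#'.
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#pragma"):
                yield line
            elif not line.strip() or line.startswith("#"):
                continue
            else:
                yield line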
The eof pragma will append an 'EOF' token.
Lexical grammar:
#pragma eof
ASSIGN
=
NUMBER
-?[0-9](\.[0-9]+)?
SYMBOL
[a-z]+
Input text:
pi=3.14159
JSON output:
[
{"type": "format", "format": "tokens"},
[
{"type": "token", "token-type": "SYMBOL", "text": "pi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "3.14159"},
{"type": "token", "token-type": "EOF", "text": ""}
]
]
For the --fast format, it also adds EOF to token-types:
[
{"type": "format", "format": "fast", "token-types": ["ASSIGN", "NUMBER", "SYMBOL", "EOF"]},
[
[2, "pi"],
[0, "="],
[1, "3.14159"],
[3, ""]
]
]
For line-oriented output, the EOF token will always appear on its own line. The earlier line-oriented example would look like:
[
{"type": "format", "format": "tokens-lines"},
[
[
{"type": "token", "token-type": "SYMBOL", "text": "pi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "3.14159"},
],
[
{"type": "token", "token-type": "SYMBOL", "text": "phi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "1.618"}
],
[
{"type": "token", "token-type": "EOF", "text": ""}
]
]
]
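To close, here's a sketch of a consumer that leans on both the line-oriented output and the EOF token, parsing the pi/phi assignments above. None of this is part of mklexer.py, and reading the JSON from stdin is just one way to get it into the program.

import json
import sys

# Read the tokens-lines output produced with the eof pragma.
format_obj, lines = json.load(sys.stdin)

assignments = {}
for line in lines:
    types = [t["token-type"] for t in line]
    if types == ["EOF"]:
        break  # the EOF token always arrives on a line of its own
    assert types == ["SYMBOL", "ASSIGN", "NUMBER"]
    assignments[line[0]["text"]] = float(line[2]["text"])

print(assignments)
# {'pi': 3.14159, 'phi': 1.618}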