
Blog 2020/6/11

<- previous | index | next ->

mklexer.py: a lexer generator

See also part 2, 3, 4, 5

Over the past few years, I've written a number of regex-based lexers and recursive-descent parsers, following the technique described in Gary Bernhardt's A Compiler from Scratch (free and well worth your time -- go watch it now!).

I've realized that writing this code is really just a mechanical process. This means we can write a program to write these lexers and parsers for us!

(Yes, I know, this is not a new idea :)

mklexer.py is a program which generates Python code which implements a regex-based lexer, according to a token definitions file. The generated lexer will output tokens in a standardized JSON format.
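
The typical workflow (taken from the Usage section below) is to generate the lexer once from the token definitions, then run it over input files:

$ ./mklexer.py tokendefs.txt > lexer.py
$ chmod +x lexer.py
$ ./lexer.py input.txt | jq .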

Goals:

  • Reap the benefits of a declarative approach
  • Generate code which is easy to understand and easy to hack on
  • Reduce time spent on boilerplate

Non-goals:

  • Lexer performance

The token definitions file format

The format of the token definitions file is a series of pairs of lines:

  • a line naming the token type (in ALLCAPS),
  • followed by a line defining the regex which recognizes that type of token.

For example, here is a token definitions file which tokenizes a trivial S-expression syntax:

tokendefs.txt:

OPAREN
\(
CPAREN
\)
NUMBER
-?\d+(\.\d+)?
SYMBOL
[a-z]+
WSPACE
\s+

Here's an example of the sort of input which we expect this lexer to tokenize:

(define fibonacci
  (lambda (n)
    (if (lessthan n 2)
      n
      (sum (fibonacci (sum n -1)) (fibonacci (sum n -2))))))
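
Under the hood, the generated lexer simply tries each regex in definition order at the current offset and takes the first match (the full generated source appears further down). Here is a stripped-down sketch of that idea, not the generated code itself:

import re

# The same token definitions, compiled in order.
tokendefs = [
    ('OPAREN', re.compile(r'\(')),
    ('CPAREN', re.compile(r'\)')),
    ('NUMBER', re.compile(r'-?\d+(\.\d+)?')),
    ('SYMBOL', re.compile(r'[a-z]+')),
    ('WSPACE', re.compile(r'\s+')),
]

def next_token(text, offset):
    # The first regex which matches at 'offset' wins.
    for token_type, regex in tokendefs:
        m = regex.match(text, offset)
        if m:
            return (token_type, m.group()), offset + len(m.group())
    raise Exception("Can't lex at offset %d" % offset)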

JSON output

The generated lexer will spit out the tokens in JSON.

The JSON will conform to the following format:

  • The top-level JSON structure will be an array (of tokens)
  • Each token will be a JSON object (dictionary)
  • Each token will contain the following keys:
    • type, whose value is always TOKEN
    • token_type, whose value is one of the ALLCAPS token types
    • text, whose value is the text matched by the regex

For example, if we use the above tokendefs.txt to tokenize the following input:

(define pi 3.14159)

We should get the following JSON output:

[
  {
    "type": "TOKEN",
    "token_type": "OPAREN",
    "text": "("
  },
  {
    "type": "TOKEN",
    "token_type": "SYMBOL",
    "text": "define"
  },
  {
    "type": "TOKEN",
    "token_type": "WSPACE",
    "text": " "
  },
  {
    "type": "TOKEN",
    "token_type": "SYMBOL",
    "text": "pi"
  },
  {
    "type": "TOKEN",
    "token_type": "WSPACE",
    "text": " "
  },
  {
    "type": "TOKEN",
    "token_type": "NUMBER",
    "text": "3.14159"
  },
  {
    "type": "TOKEN",
    "token_type": "CPAREN",
    "text": ")"
  }
]
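
A downstream parser can consume this JSON directly. As a hypothetical example (the subprocess call and the whitespace filtering below are purely illustrative, not part of mklexer.py), a consumer might drop the WSPACE tokens before parsing:

import json
import subprocess

# Run the generated lexer and load its JSON token stream.
raw = subprocess.check_output(['./lexer.py', 'input2.txt'])
tokens = json.loads(raw)

# A recursive-descent parser typically ignores whitespace tokens.
significant = [tok for tok in tokens if tok['token_type'] != 'WSPACE']
for tok in significant:
    print(tok['token_type'], repr(tok['text']))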

"Fast" JSON output

The generated lexer also supports an array-based "fast" token format.

Each token is an array of two elements:

  • A numeric index into the list of token types
  • The token text

For our same (define pi 3.14159) example, the "fast" output looks like this:

./lexer.py input2.txt --fast
[[0, "("], [3, "define"], [4, " "], [3, "pi"], [4, " "], [2, "3.14159"], [1, ")"], [4, "\n"]]
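
Since a "fast" token carries only an index, the consumer needs the list of token types (in the same order as tokendefs.txt) to recover the names. A minimal sketch of a hypothetical decoder, assuming the same five token types as above:

import json

TOKEN_TYPES = ['OPAREN', 'CPAREN', 'NUMBER', 'SYMBOL', 'WSPACE']  # same order as tokendefs.txt

fast_json = '[[0, "("], [3, "define"], [4, " "], [3, "pi"], [4, " "], [2, "3.14159"], [1, ")"], [4, "\\n"]]'
for type_index, text in json.loads(fast_json):
    print(TOKEN_TYPES[type_index], repr(text))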

Usage

$ ./mklexer.py --help
mklexer.py: generate a Python lexer based on a token definitions file.

Display help:
  mklexer.py -h
  mklexer.py --help

Generate a lexer using tokendefs.txt:
  mklexer.py tokendefs.txt > lexer.py
  chmod +x lexer.py

tokendefs.txt consists of pairs of TOKENTYPE and <regex> lines.

Example tokendefs.txt:
NUMBER
-?\d+(\.\d+)?
SYMBOL
[a-zA-Z_][a-zA-Z0-9_-]*

Use the lexer on input.txt, producing the standard JSON token format:
  ./lexer.py input.txt | jq .

Two example tokens in standard JSON format:
  {'type': 'TOKEN', 'token_type': 'NUMBER', 'text': '3.14159'}
  {'type': 'TOKEN', 'token_type': 'SYMBOL', 'text': 'fibonacci'}

Use the lexer on input.txt, producing "fast" array-based JSON tokens:
  ./lexer.py --fast input.txt | jq .

"fast" tokens are [<token type index>, <matched text>] pairs.

The same example tokens, but in 'fast' JSON format:
  [0, '3.14159']
  [1, 'fibonacci']

input2.txt:

(define pi 3.14159)

input.txt:

(define fibonacci
  (lambda (n)
    (if (lessthan n 2)
      n
      (sum (fibonacci (sum n -1)) (fibonacci (sum n -2))))))

Makefile:

lexer.py: mklexer.py tokendefs.txt
	./mklexer.py tokendefs.txt > lexer.py
	chmod +x lexer.py

clean:
	rm -f lexer.py

.PHONY: clean
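
With this Makefile in place, regenerating or cleaning up the lexer is just a make invocation:

$ make lexer.py
$ make clean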

mklexer.py:

#!/usr/bin/env python
# mklexer.py: generate a Python lexer based on a token definitions file.
# See https://github.com/pepaslabs/mklexer.py
# Copyright (c) 2020 Jason Pepas
# Released under the terms of the MIT license.
# See https://opensource.org/licenses/MIT

import sys
import os

try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


def usage(fd):
    """Prints the usage help to the given file descriptor."""
    exe = os.path.basename(sys.argv[0])
    w = fd.write
    if fd is sys.stderr:
        w("Error: bad usage.\n")
        w("\n")
    else:
        w("%s: generate a Python lexer based on a token definitions file.\n" % exe)
        w("\n")
    w("Display help:\n")
    w("  %s -h\n" % exe)
    w("  %s --help\n" % exe)
    w("\n")
    w("Generate a lexer using tokendefs.txt:\n")
    w("  %s tokendefs.txt > lexer.py\n" % exe)
    w("  chmod +x lexer.py\n")
    w("""
tokendefs.txt consists of pairs of TOKENTYPE and <regex> lines.

Example tokendefs.txt:
NUMBER
-?\\d+(\\.\\d+)?
SYMBOL
[a-zA-Z_][a-zA-Z0-9_-]*

Use the lexer on input.txt, producing the standard JSON token format:
  ./lexer.py input.txt | jq .

Two example tokens in standard JSON format:
  {'type': 'TOKEN', 'token_type': 'NUMBER', 'text': '3.14159'}
  {'type': 'TOKEN', 'token_type': 'SYMBOL', 'text': 'fibonacci'}

Use the lexer on input.txt, producing "fast" array-based JSON tokens:
  ./lexer.py --fast input.txt | jq .

"fast" tokens are [<token type index>, <matched text>] pairs.

The same example tokens, but in 'fast' JSON format:
  [0, '3.14159']
  [1, 'fibonacci']
""")


def load_tokendefs(fpath):
    """Loads the token definitions file.
    Returns a list of (TOKENTYPE, regex) pairs."""
    tokendefs = []
    fd = open(fpath, 'r')
    content = fd.read()
    fd.close()
    lines = content.splitlines()
    i = 0
    while i < len(lines):
        tokentype = lines[i]
        if len(tokentype) == 0:
            raise Exception("Line %d: Zero-length token type." % (i+1))
        i += 1
        if i >= len(lines):
            raise Exception(
                "Line %d: Token type '%s' has no corresponding regex." \
                % (i, tokentype)
            )
        regex = lines[i]
        if len(regex) == 0:
            raise Exception("Line %d: Zero-length regex." % (i+1))
        i += 1
        pair = (tokentype, regex)
        tokendefs.append(pair)
        continue
    return tokendefs


def codegen_tokendefs(tokendefs):
    """Generates the Python code of the tokendefs table."""
    fd = StringIO()
    w = fd.write
    w("tokendefs = [\n")
    for token_type, regex in tokendefs:
        w("    ('%s', " % token_type)
        # do everything we can to avoid the backslash plague.
        if "'" not in regex:
            w("r'%s'" % regex)
        elif '"' not in regex:
            w('r"%s"' % regex)
        elif "'''" not in regex and not regex.startswith("'") and not regex.endswith("'"):
            w("r'''%s'''" % regex)
        elif '"""' not in regex and not regex.startswith('"') and not regex.endswith('"'):
            w('r"""%s"""' % regex)
        else:
            # oh well, at least we tried :shrug:
            w(regex.__repr__())
        w("),\n")
    w("]\n")
    code = fd.getvalue()
    fd.close()
    return code


def codegen(tokendefs):
    """Generates the Python code of the lexer."""
    fd = StringIO()
    w = fd.write
    w("""#!/usr/bin/env python
# DO NOT EDIT: this lexer was generated by mklexer.py.
import sys
import re
import json
""")
    tokendefs_code = codegen_tokendefs(tokendefs)
    w(tokendefs_code)
    w("""
tokendefs = [(toktype, re.compile(regex)) for (toktype, regex) in tokendefs]

def get_linenum_charnum(text, offset):
    \"\"\"Returns the line number and character number of the offset.\"\"\"
    linenum = 1
    charnum = 1
    i = 0
    while i < offset:
        if text[i] == '\\n':
            linenum += 1
            charnum = 1
        else:
            charnum += 1
        # advance to the next character.
        i += 1
        continue
    return (linenum, charnum)

def consume_next_token(text, offset, use_fast_format):
    \"\"\"Consumes the next token from the given text input.
    Returns a (token, offset) pair.
    Throws if no tokens match.\"\"\"
    i = 0
    while i < len(tokendefs):
        (token_type, regex) = tokendefs[i]
        m = regex.match(text, offset)
        if m is None:
            i += 1
            continue
        else:
            matched_text = m.group()
            if use_fast_format:
                token = [i, matched_text]
            else:
                token = {
                    'type': 'TOKEN',
                    'token_type': token_type,
                    'text': matched_text,
                }
            new_offset = offset + len(matched_text)
            return (token, new_offset)
    # none of the token types matched
    (linenum, charnum) = get_linenum_charnum(text, offset)
    raise Exception(
        \"Can't lex starting at line %d, character %d, context: '%s'\" \\
        % (linenum, charnum, text[offset:offset+32])
    )

def lex(text, use_fast_format):
    \"\"\"Returns a list of tokens for the given text input.\"\"\"
    tokens = []
    offset = 0
    while offset < len(text):
        (token, offset) = consume_next_token(text, offset, use_fast_format)
        tokens.append(token)
        continue
    return tokens

if __name__ == '__main__':
    infile = [arg for arg in sys.argv[1:] if not arg.startswith('-')][-1]
    use_fast_format = False
    if '--fast' in sys.argv[1:]:
        use_fast_format = True
    fd = open(infile, 'r')
    text = fd.read()
    fd.close()
    tokens = lex(text, use_fast_format)
    output = json.dumps(tokens)
    if not output.endswith('\\n'):
        output += '\\n'
    sys.stdout.write(output)
""")
    code = fd.getvalue()
    fd.close()
    return code


if __name__ == "__main__":
    if len(sys.argv) < 2:
        usage(sys.stderr)
        sys.exit(1)
    if '-h' in sys.argv or '--help' in sys.argv:
        usage(sys.stdout)
        sys.exit(0)
    # the single non-option arg is the tokendefs file.
    tokendefs_fpath = None
    non_option_args = [arg for arg in sys.argv[1:] if not arg.startswith('-')]
    if len(non_option_args) != 1:
        usage(sys.stderr)
        sys.exit(1)
    tokendefs_fpath = non_option_args[0]
    tokendefs = load_tokendefs(tokendefs_fpath)
    code = codegen(tokendefs)
    sys.stdout.write(code)

tokendefs.txt:

OPAREN
\(
CPAREN
\)
NUMBER
-?\d+(\.\d+)?
SYMBOL
[a-z]+
WSPACE
\s+