Blog 2019/9/7
Let's learn how to implement a Lisp or Scheme-like language by writing a series of increasingly capable interpreters.
- previous: Step 1b: A performance tweak
- next: Step 3: Evaluation
Time to parse our tokens into an AST (Abstract Syntax Tree)!
We'll start out with a very simple grammar:
SEXPR = INT | LIST
LIST = OPAREN CPAREN
     | OPAREN SEXPR { WSPACE SEXPR } CPAREN
(A SEXPR is an s-expression, short for symbolic expression.)
The above bit of EBNF says:
- a SEXPR is made up of either an INT or a LIST
- a LIST is either an empty list or a populated list, where
  - an empty list is an OPAREN followed by a CPAREN, and
  - a populated list is an OPAREN, a SEXPR, zero or more pairs of WSPACE and SEXPR, followed by a CPAREN.
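As a concrete example, here's how the tokens of (1 2) line up with that grammar (a hand-worked derivation, just for illustration):

(1 2)
= OPAREN SEXPR WSPACE SEXPR CPAREN    (a populated LIST)
= OPAREN INT   WSPACE INT   CPAREN    (each SEXPR here is an INT)
=   "("   "1"    " "   "2"    ")"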
The parsing technique we will use is called recursive descent.
In this technique, each of the production rules and terminals in the EBNF grammar is implemented as a function, and these functions may call each other recursively.
Because these functions will call each other recursively, they need to share the same function signature. All of these functions will:
- take one argument: the list of remaining tokens
- return the resulting AST node and the list of remaining tokens, or (None, None) if the function did not match
The general pattern will be something like the following pseudocode:
def parse_x(tokens):
    if able_to_parse:
        return (ast, remaining_tokens)
    else:
        return (None, None)
Start by writing some stubs in parse.py:
from lex import load_tokendefs, tokenize
def parse_int(tokens):
    return (None, None)

def parse_oparen(tokens):
    return (None, None)

def parse_cparen(tokens):
    return (None, None)

def parse_wspace(tokens):
    return (None, None)

def parse_list(tokens):
    return (None, None)

def parse_sexpr(tokens):
    return (None, None)
(Note: do not name this file parser.py, as that will conflict with Python's parser module.)
We'll start with the easy part: the terminals. We just need to check if the token type is correct.
Here's an implementation of parse_int:
def parse_int(tokens):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if token[0] == "INT":
        return (token, tokens[1:])
    else:
        return (None, None)
If the next token is an INT, return an AST node (here, we'll use the token itself as an AST node) along with the tokens which are left over after we consume the INT token.
Otherwise, return (None, None).
Try it out:
$ python
Python 2.7.16 (default, Mar 4 2019, 09:02:22)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> from parse import *
>>> tokens = [("INT", "42")]
>>> parse_int(tokens)
(('INT', '42'), [])
For clarity, let's extract out a helper function, is_toktype:
def is_toktype(toktype, token):
    return token[0] == toktype
This makes parse_int a bit clearer:
def parse_int(tokens):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype("INT", token):
        return (token, tokens[1:])
    else:
        return (None, None)
Let's implement parse_oparen and parse_cparen:
def parse_oparen(tokens):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype("OPAREN", token):
        return (token, tokens[1:])
    else:
        return (None, None)

def parse_cparen(tokens):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype("CPAREN", token):
        return (token, tokens[1:])
    else:
        return (None, None)
Hmm, these terminal-parsing functions are nearly identical. Let's factor out a helper function, parse_terminal:
def parse_terminal(tokens, toktype):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype(toktype, token):
        return (token, tokens[1:])
    else:
        return (None, None)

def parse_int(tokens):
    return parse_terminal(tokens, "INT")

def parse_oparen(tokens):
    return parse_terminal(tokens, "OPAREN")

def parse_cparen(tokens):
    return parse_terminal(tokens, "CPAREN")

def parse_wspace(tokens):
    return parse_terminal(tokens, "WSPACE")
The above refactor will save us some work in the future if we need to add more terminal parsing functions or change their implementation.
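For example, if a later step were to add a (purely hypothetical) SYMBOL token type, supporting it would be a one-liner:

def parse_symbol(tokens):
    # hypothetical: parses a SYMBOL terminal, if the lexer ever emits one.
    return parse_terminal(tokens, "SYMBOL")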
Let's add a test:
def _test_parse_int():
    tokens = [("INT", 1), ("WSPACE", " ")]
    (ast, remaining) = parse_int(tokens)
    assert ast == ("INT", 1)
    assert remaining == [("WSPACE", " ")]
$ python
Python 2.7.16 (default, Mar 4 2019, 09:02:22)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> parse._test_parse_int()
>>>
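If you'd like a bit more coverage, here's a sketch of a similar test for the other terminal parsers (it isn't wired into the test runner we'll write later, so call it by hand if you use it):

def _test_parse_terminals():
    # each terminal parser consumes exactly one token of its type...
    assert parse_oparen([("OPAREN", "(")]) == (("OPAREN", "("), [])
    assert parse_cparen([("CPAREN", ")")]) == (("CPAREN", ")"), [])
    assert parse_wspace([("WSPACE", " ")]) == (("WSPACE", " "), [])
    # ...and refuses to match a token of the wrong type, or an empty token list.
    assert parse_oparen([("INT", "1")]) == (None, None)
    assert parse_cparen([]) == (None, None)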
Implementing production rules involves coming up with a coding technique for each of the features of EBNF (each is sketched in generic form after this list):
- concatenation: FOO BAR BAZ
  - all of parse_foo, parse_bar and parse_baz must succeed, in order
- alternatives: FOO | BAR | BAZ
  - just one of parse_foo, parse_bar, parse_baz must succeed
- repetition: FOO { WSPACE FOO }
  - parse_foo, followed by a while loop of calls to parse_wspace and parse_foo
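In generic terms, those three patterns look something like the following (a rough sketch using made-up parse_foo / parse_bar / parse_baz parsers, not code we'll actually add to parse.py):

# concatenation: FOO BAR BAZ -- each parser consumes what the previous one left over.
def parse_foobarbaz(tokens):
    (foo, remaining) = parse_foo(tokens)
    if foo is None:
        return (None, None)
    (bar, remaining) = parse_bar(remaining)
    if bar is None:
        return (None, None)
    (baz, remaining) = parse_baz(remaining)
    if baz is None:
        return (None, None)
    return (("FOOBARBAZ", [foo, bar, baz]), remaining)

# alternatives: FOO | BAR -- try each parser in turn against the same tokens.
def parse_foo_or_bar(tokens):
    (ast, remaining) = parse_foo(tokens)
    if ast is not None:
        return (ast, remaining)
    return parse_bar(tokens)

# repetition: FOO { WSPACE FOO } -- one FOO, then a loop of WSPACE-FOO pairs.
def parse_foos(tokens):
    (ast, remaining) = parse_foo(tokens)
    if ast is None:
        return (None, None)
    children = [ast]
    while True:
        (ws, remaining2) = parse_wspace(remaining)
        if ws is None:
            break
        (ast, remaining2) = parse_foo(remaining2)
        if ast is None:
            break
        children.append(ast)
        remaining = remaining2
    return (("FOOS", children), remaining)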
First, we'll implement the simpler of the two production rules: SEXPR = INT | LIST
def parse_sexpr(tokens):
    (ast, remaining) = parse_int(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        (ast, remaining) = parse_list(tokens)
        if ast is not None:
            return (ast, remaining)
        else:
            return (None, None)
This is pretty simple:
- try to parse an int
- if that worked, return the AST node and the remaining tokens
- otherwise, try to parse a list
- if that worked, return the AST node and the remaining tokens
- otherwise, return (None, None)
However, the final if/else clause is a bit redundant. If ast fails the "is not None" test, then it is None, because parse_list returned (None, None). So, in either case, we want to return whatever parse_list returned to us, i.e.
else:
    (ast, remaining) = parse_list(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        return (ast, remaining)
This means we can replace the entire else clause by delegating to parse_list:
def parse_sexpr(tokens):
    (ast, remaining) = parse_int(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        return parse_list(tokens)
(Note that because we are delegating to one of either parse_int or parse_list, we won't actually see a "SEXPR" node anywhere in the resulting AST.)
The other non-terminal function is a bit of a doozy.
Recall our grammar:
LIST = OPAREN CPAREN
| OPAREN SEXPR { WSPACE SEXPR } CPAREN
def parse_list(tokens):
    # we have to have at least two tokens.
    if len(tokens) < 2:
        return (None, None)
    # the first token must be an OPAREN.
    (ast, remaining) = parse_oparen(tokens)
    if ast is None:
        return (None, None)
    # try to match the empty list.
    (ast, remaining2) = parse_cparen(remaining)
    if ast is not None:
        ast2 = ("LIST", [])
        return (ast2, remaining2)
    # try to match a populated list.
    (sexpr, remaining) = parse_sexpr(remaining)
    if sexpr is None:
        return (None, None)
    children = [sexpr]
    while len(remaining) > 0:
        # CPAREN signals the end of the list.
        (ast, remaining2) = parse_cparen(remaining)
        if ast is not None:
            remaining = remaining2
            break
        if len(remaining) == 1:
            # only one token left and it wasn't a CPAREN. bad parse.
            return (None, None)
        # otherwise, there should be a space and another sexpr.
        (ast, remaining) = parse_wspace(remaining)
        if ast is None:
            return (None, None)
        (ast, remaining) = parse_sexpr(remaining)
        if ast is None:
            return (None, None)
        children.append(ast)
    ast = ("LIST", children)
    return (ast, remaining)
See if you can follow the code by reading the comments.
The returned AST node will be ("LIST", children), where children is a list of child AST nodes.
(Note that while constructing the LIST AST node, we discard any WSPACE tokens.)
Here are some examples to illustrate what LIST nodes look like:
() -> ("LIST", [])
(1) -> ("LIST", [("INT", "1")])
(1 2) -> ("LIST", [("INT", "1"), ("INT", "2")])
(()) -> ("LIST", [("LIST", [])])
(1 (2 3)) -> ("LIST", [("INT", "1"), ("LIST", [("INT", "2"), ("INT", "3")])])
Let's write a test:
def _test_parse_list():
    # parse a list from "() "
    tokens = [("OPAREN", "("), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [])
    assert remaining == [("WSPACE", " ")]
    # parse a list from "(42) "
    tokens = [("OPAREN", "("), ("INT", "42"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("INT", "42")])
    assert remaining == [("WSPACE", " ")]
    # parse "(1 2) "
    tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
    assert remaining == [("WSPACE", " ")]
    # parse "(()) "
    tokens = [("OPAREN", "("), ("OPAREN", "("), ("CPAREN", ")"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("LIST", [])])
    assert remaining == [("WSPACE", " ")]
$ python
Python 2.7.16 (default, Mar 4 2019, 09:02:22)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> parse._test_parse_list()
>>>
and now that we have implemented parse_list, we can also run (and test) parse_sexpr:
$ python
Python 2.7.16 (default, Mar 4 2019, 09:02:22)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from parse import *
>>> tokens = [("INT", 1), ("WSPACE", " ")]
>>> parse_sexpr(tokens)
(('INT', 1), [('WSPACE', ' ')])
>>>
def _test_parse_sexpr():
    # parse an int
    tokens = [("INT", 1), ("WSPACE", " ")]
    (ast, remaining) = parse_sexpr(tokens)
    assert ast == ("INT", 1)
    assert remaining == [("WSPACE", " ")]
    # parse a list
    tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_sexpr(tokens)
    assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
    assert remaining == [("WSPACE", " ")]
$ python
Python 2.7.16 (default, Mar 4 2019, 09:02:22)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> parse._test_parse_sexpr()
>>>
We'll wrap up the parser by implementing a single function which serves as its interface, parse:
def parse(tokens):
    (ast, remaining_tokens) = parse_sexpr(tokens)
    if len(remaining_tokens) > 0:
        raise Exception("Leftover tokens! %s" % remaining_tokens)
    return ast
For now, this is pretty simple: try to parse a SEXPR
and ensure there aren't any leftover tokens (which would indicate a failed parse).
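For example, if the input contains more than one top-level s-expression, parse_sexpr only consumes the first one, and parse complains about the rest (an illustrative session; traceback abbreviated):

>>> from parse import *
>>> tdefs = load_tokendefs("tokendefs.txt")
>>> parse(tokenize(tdefs, "1 2"))
Traceback (most recent call last):
  ...
Exception: Leftover tokens! [('WSPACE', ' '), ('INT', '2')]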
Write a test:
def _test_parse():
    tdefs = load_tokendefs("tokendefs.txt")
    ast = parse(tokenize(tdefs, "1"))
    assert ast == ("INT", "1")
    ast = parse(tokenize(tdefs, "()"))
    assert ast == ("LIST", [])
    ast = parse(tokenize(tdefs, "(1 2 3)"))
    assert ast == ("LIST", [("INT", "1"), ("INT", "2"), ("INT", "3")])
    ast = parse(tokenize(tdefs, "(1 (2 3))"))
    assert ast == ("LIST", [("INT", "1"), ("LIST", [("INT", "2"), ("INT", "3")])])
and write a function which runs all of the tests:
def _test():
    _test_parse_int()
    _test_parse_list()
    _test_parse_sexpr()
    _test_parse()
$ python
Python 2.7.16 (default, Mar 4 2019, 09:02:22)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> parse._test()
>>>
and finish off with a __main__ implementation:
if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        text = open(sys.argv[-1]).read()
    else:
        text = sys.stdin.read()
    tdefs = load_tokendefs("tokendefs.txt")
    import pprint
    pprint.pprint(parse(tokenize(tdefs, text)))
We can now compare the behavior of lex.py and parse.py:
$ echo -n '(1 2 3)' | python lex.py
[('OPAREN', '('),
('INT', '1'),
('WSPACE', ' '),
('INT', '2'),
('WSPACE', ' '),
('INT', '3'),
('CPAREN', ')')]
$ echo -n '(1 2 3)' | python parse.py
('LIST', [('INT', '1'), ('INT', '2'), ('INT', '3')])
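A nested input should work too; based on the _test_parse case above, I'd expect something like:

$ echo -n '(1 (2 3))' | python parse.py
('LIST', [('INT', '1'), ('LIST', [('INT', '2'), ('INT', '3')])])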
Continued in Step 3.