
Blog 2025/4/9

<- previous | index | next ->

A general-purpose symbolic expression parser in Python

Symbolic expressions are a simple syntax for representing trees. You may recognize them as the core of the syntax of Scheme and Common Lisp.
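
For example, (a (b c) d) describes a tree: a node containing the atom a, the subtree (b c), and the atom d. The parser developed below represents it as nested Python lists:

>>> parse_exprs('(a (b c) d)')
['a', ['b', 'c'], 'd']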

This parser starts with a trick from Peter Norvig: surround each ( and ) with extra spaces, which lets you simply .split() the text into tokens.

However, that approach breaks on any string literal that contains spaces.
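
Here's a minimal sketch of the trick and its failure mode (naive_tokenize is just an illustrative name, not part of the parser below):

>>> def naive_tokenize(text):
...     return text.replace('(', ' ( ').replace(')', ' ) ').split()
...
>>> naive_tokenize('(+ 1 (* 2 3))')
['(', '+', '1', '(', '*', '2', '3', ')', ')']
>>> naive_tokenize('(print "hello world")')
['(', 'print', '"hello', 'world"', ')']

Note how the string literal gets torn into two tokens.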

We can fix this by using a string literal regex (see my previous post) to first break the text up into string and non-string chunks, then apply Norvig's .split() to the non-string chunks.
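
Concretely: re.split() with a capturing group keeps the matched text in its output, so the result alternates between non-string and string chunks (this is the same str_regex as in the parser below):

>>> import re
>>> str_regex = r'("(?:[^\"\\]|\\[\s\S])*")'
>>> re.split(str_regex, '(print "hello world")')
['(print ', '"hello world"', ')']

Only the non-string chunks get the paren-padding treatment; each string chunk passes through as a single token.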

Throw in a little recursion to handle nesting, plus a couple of checks to catch any unbalanced parens (demonstrated at the end of this post), and we're done!

Demos:

$ cat exprs1.txt 
(statement (return (string "I said \"hello\" to the cat")))
$ ./parse_exprs.py exprs1.txt 
['statement', ['return', ['string', '"I said \\"hello\\" to the cat"']]]
$ cat exprs2.txt 
(vardecl (type pointer (type char)) (name message) (value (string "I said \"hello\" to the cat")))
$ ./parse_exprs.py exprs2.txt 
['vardecl',
 ['type', 'pointer', ['type', 'char']],
 ['name', 'message'],
 ['value', ['string', '"I said \\"hello\\" to the cat"']]]
#!/usr/bin/env python3
# A general-purpose symbolic expression parser.
# Copyright 2025 Jason Pepas.
# Released under the terms of the MIT license, see https://opensource.org/license/mit

import sys
import re

def string_chunkify(text):
    "break 'text' into string and non-string chunks"
    str_regex = r'("(?:[^\"\\]|\\[\s\S])*")'
    # re.split with a capturing group keeps the string literals in the output.
    return re.split(str_regex, text, flags=re.MULTILINE)

def tokenize_exprs(text):
    "tokenize the symbolic expressions in 'text'"
    tokens = []
    for chunk in string_chunkify(text):
        if chunk.startswith('"'):
            # A string literal is a single token, spaces and all.
            tokens.append(chunk)
        else:
            # This clever trick comes from Peter Norvig!  https://norvig.com/lispy.html
            tokens += (chunk
                       .replace('(', ' ( ')
                       .replace(')', ' ) ')
                       .split())
    return tokens

def parse_exprs(text):
    "parse the symbolic expressions in 'text' and return a syntax tree."
    def parse_expr_tokens(tokens, offset=0):
        "recursively parse the list of tokens into a syntax tree."
        ast = []
        if len(tokens) - offset == 0:
            # Empty input: nothing to parse.
            return (ast, offset)
        i = offset
        if tokens[i] != '(':
            raise Exception("Missing '(' before tokens: %s" % tokens[offset:offset+10])
        i += 1
        while i < len(tokens):
            token = tokens[i]
            if token == '(':
                # A nested expression: recurse, then resume after it.
                (subast, i) = parse_expr_tokens(tokens, i)
                ast.append(subast)
                continue
            if token == ')':
                i += 1
                return (ast, i)
            else:
                ast.append(token)
                i += 1
                continue
        raise Exception("Missing ')' at end of tokens.")
    tokens = tokenize_exprs(text)
    (ast, i) = parse_expr_tokens(tokens)
    if i != len(tokens):
        raise Exception("Leftover tokens: %s" % tokens[i:])
    return ast

if __name__ == "__main__":
    with open(sys.argv[-1]) as fd:
        text = fd.read()
    ast = parse_exprs(text)
    import pprint
    pprint.pprint(ast)
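
And a quick demonstration of the unbalanced-paren checks (a REPL session, with parse_exprs imported from the script above):

>>> parse_exprs('(statement (return')
Traceback (most recent call last):
  ...
Exception: Missing ')' at end of tokens.
>>> parse_exprs('(a) b)')
Traceback (most recent call last):
  ...
Exception: Leftover tokens: ['b', ')']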