I don't have the usual programmer's education. I studied maths, and then dropped out of that, and am mostly self-taught. So, there are some parts of programming I always saw wearily, thinking to myself that I really should go to school to learn them. One remarkable such area is parsing and implementing languages.
Well... sure, school is always a good idea, but this is not that hard. In this article I will explain how to go from nothing to a functioning, extensible language, using Python and PyParsing. If you are as scared of grammars, parsers and all that jazz as I used to be, come along, it's pretty simple,
What is a grammar? It's what defines what is and what is not valid in our language, and it's the hardest part to get right, mostly because it's all about making decisions.
First decision: language name. I will call this language Least Optimized Language Ever, or LOLE
I will describe the language informally, then show you how to describe that grammar formally using PyParsing.
PyParsing is "constructive", you start creating small blacks, then combine those blocks into larger things.
Some things in LOLE will have names. For example, variables, or functions. So, we need to define what a name is. Most languages allow you to use identifiers that start with a letter, and then use letters or numbers. Let's go even simpler: identifiers are words. Just letters.
identifier = Word(alphas)Word is a PyParsing object that means "these characters". alphas is a constant that means "letters". So: "some letters".
You can actually try this in an interactive interpreter:
>>> from pyparsing import *
>>> identifier = Word(alphas)
>>> identifier.parseString("foo")
(['foo'], {})identifier.parseString takes a string, and tries to parse it as an identifier. So, "foo" is an identifier, and we get the match. Nice!
What happens if we try something that doesn't match?
>>> identifier.parseString("42")
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "pyparsing.py", line 1632, in parseString
raise exc
ParseException: Expected W:(ABCD...) (at char 0), (line:1, col:1)Surprise! You get an error. It's a pretty awful error, but it basically says "I was expecting a thing made of letters (ABCD...) and I did not get it".
Because error messages are important let's make that better already. There's always time to develop bad habits later.
>>> identifier = Word(alphas).setName('identifier')
>>> identifier.parseString("42")
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "pyparsing.py", line 1632, in parseString
raise exc
ParseException: Expected identifier (at char 0), (line:1, col:1)Now it says it was expecting an identifier, and it did not get one. So it's more clear. Let's try one last thing:
>>> identifier.parseString("foo bar")
(['foo'], {})You may say "hey, there are two identifiers there, and it only found one!" and you are right. That's because we are parsing identifier not identifierS. We told it to understand "foo bar" as an identifier, and it did just that. By default PyParsing will ignore anything left in the input after it found what it's looking for. If you want to make sure the whole input parses correctly, then you have to say so:
>>> identifier.parseString("foo bar", parseAll=True)
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "pyparsing.py", line 1632, in parseString
raise exc
ParseException: Expected end of text (at char 4), (line:1, col:5)And yes, that is a valid error. We asked to know if "foo bar" is an identifier. It's not, because to be one it would have to end after "foo" because identifiers can't have spaces, and it does not.
Let's move to something more challenging....
LOLE has literals for two types: strings and numbers. They are the usual thing, except that strings are python style, either single or double quoted, and support escaping quotes inside them.
I could now explain how to define numbers in terms of literals and combinations of symbols and numbers and so on. And strings as a list of characters surrounded by quotes. But I am going to tell you something much more important: learn your library.
literal = quotedString ^ pyparsing_common.numberquotedString is provided by PyParsing to define "strings as usual". And number defines "numbers as usual" so we don't have to do it. Because most languages need this, so having to do it each time is dumb.
We are also using ^ there. That means "one or the other". So literal will successfully parse things that are strings and also things that are numbers.
>>> literal.parseString("-43.21E12")
([-43210000000000.0], {})
>>> literal.parseString(r"'this is a string with a \' in it!'")
(["'this is a string with a \\' in it!'"], {})And of course, if you parse something else, it will fail:
>>> literal.parseString(r"foo")
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "pyparsing.py", line 1632, in parseString
raise exc
ParseException: Expected {quotedString using single or double quotes ^ {real number with scientific notation | real number | signed integer}} (at char 0), (line:1, col:1)Like the error says, it was expecting either a string quoted with single or double quotes, or one of a few kinds of numbers.
So, now that we have identifiers and values, we can make this more like a real language and implement ...
This is pretty simple:
foo = 42
So, an assignment is (for now):
- An identifier
- An equal sign
- A literal value
And this is how it looks in PyParsing:
>>> assignment = identifier + Literal('=') + literal
>>> assignment.parseString('foo = 42')
(['foo', '=', 42], {})So far, we only support programs that do one thing and then end. That is not very useful. So let's define what a program is: "A list of statements separated with semicolons". So far, the only kind of statement we have is assignment, so it will look like this:
program = delimitedList(assignment, delim=';')As usual, PyParsing does the heavy lifting for us: see what delimitedList does.
And now we can parse slightly more interesting things:
>>> program.parseString('foo = 42; bar="bat"')
(['foo', '=', 42, 'bar', '=', '"bat"'], {})Let's stop defining grammar now for a little bit (we will come back to it) and think about how to use this thing we are creating.
So we now have the world's most usesess language, that only supports syntax for assigning values to variables. How do we make it do that?
Well, we need to take the output of parseString and process it, doing what our "program" says we should do. That's an interpreter.
So, how do we "run" ['foo', '=', 42, 'bar', '=', '"bat"']?
For starters, I don't really like it. There are two statements there but we have six elements and no clear boundary between one statement and another. To fix that, we can change assignment:
>>> assignment = Group(identifier + Literal('=') + literal)
>>> program = delimitedList(assignment, delim=';')
>>> print program.parseString('foo = 42; bar="bat"')
[['foo', '=', 42], ['bar', '=', '"bat"']]Unsurprisingly, what Group does is group things, so we now have each statement as its own separate thing.
So, we could "run" this program like this:
>>> def run_program(code):
... parsed = program.parseString(code)
... vars = {}
... for name, _, value in parsed:
... vars[name] = value
... return vars
>>> run_program('foo = 42; bar="bat"')
{'foo': 42, 'bar': '"bat"'}And hey, that means this is an interpreter already. Not a very good one, tho!
Now, how about this LOLE program? foo = 42; bar = foo
As we have defined the language so far, that's illegal, because on the right side of the equal sign LOLE requires a literal, which is either a string or a number. What we really want on the right side is more general: an expression. An expression is something that eventually resolves to a value. So, variable names are expressions, function calls (when we have them) will be expressions, and things like 2 + 2 would be expressions if we implemented them.
So, let's introduce expressions, which are either literals, or identifiers (so far):
>>> expression = literal ^ identifier
>>> assignment = Group(identifier + Literal('=') + expression)
>>> program = delimitedList(assignment, delim=';')
>>> print program.parseString("foo = 42; bar = foo")
[['foo', '=', 42], ['bar', '=', 'foo']]So, a LOLE expression in an assignment can currently be three kinds of things:
- A number:
42 - An identifier:
'foo' - A string:
'"foo"'
What happens if we run this through run_program?
>>> run_program("foo = 42; bar = foo")
{'foo': 42, 'bar': 'foo'}That is not really right, is it? We would want bar to be set to 42. The problem is that we are not calculating the value of expressions, but just passing them along. We need eval_expression:
>>> vars = {}
>>> def eval_expression(expr):
... if isinstance(expr, str):
... if expr[0] in '"\'': # string literal
... return expr[1:-1] # remove quotes
... else: # identifier
... return vars.get(expr)
... # Number, return as-is
... return exprThis solves that problem of "resolving" variables, and also that of extraneous quotes in our string literals. And then we need to use it in run_program, too:
>>> def run_program(code):
... parsed = program.parseString(code)
... for name, _, value in parsed:
... vars[name] = eval_expression(value)
... return vars
>>> run_program("foo = 'quux'; bar = foo")
{'foo': 'quux', 'bar': 'quux'}