Yesterday I wanted to parse something like this:
alpha
beta
gamma
delta
epsilon
zeta
And end up with an AST nested like this: (alpha(beta, gamma), delta(epsilon(zeta)))
. The actual type of the AST could be just a list of lists of strings, or actual Node
objects; it's all isomorphic. That is, the AST contains exactly the same hierarchy as the text. Assume for simplicity that we want to hardcode indentation as being two spaces.
I came up with two possible ways to solve this.
-
Recursively parse indentation. That is, enter a subrule whenever parsing things that are more indented. Maybe there will be a little backtracking of whitespace whenever subrules fail. This feels like the "right" approach.
-
Build AST after parsing all the lines. That is, register all the indentations, and then based on those create the whole AST in one go. Will make the parsing way easier, but feels like throwing away information only to rebuild it later.
I went with the first approach, but didn't get far because I wasn't fit to program yesterday, apparently. Going to try again today, and thought I might as well make it into a mini-challenge to let others have a go, too.
It's probably a good idea to write tests for this. 😁
Low-level solution (ie bypassing the grammar engine):