Skip to content

Instantly share code, notes, and snippets.

@masak
Created September 13, 2012 08:50
Show Gist options
  • Save masak/3712987 to your computer and use it in GitHub Desktop.
Save masak/3712987 to your computer and use it in GitHub Desktop.
Mini-challenge: parsing indented input into an AST

Yesterday I wanted to parse something like this:

alpha
  beta
  gamma
delta
  epsilon
    zeta

And end up with an AST nested like this: (alpha(beta, gamma), delta(epsilon(zeta))). The actual type of the AST could be just a list of lists of strings, or actual Node objects; it's all isomorphic. That is, the AST contains exactly the same hierarchy as the text. Assume for simplicity that we want to hardcode indentation as being two spaces.

I came up with two possible ways to solve this.

  • Recursively parse indentation. That is, enter a subrule whenever parsing things that are more indented. Maybe there will be a little backtracking of whitespace whenever subrules fail. This feels like the "right" approach.

  • Build AST after parsing all the lines. That is, register all the indentations, and then based on those create the whole AST in one go. Will make the parsing way easier, but feels like throwing away information only to rebuild it later.

I went with the first approach, but didn't get far because I wasn't fit to program yesterday, apparently. Going to try again today, and thought I might as well make it into a mini-challenge to let others have a go, too.

It's probably a good idea to write tests for this. 😁

@gerdr
Copy link

gerdr commented Sep 13, 2012

Low-level solution (ie bypassing the grammar engine):

use v6;

sub parse ($src)  {
    my $ast = [];
    my @stack = $ast;

    for $src.lines -> $line is copy {
        next if $line ~~ /^\s*$/;

        my $level = 1;
        while $line ~~ /^'  '/ {
            $line .= subst(/^'  '/, '');
            ++$level;
        }

        while $level < @stack {
            @stack.pop;
        }

        die 'incorrect nesting'
            unless $level == @stack;

        given [] {
            @stack[*-1].push($line => $_);
            @stack.push($_);
        }
    }

    return $ast;
}

say parse($_).perl given '
alpha
  beta
  gamma

delta
  epsilon
    zeta
';

@dginev
Copy link

dginev commented Sep 13, 2012

Randomly stumbled on this teaser in #perl6 and it itched enough to give it a go.
The code below is my first grammar in Perl6 so you can expect all kinds of misuses of the intended constructs.

grammar Ident {
    token TOP {^ [<line> \n]* <line> \n? $}
    token line { <ident> <word> }
    token word { \w+ }
    token ident { \s* }
}

class Ident::Actions {
    method TOP($/) {
        my $ast = ().list;
        for $<line>».ast -> $entry is copy {
            my $list = $ast;
            $list = $list[$list.end] while ($$entry<level>-- > 0 );
            @$list.push( [ $entry<word>.Str ] );
        }
        make $ast;
    }
    #Note: Assuming 2 spaces for an indent level
    method line($/)  { make {word=>$<word>, level=>$<ident>.chars / 2 } }
}

my $req = 
'alpha
  beta
  gamma
delta
  epsilon
    zeta';

say Ident.parse($req, :actions(Ident::Actions.new)).ast.perl();

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment