Skip to content

Instantly share code, notes, and snippets.

@samebchase
Last active November 29, 2020 13:35
Show Gist options
  • Save samebchase/8d37f7d961ecc981e9ab9a5d36763dd5 to your computer and use it in GitHub Desktop.
Save samebchase/8d37f7d961ecc981e9ab9a5d36763dd5 to your computer and use it in GitHub Desktop.
[WIP]: Raku Grammars

Parsing Clojure namespace forms using Raku grammars

One day, I started wondering if it would be possible to parse Clojure namespace forms and generate a dependency graph of the various namespaces used in a real-world Clojure project. While that was the original motivation, I ended up down the Raku grammar rabbit hole, and had an enjoyable time learning how to use them. I'm glad you're joining me in reliving that journey.

Background

Grammars

Informally speaking, grammars can be thought of as a set of rules that describe a language. With these rules, one can meaningfully parse (to make sense of, or deconstruct into its grammatical components) a piece of text. It turns out that this is a common task in computing. We need to frequently translate programs from one language to another. This is the job of a compiler. Before being able to translate it, the compiler needs to know whether the original program is even valid, according to the language's grammar.

While we have explained in theory what grammars are, Raku grammars help us model abstract grammars as a programming construct (the grammar keyword and its adjacent helpers) using which we can perform parsing tasks. It is important to understand this distinction.

First class grammars are considered one of the revolutionary features of Raku. Normally, you'd find grammars as a library or a standalone tool, but Raku has embraced it wholesale, and has a powerful implementation of grammars which makes light work of most parsing tasks.

Clojure

Clojure is a modern lisp, and happens to be the language I use in $DAYJOB. At the top of most Clojure files, there is a namespace form importing various internal and external namespaces. This way we can neatly organize our project into many separate files, rather than having to put them all into one big file. We are shortly going to design a grammar that can parse these namespace forms.

Grammar::Tracer

Grammar::Tracer is helpful in figuring out where your grammar fails to match, and is invaluable in debugging. Make sure you do zef install Grammar::Tracer before running the code examples.

Let's start cookin'

A trivial example

A Clojure namespace is a Lisp form, which as expected starts with an open paren and ends with a close paren. Let's write a grammar for one of the simplest Clojure forms, the empty list.

()
grammar EmptyLispForm {
    token TOP { <lparen> <rparen> }

    token lparen { '(' }
    token rparen { ')' }
}

This is one of the simplest possible grammars we can write. Parsing always starts with the TOP token, and recurses into the various component tokens from there. We just have two such tokens lparen and rparen which are used to denote the left and right parens. Play around with trivial.raku to see how we can parse this.

Warmup: Slaying a cliched interview question using Raku grammars

Q: Write a program that checks that parentheses are balanced.

For example:

() ;; balanced
(()) ;; balanced
(()(())) ;; balanced
(()(((()))) ;; unbalanced
grammar BalancedParens {
    token TOP { <balanced-paren> }

    token balanced-paren { <lparen> <balanced-paren>* <rparen> }

    token lparen { '(' }
    token rparen { ')' }
}

It's likely I've made this more wordy than necessary, but that's still only six rather-readable lines.

Now, under time pressure which seems more likely? Coding up a stack and fiddling with corner cases or using grammars with the awesomeness of Grammar::Tracer to quickly hack out a declarative solution.

It turns out, that we have just tackled one of the trickiest aspects of writing a real-world grammar, and that is dealing with nested structures. As programmers we know that when we see nested structures, we know we will have to deal with recursion in some form.

You can play around with balanced-parens.raku program on the console, and observe how the grammar is parsed.

Note: It turns out there is a better way to parse nested structures, but this is fine for now.

Parsing our first namespace form

Let's try to parse this simple namespace declaration:

;; 1   3
   |   |
;; (ns my-own.fancy.namespace)
    |                        |
;;  2                        4

While this is simple, it is going to be an important building block in tackling more complicated namespace forms. Let's break this down into its four component lexemes. We can see the open and close parens, we see "ns", the namespace my-own.fancy.namespace and finally the close paren. That's it! Let's tackle these individual pieces using a grammar.

grammar SimpleNs {
    token TOP { <simple-ns> }

    #                  1        2            3         4
    #                  |        |            |         |
    token simple-ns { <lparen> <ns-keyword> <ns-name> <rparen> }

    token ns-keyword { 'ns' }
    token ns-name { <.ns-name-component>+ % '.' }
    token ns-name-component { ( <.alnum> | '-' )+ }

    token lparen { '(' }
    token rparen { ')' }
}

Over here we can see that we have translated this into a simple Raku grammar. You could argue that defining simple-ns is not even required, we could have just put it in TOP directly, but anyway.

We will need to deal a little with regexes here. In the various flavours of regexes, + usually means one or more. | has a slightly different meaning from what you're expecting, but you can check the regex documentation for all the details on the difference between | and ||. Loosely speaking, we are saying that a namespace component, i.e. the thing between two dots, is made up of one or more alphanumneric characters or hyphens. Now, if there is a rule saying a namespace has to start with an alphabetic character, and not a number, the grammar will become a little more complex, but this is a pedagogic example, so we'll not be too pedantic.

I expect eagle-eyed readers to point out a few things:

  1. Where is <.alnum> being defined, and why does it have a dot before it?

alnum is predefined. The reason it has a . before it is so that we are not interested in capturing each letter; it's too low level. We are interested in capturing a top-level token like ns-name instead, and not each individual character. Play around with the code examples by adding and removing a dot from various tokens and see the difference in Grammar::Tracer's output.

  1. What does % mean?

This is a very useful convenience to describe patterns where something is interspersed between a bunch of other things. For example, we can have a namespace like foo.bar.baz.

token ns-name { <.ns-name-component>+ % '.' }

This means that ns-name is made up of at least one ns-name-component's (denoted by the +) and separated by a . (denoted by %).

Okay, so this should work I guess? Let's see what happens when we run the code!

No, that didn't work. As Grammar::Tracer so helpfully tells us, we did not account for the space after ns. In traditional compiler theory, there is a process of tokenisation, where some lightweight regexes are used to separate the program into its component lexemes and discard all the extraneous spaces. However, here we will not do that, and we'll deal with it in the grammar itself. Now, it could be argued whether that is a good decision, but that's another discussion. So, let's add some whitespace. I got myself into a position where I liberally used to sprinkle <.ws>* indicating zero or more whitespace characters in places where I felt that we should allow optional whitespace, as you would expect in a real-world program.

token simple-ns { <lparen> <ns-keyword> <.ws> <ns-name> <rparen> }

With that tiny addition, we are now able to parse the simple namespace form. You can play around with simple-ns.raku.

Let's make our lives a little more difficult

Okay, now we are getting the hang of this, so let's see a namespace form which is a little more realistic.

(ns my-amazing.module.core
  (:require [another-library.json.module :as json]
            [yet-another.http.library :as http]))

This is a totally realistic namespace form which we shall parse by adding support for the :require form where other libraries are imported and given a short nickname.

Can this be done? You bet!

grammar RealisticNs {
    token TOP { <realistic-ns> }

    token realistic-ns { <lparen>
                           <ns-keyword> <.ws> <ns-name> <.ws>
                           <require-form>
                         <rparen> }

    token ns-keyword { 'ns' }

    token ns-name { <.ns-name-component>+ % '.' }
    token ns-name-component { ( <.alnum> | '-' )+ }

    token require-form { <lparen>
                           <require-keyword> <ws>? <ns-imports>
                         <rparen> }

    token require-keyword { ':require' }

    token ns-imports { <ns-import>+ % <.ws> }

    token ns-import { <lsquare>
                        <ns-name> <.ws> ':as' <.ws> <ns-nickname>
                      <rsquare> }

    token ns-nickname { <.alnum>+ }

    token lsquare { '[' }
    token rsquare { ']' }
    
    token lparen { '(' }
    token rparen { ')' }
}

Nothing too scary as yet. We can see how the grammar evolves. At the top level, in realistic-ns we have added an extra token called <require-form> and we flesh out the details later. We can manage complexity in this manner, so that we have the ability to zoom in and out of the details as necessary.

Using the parsed data

Now that we have been able to parse the data, we need to make use of what we've parsed. This is where Actions come into the picture.

When we do RealisticNs.parse(...) that returns a Match object corresponding to the RealisticNs grammar. While we can query that object to get the pieces of data that we require, it is less cumbersome to use Actions to build up the data we are actually interested in.

Given, a namespace, we want to extract out:

  1. Namespace name
  2. Imported namespaces
  3. Imported namespace nicknames

Using these three pieces of data, given a list of files (that are simple enough to be parsed by our grammar 😜), we can extract the information we need.

The simple principle to follow is that for the token we are interested in, we create an Action method with the same name. When the Match object is being built up, the token Action methods run when the token is matched. Grammars are parsed top-down, but data is built up by the Action methods in a bottom-up fashion.

class RealisticNsActions {
    has $!ns-name;
    has $!imported-namespaces = SetHash.new;
    has $!ns-nicknames = SetHash.new;

    method TOP($/) {
        make {
            ns-name => $!ns-name,
            ns-imports => $!imported-namespaces,
            ns-nicknames => $!ns-nicknames
        }
    }
    
    method ns-name($/) {
        $!ns-name = $/.Str;
    }

    method imported-ns-name($/) {
        $!imported-namespaces{$/.Str}++;
    }

    method ns-nickname($/) {
        $!ns-nicknames{$/.Str}++;
    }
}

Here, we have created a RealisticNsActions class and created methods where we want to do something with the data associated with it. We don't need to touch the grammar definition at all (which keeps it clean). The only extra thing we need to do, is while parsing, we need to pass the Actions object like this, which instructs Raku to run the token Action methods when it sees those tokens.

sub MAIN() {
    my $s = RealisticNs.parse(slurp("realistic.clj"), actions => RealisticNsActions.new);
    say $s.made;
}

In an Actions class, the TOP method can be used to generate the final payload that we can access by calling the made method. For more information on make and made the official Raku Grammar tutorial makes it clear. In short, arbitrary payloads created using make can be accessed by using made.

When we run this program, this is what we see:

{ns-imports => SetHash(another-library.json.module
                       yet-another.http.library),
 ns-name => my-amazing.module.core,
 ns-nicknames => SetHash(http json)}

As expected, we can see the parsed data is created in a nice HashMap, but wouldn't it be nice to know that yet-another.http.library's nickname in the namespace is http? This is where we run into a design issue in the Actions class we have just written. We are building payloads at a lower level, than what we should.

We need more structure to be able to get the namespace -> namespace-nickname mapping that we want. A quick glance at the grammar tells us that we can find it at the ns-import level, because it's subtokens are imported-ns-name and ns-nickname, and those two are the pieces of data that we want.

Let's write an Action method for ns-import!

class RealisticNsActions {
    has $!ns-name;
    has $!imported-namespaces = SetHash.new;
    has %!ns-nicknames;

    method TOP($/) {
        make {
            ns-name => $!ns-name,
            ns-imports => $!imported-namespaces,
            ns-nicknames => %!ns-nicknames
        }
    }

    method ns-name($/) {
        $!ns-name = $/.Str;
    }

    method ns-import($match) {
        #say $match;
        
        my $imported-ns-name = $match<imported-ns-name>.Str;
        my $ns-nickname = $match<ns-nickname>.Str;

        $!imported-namespaces{$imported-ns-name}++;
        %!ns-nicknames{$imported-ns-name} = $ns-nickname;
    }
}

which results in the output:

{ns-imports => SetHash(another-library.json.module 
                       yet-another.http.library),
 ns-name => my-amazing.module.core,
 ns-nicknames => {another-library.json.module => json,
                  yet-another.http.library => http}}

Now we have all the information we require. You can play around with realistic-ns-with-actions.raku.

Match objects in Action classes

The result of a successful parse is a Match object. This contains the entire heirarchical parsed structure.

Whatever we did in the ns-import method, we could have done at a higher level, but the Match object at that level would need more querying. This is because the method will receive its "view" into the full Match object, i.e. TOP will have the entire Match object, and the ns-import will have a more restricted view using which we could easily extract out imported-ns-name and ns-nickname. This might not make sense immediately, but after dealing with Match for some time, you'll see how it makes sense to extract out useful information at the lowest possible level which allows for easier querying. At the TOP level, to extract out ns-nickname we would have had to query realistic-ns -> require-form -> ns-imports -> ns-import -> ns-nickname which is cumbersome to say the least, and because there are multiple of these, there will be an array of ns-import.

To see this in action, in each Action method add a say $match or say $/ as appropriate to see what the structure at that level.

Development Style

Raku does not as yet have a full-featured REPL environment that a Lisp programmer may be used to. That is just something that needs to be worked around.

The Raku REPL could be used to quickly test short one-line snippets of code, mostly as the simplest of sanity checks, but it gets unwieldy with anything more complex than that.

To get around this, I used a TDD approach of sorts, where I would take a real-world Clojure project, and run the (rapidly evolving) grammar on all the Clojure files. With every "correct" change the number of parsed files increases, and with every "wrong" change, we would not be able to get as far.

Next steps

With what we've tackled so far, it is not much of a stretch to parse real world Clojure namespace declarations as well. For example, we have added support for importing Clojure namespaces using the :require form. Similarly, we could add support for the :import form using which we import Java libraries. The same iterative approach can be used to parse increasingly complicated code. In the zip file attached, I have added a real-world grammar using which I have been able to parse hundreds of Clojure files. You may notice that I have to handle lots of syntactic variation, and optional whitespace, but I believe we have extracted the crux of the implementation into the easily understood grammar that we have discussed in detail in this article.

The final Clojure NS Grammar for reference. Using this grammar to generate the dependency graphs is a story left for another day.

Caveat Emptor (and a few big ones):

There is a chance that when a Grammar is improperly specified (and you will run into this a few times during experimentation), that the program just hangs. You can think of this as what is happening as something with pathological backtracking. This is not something only Raku grammars, or grammars in general, can suffer from. A badly written regex can have the same effect too, so one must be aware of this contingency before putting a regex/grammar in the hot path of a highly-critical application. There are high-profile post mortems discussing how a single regex brought down web applications to their knees.

While dealing with regexes the common advice is to avoid using .* (any number of any character), and instead be more restrictive in what that matches with, and that advice holds true for grammars too. Be as restrictive as possible, and relax certain things when it is unavoidable. In the final code, I was able to parse hundreds of real Clojure ns-declarations, but during the process of evolving the grammar every once in a while, I did run into this behaviour a few times, and that was remedied by tweaking the grammar as described. It is done intuitively, and being able to reliably fix it, is a dark art that most (including yours truly) do not confidently possess.

The other thing to be wary of in a production system is, that even if a grammar is well-specified, could it be possible for a malicious user to carefully craft an input that triggers this pathological behaviour? Given enough motivation, anything is possible, so all bets are off. 🤞

Grammars are more powerful than regular expressions (https://en.wikipedia.org/wiki/Chomsky_hierarchy). The traditional warning of not using regexes to parse things it isn't powerful enough to parse does not apply here, assuming the grammar is well-written, but the idea of a well-written grammar is as elusive as a bug-free program. I don't mean the grammar in the original design of the language, but the grammar you are writing to parse something. Without properly dealing with corner-cases, and adequately handling errors, you are going to end up with a brittle and frustratingly difficult to debug program.

Conclusion

Now that we have reached the end, I hope this is another article floating out on the internet that will help fledgling Rakoons (can Rakoons fly? 🤔) understand how Raku grammars can be put to real world tasks and can be a powerful tool, when the time comes.

References

End Credits

  • All the Raku designers and contributors over the years.
  • Jonathan Worthington for creating Grammar::Tracer.
  • All the friendly Rakoons (@JJ) who helped with reviewing this article.
@Xliff
Copy link

Xliff commented Nov 29, 2020

Excellent write up! Thank you for contributing

@samebchase
Copy link
Author

Hey thank you @Xliff!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment