Skip to content

Instantly share code, notes, and snippets.

@ajs
Last active November 1, 2016 20:42
Show Gist options
  • Save ajs/72fecbbe8e714ce209ec8da171e74186 to your computer and use it in GitHub Desktop.
Save ajs/72fecbbe8e714ce209ec8da171e74186 to your computer and use it in GitHub Desktop.
Thoughts on a PCRE-like generic version of Perl 6 regexes

#Toward a Specification for Perl 6-like Regular Expressions

##Notation

"..." will be used throughout this document to indicate some arbitrary content, which should be explained subsequently.

Otherwise, very little is left to the reader to imagine other than when we get to language-specific concepts.

##Unicode

Perl 6 regex matching is Unicode-only by default. The rules for matching Unicode are simple: where we say "character" below, we actually mean composed, normalized grapheme. In Perl 6, it's actually much more aggressive. Input strings are converted to NFG, a Perl 6 form which asserts that all graphemes will be represented by a single codepoint, forcing the creation of synthetic codepoints as needed. Other languages will probably not be this aggressive, though it certainly does make a regex implementation saner...

##Introduction

Perl 6 overhauled the notion of regular expressions. In order to apply this to the development of generic tools for other languages the way Perl-like regular expressions proliferated to other languages, both in individual language implementations and in the PCRE package, we need to isolate what it is that Perl 6 has and what we want to implement elsewhere.

At a high level, there are three main types of features:

  • Generic text matching
  • Recursive parsing
  • Embedded code

The first is what something like PCRE already covers. There are some changes in Perl 6 regexes, but most of those changes simply substitute one syntax for another. A few truly new features exist, even here, though, and we'll need to include those in any specification of Perl 6-like regexes.

The second is a definite layer on top of the base, and probably an optional (modular) component to be specified separately. This is the allowance for sub-rules which match with the same backtracking semantics as any other atom.

The third is where we need to isolate Perl 6 language features from the regex specification itself. Everything from the way variables are interpolated to the way code blocks form closures needs to be isolated and genericized.

Generic Text Matching and Embedded Code

At its heart, a Perl 6 regex is still a regex. It is a sequence of atoms with optional repetition quantifiers, grouping and alternation, etc. These are all well-known features. Here's the basic subset required of any implementation:

###Atoms

  • Alpha, digit or underscore - Literal
  • '...' or "..." - Literal string
  • Whitespace - Ignored (see significant whitespace modifier)
  • \ - Escapes following character as literal unless it is alpha in which case, see escapes
  • (...) - A captured group
  • [...] - A non-captured group
  • <...> - Variable. Dependent on character(s) after less-than
  • . - Any character
  • {...} - Encloses a code block (it does not take part in matching and may be backtracked over)

A note on code blocks - Because a general implementation may not know about any given client language, the meaning of "code block" might be ambiguous. For that reason, a code block is simply defined as any text between {...} with enclosed braces being balanced e.g. the Python example, {print({"so far": match.prematch, "next": match.postmatch})}. In "compiled" languages (e.g. those that read and parse these expressions at a point where the language's compiler is no longer available), these features may simply not be available.

###Escapes

  • \d and \D - single Unicode digit (property "N"). \D negates.
  • \h and \H - horizontal whitespace (such as space or tab). \H negates.
  • \n and \N - logical newline, including system-specific line-break patterns. \N negates.
  • \s and \S - Whitespace of any sort. \S negates.
  • \t and \T - Tab character only. \T negates.
  • \v and \V - Vertical whitespace (such as vertical tab or form feed). \V negates.
  • \w and \W - A word character (Unicode category "L", property "N" or underscore). \W negates.
  • \x[...] and \X[...] - Match the given character by Unicode codepoint (in hexadecimal) e.g. \x[ffff]. \X[...] negates.

###Zero-width atoms

  • ^ and $ - Beginning and end of text
  • ^^ and $$ - Beginning and end of line (with "line" being defined per \n semantics)
  • << and >> - Left and right word boundary (see <ws>)

###Quantifiers

###Quantifiers come after any atom except for an already quantified atom.

  • ? - Zero or one times
  • + - One or more times
  • * - Zero or more times
  • ** n..m - From n to m times, inclusive.

###Modifiers on quantifiers

  • ? - Non-greedy matching
  • % and %% - Separated matching per following atom (e.g. \d +% ',' matches comma-separated digits). The %% form allows a trailing separator.
  • : - Do not backtrack over the preceding atom. If backtracking would occur, fail.

###Alternations and joiners

  • | - Longest-match alternation of left and right hand side
  • & - Both left and right hand alternation match, but may be evaluated in any order
  • || - First (left-to-right) match alternation
  • && - Left and right hand alternation match, but must be evaluated in the given order
  • ~ - Left and right hand atom are delimiters for the following atom (e.g. for bracketing)

###<...> constructs

The less-than/greater-than bracketing construct is used to begin may sorts of directives in Perl 6 regexes. Each of them is distinguished by how it starts, and we will cover the ones here which are relevant to generic regex matching.

###Character classes

Character classes may be joined inside the <...> bracketing by + or - to indicate union or difference of character sets, e.g. <[A..Z]-[D]>

  • <+...> or <[...]> - Additive character classes
  • <:...> - Unicode property identifier character class. <:!...> negates.
  • <-...> - Negated character classes

In a bracketed character class, .. indicates a range from the character preceding to the character following, by Unicode codepoint.

###Assertions

  • <?before ...> or <!before ... > - Assert that pattern follows the current position, but do not match it. <!before ...> negates.
  • <?after ...> or <!after ...> - Assert that the pattern preceded the current position. <!after ...> negates.
  • <?[...]> or <![...]> - Assert that the given character class follows the current position, but do not match it. <![...]> negates.

###Misc

  • <* ...> - Partial match. Match zero; the first; the first and second; and so on of the atoms enclosed.
  • < words... > - If the first character is whitespace, the match is an alternation of the words inside, e.g. < apple peach pear > is equivalent to [ 'apple' | 'peach' | 'pear' ]
  • <?{...}> or <!{...}> - Code assertion. The enclosed code must return a boolean value to determine whether to continue (true) or abort the match (false). <!{...}> negates.
  • <{...}> - The expression will result in a regex which is to be interpolated at this point in the enclosing regex.

###Builtin "subrules"

Even in an implementation that does not include subrules, the following builtins would be considered part of the base regex implementation:

<ws> or <.ws> or <!ws> - Whitespace matching zero-or more whitespace characters that unambiguously indicate a word boundary (e.g. foo.bar baz would match <ws> three times, once zero-width before the "." once zero-width after and once matching the space). <.ws> is non-capturing and <!ws> negates.

###Named matching and variables

Though not all languages use prefix "sigils" to identify variables, Perl 6-like regular expressions may use named values following dollar signs to indicate parameterized or captured subexpressions.

$... - The named value is interpolated as literal text to be matched. $<...> = - The following atom (with possible quantifier) is captured by the named sub-match e.g. $<word> = \w+ <$...> - The named value is interpolated as regex to be matched.

If the language cannot access surrounding lexical scopes from within a regex, then input values should be passed to the match operation in some way as named parameters.

###Modifiers

A modifier may be provided at the time of the call (language-specific) or it may be provided inline. When provided inline, it applies only to the most deeply enclosed group in which it occurs, and only from the point where it appears, on. So a [b :i c] d would match lower-case a, followed by lower-case b, any case c and lower-case d.

  • :i or :ignorecase - Do all matching case-insensitively.
  • :r or :ratchet - Do not perform backtracking.
  • :s or :sigspace - All unescaped, unquoted whitespace will be interpreted as if it were <.ws>

Perl 6 has other modifiers, but they are really only meant to be used as parameterization to the matching call, which is beyond the scope of this document.

##Parsing and Subrules

[NOTE: Work in progress. Everything below this point will probably change substantially as implementations take shape]

A subrule is a named regex that is referenced from another regex (or itself!) which is akin to a function-call. The only difference is that subrules must support backtracking semantics, and that's a pretty huge difference!

A subrule is specified in some language-specific way. Generally, a "rule" is a regex with :sigspace turned on and a "token" is a regex with :ratchet turned on. Matching is of the form:

<subrule_name>

So, if there were a rule defined called "variable" and another called "expression" then the following rule would match an assignment:

<variable> '=' <value>

While this might match a list assignment:

<variable> '=' '(' ~ ')' [ <value> +% ',' ]

###Parameterization

A subrule may be parameterized with a named or numbered capture, but the syntax is not yet established for this.

Subrules can be used just like any other token.

###Grammars

A grammar is a namespace or logical grouping of subrules which contains at least one entrypoint called "TOP". A grammar can "derive" from another grammar, which means that it will have access to the parent grammar's subrules, but may override them, just like the methods of an object oriented class.

###Actions

A grammar provides some means of invoking a "parse" on an input string, and that means will include an optional "actions" parameter. The actions are native-language functions (usually in the form of methods of an actions class) which expect the current match state as a parameter, and are named the same as the subrules they relate to. An action should return (if that is the appropriate semantic for the given language, note that Perl 6 uses a special semantic behavior called "make") some data structure which represents the state of the match. It may retrieve the state of a sub-match using a language-specific mechanism called on its state parameter. For example, in Perl 6 for the above variable/value example:

class Actions {
    method variable($/) {
        make [ ~$<variable>, $<value>.made ];
    }
    method value($/) {
        make ...
    }
}

Note that a string representation of the variable name is included, but then the "value" is expected to have created some more complex datastrcuture, which is included.

##Philosophy: Why is this better?

In general, there's not a lot about Perl 6-style regexes that's strictly superior in capabilities to Perl 5-style regexes for basic text matching. As said before there are some additional features and many things have been renamed.

But the most important features in terms of comparison to other systems are:

  • A much more readable format, focused on increasing whitespace and reducing the visual noise in commonly used features (e.g. reducing (?: ...) to [...])
  • Subrules for parsing
  • Some additional features such as partial matching and first-match vs. longest-match alternation
  • Opportunity for much improved language integration.
  • More coherent Unicode support

##TODO

  • Convert this into some sort of proper markup
  • Start looking at implementations to see what works and what doesn't
  • Debug examples, perhaps create a functional test suite for future implementations
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment