Regular expressions are great. Every language supports them; everyone knows (generally) how to use them. They can solve a lot of ad-hoc parsing problems, and pretend to solve many others if you don't look too hard.
But they have their downsides. They're hilariously compact, even with `/x`, making them very hard to read and maintain once they grow beyond the trivial. They have unintuitive backtracking behavior, leading to surprising matches and performance gotchas on most implementations. They don't handle context-sensitive grammars very well. And very few regex engines make it easy to debug a regex that doesn't match where you think it should.
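For instance, here's the classic nested-quantifier pathology in Python's `re` module. The pattern can never match, but a backtracking engine has to try exponentially many ways of carving up the run of `a`s before it can say so:

```python
import re
import time

# (a+)+$ is a classic pathological pattern: the nested quantifiers give the
# backtracking engine exponentially many ways to split the run of a's.
pattern = re.compile(r'(a+)+$')

for n in (18, 21, 24):
    text = 'a' * n + 'b'   # the trailing 'b' guarantees the match fails
    start = time.perf_counter()
    pattern.match(text)
    elapsed = time.perf_counter() - start
    print(n, round(elapsed, 3))   # time roughly doubles with each extra 'a'
```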
And yet most programmers will still turn to a regex long before a parser generator, because parser generators tend to be entire new systems with multiple moving parts and even more obtuse restrictions.
Perl 6 sought to improve on this state of affairs dramatically with "rules", a regex-inspired system for describing grammars built right into the language. This never came to fruition, and some of the ideas are outlandish (much like the rest of Perl 6...), but the goal is still a noble one.
Thus I would like to design a parsing system that's so easy to use for ad-hoc problems that it "scales down" to the same problem space as regexes. I'm tentatively calling the language and system pegex.
- Use PEG, with the left recursion elimination used in OMeta. Ideally pegex will be able to match any "reasonable" grammar without contortions, and PEG is the best algorithm we have for that. PEG also automatically provides several other nice properties:
    - No separate lexing step.
    - Linear runtime. (Assuming packrat. There are still some superlinear edge cases with packrat + LR, but they tend to be very contrived. I have seen a paper somewhere with an elegant way to parse even those edge cases in linear time, but I can't seem to find it again. A minimal sketch of the memoization idea follows this list.)
- Must have its own DSL, of course.
- No "actions", i.e., no host code embedded directly in the grammar. Pegex will have the best chance at success if it's not tied to any particular language. It will still need its own syntax to be remotely as accessible as regular expressions, though the initial implementation will use a combinator library.
- The syntax should lean in favor of looking as much like expected input as possible; for example, unadorned letters in regular expressions are taken as literals.
- It should be easy to produce a parse tree with minimal additions to the grammar.
- Errors should be clear to both the developer and the end user, and perhaps customizable.
- Whitespace should be easy to deal with, but not blanket ignored.
- Grammars and rules should be composable in a manner similar to Perl 5 regular expressions.
- The library should be performant enough to use for large input strings, with minimal exponential-time edge cases.
- Partial matching should be plausible someday, e.g. building a partial AST out of the context of a diff by guessing what must have come before.
- When possible, matching should continue after an error, to avoid the fix-and-retry cycle. (But don't spew a zillion errors, either.)
- Grammars should be composable and extendable.
- Be clever! Take all kinds of shortcuts whenever possible.
- Allow injecting native code without having to write it directly into the syntax.
- Must be able to construct an AST (or edit one) and serialize it!
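Here's the memoization sketch promised above, assuming a hand-rolled recursive descent core where each rule is a function from a position to a result. Memoizing on (rule, position) is what makes the whole parse linear; none of these names are pegex's actual API, and the OMeta-style left recursion handling would sit on top of this table and isn't shown.

```python
# Minimal packrat memoization sketch.  Each rule takes (text, pos) and
# returns (value, new_pos) on success or None on failure.
def packrat(rule):
    memo = {}   # a real implementation would scope this to a single parse
    def memoized(text, pos):
        key = (id(text), pos)
        if key not in memo:
            memo[key] = rule(text, pos)
        return memo[key]
    return memoized

@packrat
def number(text, pos):
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    return (text[pos:end], end) if end > pos else None

print(number("123foo", 0))   # ('123', 3): each (rule, position) pair is
print(number("123foo", 0))   # computed once, then answered from the table
```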
Some example problems it should handle gracefully:

- Binary infix operators with varying precedence and associativity.
- Comma-delimited lists, with optional trailing commas.
- HTML.
- Scanning a document for things that look like URLs.
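The infix-operator case is worth a sketch, since it's the one regexes are worst at. The usual PEG idiom is to encode each precedence level as its own rule; this hand-rolled Python version (not pegex's eventual syntax) builds a left-associative tree directly:

```python
import re

# One rule per precedence level, lower-precedence rules deferring to
# tighter-binding ones:
#   expr <- term (('+' / '-') term)*
#   term <- number (('*' / '/') number)*
# Assumes well-formed input; no error handling.

def parse_expr(s, pos=0):
    left, pos = parse_term(s, pos)
    while pos < len(s) and s[pos] in '+-':
        op = s[pos]
        right, pos = parse_term(s, pos + 1)
        left = (op, left, right)   # fold leftward: a-b-c => (a-b)-c
    return left, pos

def parse_term(s, pos):
    left, pos = parse_number(s, pos)
    while pos < len(s) and s[pos] in '*/':
        op = s[pos]
        right, pos = parse_number(s, pos + 1)
        left = (op, left, right)
    return left, pos

def parse_number(s, pos):
    m = re.match(r'\d+', s[pos:])
    return int(m.group()), pos + m.end()

tree, _ = parse_expr('1+2*3-4')
print(tree)   # ('-', ('+', 1, ('*', 2, 3)), 4)
```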
Perl 6's rules system has separate blocks for `regex` versus `token`. A `token` uses the same syntax, but disables backtracking; thus, extracting a "number" token from `123foo` will only ever produce `123`, never `12`. In other words, tokens are always maximally greedy, which seems to match how humans intuitively think of tokens.
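The contrast with a backtracking regex is easy to demonstrate in Python: a "number" can silently shrink to satisfy whatever follows it.

```python
import re

# \d+ first grabs all of "123", then backtracks to "12" so the literal 3 can
# match.  A Perl 6 token would refuse to give the character back, and fail.
m = re.match(r'(\d+)3', '123')
print(m.group(1))   # '12'
```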
Perl 6 allows for clustering multiple rules together in a `grammar`, which is really just a class. A `grammar` is its own namespace, and so rules can refer to each other by simple names without worrying about cluttering the global namespace. This rings similar to OMeta's approach, which allows for single inheritance of grammars.
If this system is to have its own syntax, then something needs to implement that syntax, and it might as well be the parser-generator library I'm writing. To that end, for the purposes of bootstrapping, it'll need a combinatorial approach to start with, akin to `pyparsing`'s interface.
Having this is convenient anyway: the combinators can later double as AST nodes, and the new syntax can be described with them directly to avoid bootstrapping problems. This also ensures that rules and grammars can be composed programmatically.
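A rough sketch of what that combinator core might look like, loosely in the spirit of pyparsing; all the names here are made up for illustration, not pegex's actual interface:

```python
# Each combinator exposes parse(text, pos) -> (value, new_pos) or None.
class Literal:
    def __init__(self, text):
        self.text = text
    def parse(self, s, pos):
        if s.startswith(self.text, pos):
            return (self.text, pos + len(self.text))
        return None

class Sequence:
    def __init__(self, *parts):
        self.parts = parts
    def parse(self, s, pos):
        values = []
        for part in self.parts:
            result = part.parse(s, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return (values, pos)

class Choice:
    def __init__(self, *alts):
        self.alts = alts
    def parse(self, s, pos):
        for alt in self.alts:   # ordered choice, PEG-style
            result = alt.parse(s, pos)
            if result is not None:
                return result
        return None

greeting = Sequence(Choice(Literal("hello"), Literal("hi")), Literal(" world"))
print(greeting.parse("hello world", 0))   # (['hello', ' world'], 11)
```

Since each grammar built this way is just a tree of plain objects, it's trivially composable and inspectable, which is exactly what describing the new syntax in terms of the combinators requires.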
Within the grammar syntax itself, whitespace and comments are always ignored.
Features that may exist, drawn from current regular expressions, Perl 6 rules, PEGs, and zsh path globbing:
- Literal character
- Character class
- Optional
- Zero or more
- One or more
- Specific count
- Disjunction
- Conjunction
- Atomic conjunction
- Rule application
- Input start/end
- Line start/end
- Word boundary
- Delimiters
- Variable interpolation
- Literal disjunction
TODO syntax :)
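Until that syntax exists, the familiar regex spellings pin down what most of these features mean; this is just a reference mapping, with PEG-specific forms noted in comments:

```python
# Rough regex analogues for the features above; pegex's own spellings are
# still TODO, so these are only familiar reference points.
ANALOGUES = {
    'literal character':  r'a',
    'character class':    r'[a-z]',
    'optional':           r'a?',
    'zero or more':       r'a*',
    'one or more':        r'a+',
    'specific count':     r'a{2,5}',
    'disjunction':        r'a|b',      # PEG ordered choice: a / b
    'conjunction':        r'(?=a)b',   # lookahead; PEG: &a b
    'atomic conjunction': r'(?>ab)',   # atomic group (PCRE; Python 3.11+)
    'input start/end':    r'\A ... \Z',
    'line start/end':     r'^ ... $',
    'word boundary':      r'\b',
}
# Rule application, delimiters, variable interpolation, and literal
# disjunction have no classic regex spelling; they come from Perl 6 rules
# and zsh globbing.
```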
Taking some quick notes; these still need to be filled in and cleaned up properly:
- Status: combinators, sort of.
- DO NOT want actions, but maybe allow them with an OO syntax using decorators so they stay "in the language". (See the sketch at the end of these notes.)
- DO want particular handling for the sticky parser cases: optional whitespace, quoted strings, infix operators, and delimited lists.
- Should support both PEG operations and regex operations: full parse, partial parse, find all, split on. (A "find all" sketch is at the end of these notes.)
    - Relatedly, I sometimes find myself wanting to do a regex "find all but don't skip any chars", so maybe that needs a mention (or it could just be written as another rule).
- Things to think about:
    - Capturing... regexes produce pretty simple output, not a damn AST!
- Why are regexes popular?
    - Simple stuff built in: symbols for whitespace, digits, letters, word boundaries.
    - Learn once, use everywhere.
- Some thoughts on syntax:
    - Would be nice if bare letters were literal matches like regex, NOT rule applications?
    - Don't use backslash!
    - Everything except letters (and underscore and hyphen) should be reserved, so there's never any question about what needs quoting.
- Crazy ambitions that involve lots of examining the grammar:
    - Automatic commit points (where I can discard the packrat state and fail fast).
    - "Allow whitespace where not ambiguous"?
    - Detect ambiguity?
    - Parsing a document fragment by guessing which rule starts it. (This needs a catchy name.)
    - Semi-automatic syntax highlighting.
    - Automatic pretty printing (again, by examining the grammar).
    - Lazy streaming parsing, both input and output, like lxml. (Requires commit points, obviously.)
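As referenced above, a hypothetical sketch of regex-style "find all" on top of a PEG rule: try the rule at every position, emit matches, and slide forward a character on failure. The function-based rule shape matches the earlier packrat sketch.

```python
def literal(expected):
    # Tiny stand-in rule so the sketch is self-contained.
    def rule(text, pos):
        if text.startswith(expected, pos):
            return (expected, pos + len(expected))
        return None
    return rule

def findall(rule, text):
    pos, matches = 0, []
    while pos < len(text):
        result = rule(text, pos)
        if result is not None and result[1] > pos:   # skip zero-width matches
            value, end = result
            matches.append(value)
            pos = end        # resume after the match, regex-style
        else:
            pos += 1         # no match here; slide forward one character
    return matches

print(findall(literal("ab"), "xxabyyab"))   # ['ab', 'ab']
```

And the decorator idea for actions, equally hypothetical: the grammar stays pure, and a host-language class hangs behavior off rule names after the fact.

```python
# Invented names: a registry maps rule names to functions that transform
# the matched text, so the grammar itself never contains host code.
class Actions:
    registry = {}

    @classmethod
    def action(cls, rule_name):
        def register(fn):
            cls.registry[rule_name] = fn
            return fn
        return register

class CalcActions(Actions):
    @Actions.action('number')
    def make_number(text):
        return int(text)

print(Actions.registry['number']('123'))   # 123
```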