changes:
- whitespace is no longer meaningful, and can therefore be used for formatting
  - this means whitespace must be escaped, using existing constructs like `\n`, `\t`, and a new escape for a single space, `\ ` (exact recipe open for discussion)
- (capture) group constructs are totally rearranged, to allow for easier non-capturing grouping and reduction of "symbol soup" of current regex patterns
  - non-capturing group is assigned the bare `(` so that the easiest-to-type grouping construct does not capture and pollute the capture result array
    - Motivation: using `(?:` just to be able to `|` a few options looks nasty
  - lookahead and lookbehind are modified to remove inconsistencies that exist for legacy, backward-compatibility reasons
    - `(>=` = positive lookahead
    - `(>!` = negative lookahead
    - `(<=` = positive lookbehind
    - `(<!` = negative lookbehind
  - capture groups are changed to be more explicit (accomplished by no longer being the default group construct) and to add consistency between named and numbered capture groups
    - `(?#` = standard capture group, added to capture array (`#` is literal)
    - `(?name` = named capture group (name is separated from captured expression by whitespace)
- add support for comments - a line which (ignoring whitespace) starts with `//` is entirely ignored by the parser, to allow programmers to document complex regexes
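
One way to experiment with the changes listed above is to translate the proposed syntax into an existing dialect. The following is a minimal sketch, assuming Python's `re` module as the target. The names `translate`, `_LOOKAROUND`, and `_NAMED` are hypothetical and not part of the proposal. It also assumes that group names consist of word characters and that character classes get no special treatment, since the proposal does not say how whitespace or `(` inside `[...]` should behave.

```python
import re

# Proposed lookaround openers mapped to today's spelling (all three chars long).
_LOOKAROUND = {
    "(>=": "(?=",   # positive lookahead
    "(>!": "(?!",   # negative lookahead
    "(<=": "(?<=",  # positive lookbehind
    "(<!": "(?<!",  # negative lookbehind
}

# Named capture opener: "(?name" followed by one whitespace character.
_NAMED = re.compile(r"\(\?(\w+)\s")


def translate(source: str) -> str:
    """Translate the proposed syntax into a pattern for Python's `re`.

    Assumption: character classes are not special-cased in this sketch.
    """
    # Comment lines: first non-whitespace characters are "//".
    kept = [ln for ln in source.splitlines() if not ln.lstrip().startswith("//")]
    text = "\n".join(kept)

    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Escapes pass through unchanged, including the proposed "\ ".
            out.append(text[i:i + 2])
            i += 2
        elif ch.isspace():
            i += 1  # unescaped whitespace is formatting only
        elif text.startswith("(?#", i):
            out.append("(")  # numbered capture group
            i += 3
        elif text.startswith("(?", i):
            m = _NAMED.match(text, i)
            if m is None:
                raise ValueError(f"expected a group name at offset {i}")
            out.append(f"(?P<{m.group(1)}>")  # named capture group
            i = m.end()
        elif text[i:i + 3] in _LOOKAROUND:
            out.append(_LOOKAROUND[text[i:i + 3]])
            i += 3
        elif ch == "(":
            out.append("(?:")  # bare "(" no longer captures
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for proposed in (r"h ell o \ worl d", "(?named1 named-capture)", "a (>= b)"):
        print(repr(proposed), "->", repr(translate(proposed)))
```

Because the output is an ordinary pattern string, it can be handed straight to `re.compile`, which makes it easy to try the proposed syntax against real inputs before committing to any of the details.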
// this comment is ignored, making this an empty regex
// whitespace is ignored, so the following matches "hello world" exactly
h ell o \ worl d
// the following groupings do not pollute the capture array
// (yes, I'm aware this isn't a proper url matcher)
(http|https|ftp)://(\w+\.)*example\.com(/\w+)*
// capture groups are now opt-in instead of opt-out
(?#numbered-capture)
(?named1 named-capture)
(?named2
this illustrates how whitespace can make for more readable regex
)
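
To make the "pollute the capture array" motivation concrete, here is a small sketch of how the URL example behaves in today's syntax, using Python's `re` module as a stand-in for any engine where bare parentheses capture. The variable names and the sample URL are illustrative assumptions only.

```python
import re

url = "https://www.example.com/a/b"

# Today's syntax: bare parentheses capture, so groups that exist only for
# alternation or repetition still add entries to the result.
capturing = re.compile(r"(http|https|ftp)://(\w+\.)*example\.com(/\w+)*")

# The same matcher with every group opted out of capturing via "(?:".
non_capturing = re.compile(
    r"(?:http|https|ftp)://(?:\w+\.)*example\.com(?:/\w+)*"
)

polluted = capturing.fullmatch(url).groups()
clean = non_capturing.fullmatch(url).groups()

print(polluted)  # ('https', 'www.', '/b')
print(clean)     # ()
```

Under the proposal, the first, shorter spelling would give the second, cleaner result.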
- Requiring an escape for a single space might not be worth it. It certainly makes matching a single space less ergonomic. Even in existing regexes, one can easily argue that tabs and newlines are best represented (most readable) as their respective escapes, so forcing their use feels like a positive side effect of ignoring whitespace. But for single spaces, the escape feels pretty unfortunate: it adds noise, contributes to symbol soup, and is weird to type. Usage feedback is probably needed to tell whether the ability to use spaces for formatting is useful enough to justify the loss of ergonomics when matching single spaces.
- Is changing the syntax of all the group constructs worth the confusion? It's highly probable that it will trip up people who are used to the existing syntax at first. Are the improved ergonomics and consistency worth it?
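
On the single-space question, some usage feedback can be gathered from engines that already offer an opt-in whitespace-insensitive mode. As a rough sketch, Python's `re.VERBOSE` flag ignores unescaped whitespace and accepts `\ ` for a literal space, so the "hello world" example can be tried as written. Note that this mode differs from the proposal in other respects (for instance, its comments start with `#`, not `//`), so it is only an approximation.

```python
import re

# Unescaped whitespace is ignored in verbose mode; "\ " matches one space.
escaped = re.compile(r"h ell o \ worl d", re.VERBOSE)

# Alternative spelling: whitespace inside a character class is kept in
# verbose mode, so "[ ]" also matches one space.
bracketed = re.compile(r"h ell o [ ] worl d", re.VERBOSE)

for pattern in (escaped, bracketed):
    assert pattern.fullmatch("hello world") is not None
    assert pattern.fullmatch("helloworld") is None
```

Comparing how `\ ` and `[ ]` read in real patterns may help decide which "recipe" for a literal space is least noisy.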