@disco0
Forked from EthanRutherford/readme.md
Created October 9, 2020 04:04
Regex but better

RegexButBetter

changes:

  • whitespace is no longer meaningful, and can therefore be used for formatting
    • this means literal whitespace must be written as an escape: the existing \n and \t, plus a new escape for a single space, "\ " (a backslash followed by a space; exact syntax open for discussion)
  • (capture) group constructs are totally rearranged, to make non-capturing grouping easier and to reduce the "symbol soup" of current regex patterns
    • the non-capturing group is assigned the bare ( so that the easiest-to-type grouping construct does not capture, and therefore does not pollute the capture result array
      Motivation: writing (?: just to be able to | a few options looks nasty
    • lookahead and lookbehind are modified to remove inconsistencies that exist for legacy, backward-compatibility reasons
      • (>= = positive lookahead
      • (>! = negative lookahead
      • (<= = positive lookbehind
      • (<! = negative lookbehind
    • capture groups are changed to be more explicit (accomplished by no longer being the default group construct) and to add consistency between named and numbered capture groups
      • (?# = standard capture group, added to capture array (# is literal)
      • (?name = named capture group (name is separated from captured expression by whitespace)
  • add support for comments: a line which (ignoring leading whitespace) starts with // is entirely ignored by the parser, to allow programmers to document complex regexes
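
To make the rearranged group constructs concrete, here is a minimal sketch that pairs each proposed opener with the construct JavaScript uses today. The names groupOpeners and namedOpener are hypothetical and exist only for this illustration; the pairs follow the list above.

```javascript
// Hypothetical lookup table: proposed opener -> current JavaScript opener.
const groupOpeners = {
	"(":   "(?:",  // bare paren: non-capturing group
	"(>=": "(?=",  // positive lookahead
	"(>!": "(?!",  // negative lookahead
	"(<=": "(?<=", // positive lookbehind
	"(<!": "(?<!", // negative lookbehind
	"(?#": "(",    // numbered capture
};

// Named captures carry a name, so they are built rather than looked up.
function namedOpener(name) {
	return `(?<${name}>`;
}

// Proposed: (?# \d+ ) (>= px )   ->   JavaScript: (\d+)(?=px)
const js = new RegExp(
	groupOpeners["(?#"] + "\\d+" + ")" + groupOpeners["(>="] + "px" + ")"
);
console.log(js.source);           // (\d+)(?=px)
console.log("12px".match(js)[1]); // 12
```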

Guiding examples

// this comment is ignored, making this an empty regex
// whitespace is ignored, so the following matches "hello world" exactly
		h ell o \ worl    d
// the following groupings do not pollute the capture array
// (yes, I'm aware this isn't a proper url matcher)
(http|https|ftp)://(\w+\.)*example\.com(/\w+)*
// capture groups are now opt-in instead of opt-out
(?#numbered-capture)
(?named1 named-capture)
(?named2
	this illustrates how whitespace can make for more readable regex
)
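
For comparison, this is roughly what the examples above would look like in today's JavaScript syntax. The translations are written by hand from the rules in this document, each example is treated as its own pattern, and the variable names are illustrative only.

```javascript
// "h ell o \ worl    d": formatting whitespace dropped, "\ " is a real space
const hello = /hello world/;

// bare groups become non-capturing groups; "/" is escaped here only because
// a JavaScript regex literal is delimited by slashes
const url = /(?:http|https|ftp):\/\/(?:\w+\.)*example\.com(?:\/\w+)*/;

// opt-in captures: (?#...) -> (...), (?name ...) -> (?<name>...)
const captures = /(numbered-capture)(?<named1>named-capture)/;

console.log(hello.test("hello world"));                       // true
console.log("https://www.example.com/a/b".match(url).length); // 1
console.log(
	"numbered-capturenamed-capture".match(captures).groups.named1
); // named-capture
```

The url match returns an array of length 1 because none of its groups capture, which is the "no pollution" property the proposal is after.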

Open questions

  • Requiring an escape for a single space might not be worth it. It certainly makes matching a single space less ergonomic. Even in existing regexes, one can easily argue that tabs and newlines are best represented (most readable) as their respective escapes, so forcing their use feels like a positive side effect of ignoring whitespace. For single spaces, though, the escape is unfortunate: it adds noise, contributes to symbol soup, and is awkward to type. Usage feedback is probably needed to tell whether the ability to use spaces for formatting is useful enough to justify the loss of ergonomics when matching single spaces.
  • Is changing the syntax of all the group constructs worth the confusion? It is highly probable that it will trip up people who are used to the existing syntax at first. Are the improved ergonomics and consistency worth it?
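
To get a feel for the single-space trade-off, here is a minimal sketch of the whitespace rule on its own. The helper name stripFormatting is hypothetical, and rewriting "\ " to a plain space is an assumption of this sketch rather than part of the proposal. It is a convenient choice because JavaScript rejects "\ " as an escape when the u flag is set.

```javascript
// Hypothetical helper: drop formatting whitespace, keep escapes.
function stripFormatting(source) {
	let out = "";
	for (let i = 0; i < source.length; i++) {
		const char = source[i];
		if (char === "\\" && i + 1 < source.length) {
			const next = source[++i];
			// "\ " becomes a plain space; every other escape is copied as-is
			out += next === " " ? " " : char + next;
		} else if (!/\s/.test(char)) {
			out += char;
		}
	}
	return out;
}

console.log(stripFormatting("h ell o \\ worl    d")); // hello world
console.log(stripFormatting("a\\tb \\ c"));            // a\tb c
```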
// proof of concept rudimentary transform method
// translates a regex source as described above to a valid javascript regex
function parseRegex(source) {
	const root = {children: []};
	let currentNode = root;
	let currentText = '';

	// flush any pending literal text into the current node
	function split() {
		if (currentText.length > 0) {
			currentNode.children.push(currentText);
			currentText = '';
		}
	}

	for (const line of source.split("\n").map(l => l.trim())) {
		if (line.startsWith("//")) {
			// line is a comment
			continue;
		}

		let index = 0;
		while (index < line.length) {
			// read from the current line, not from the whole source
			const char = line[index++];
			if (char === "\\") {
				// keep the escape and the character it applies to together
				currentText += char + line[index++];
				continue;
			} else if (char === "\t" || char === " ") {
				// unescaped whitespace is formatting only
				continue;
			} else if (char === "(") {
				split();
				const group = {
					parent: currentNode,
					children: [],
				};

				if (line.startsWith(">=", index)) {
					group.kind = "positive lookahead";
					index += 2;
				} else if (line.startsWith(">!", index)) {
					group.kind = "negative lookahead";
					index += 2;
				} else if (line.startsWith("<=", index)) {
					group.kind = "positive lookbehind";
					index += 2;
				} else if (line.startsWith("<!", index)) {
					group.kind = "negative lookbehind";
					index += 2;
				} else if (line.startsWith("?#", index)) {
					group.kind = "numbered capture";
					index += 2;
				} else if (line.startsWith("?", index)) {
					group.kind = "named capture";
					group.name = "";
					// the name runs up to the first space or the end of the line
					while (++index < line.length && line[index] !== ' ') {
						group.name += line[index];
					}
				} else {
					group.kind = "non-capture";
				}

				currentNode.children.push(group);
				currentNode = group;
			} else if (char === ")") {
				split();
				currentNode = currentNode.parent;
			} else {
				currentText += char;
			}
		}
	}

	// flush trailing text that is not followed by a group
	split();
	return root;
}

function translateCore(nodes) {
	let content = "";
	for (const node of nodes) {
		if (typeof node === "string") {
			content += node;
		} else if (node.kind === "positive lookahead") {
			content += `(?=${translateCore(node.children)})`;
		} else if (node.kind === "negative lookahead") {
			content += `(?!${translateCore(node.children)})`;
		} else if (node.kind === "positive lookbehind") {
			content += `(?<=${translateCore(node.children)})`;
		} else if (node.kind === "negative lookbehind") {
			content += `(?<!${translateCore(node.children)})`;
		} else if (node.kind === "numbered capture") {
			content += `(${translateCore(node.children)})`;
		} else if (node.kind === "named capture") {
			content += `(?<${node.name}>${translateCore(node.children)})`;
		} else if (node.kind === "non-capture") {
			content += `(?:${translateCore(node.children)})`;
		}
	}

	return content;
}

function translateRegex(source) {
	const tree = parseRegex(source);
	console.log(tree);
	return `/${translateCore(tree.children)}/g`;
}
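
translateRegex returns text in the shape /body/g rather than a RegExp object, and it does not escape any "/" inside the body, so the text is not always safe to paste as a literal. One way to use the result directly is sketched below. The helper toRegExp is hypothetical and not part of the proof of concept, and the input string is written by hand in the shape translateRegex returns.

```javascript
// Hypothetical helper: turn "/body/flags" text into a RegExp object.
function toRegExp(literal) {
	// the last slash separates the body from the flags, so unescaped
	// slashes inside the body are harmless
	const lastSlash = literal.lastIndexOf("/");
	const body = literal.slice(1, lastSlash);
	const flags = literal.slice(lastSlash + 1);
	return new RegExp(body, flags);
}

const regex = toRegExp("/(?:http|https|ftp)://example\\.com/g");
console.log(regex.flags);                     // g
console.log(regex.test("ftp://example.com")); // true
```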