@rauschma
Last active August 28, 2024 22:43

Stateless RegExp methods

The issues

  • The rules for the flags /g and /y are complicated.
  • Regular expression operations keep state in RegExp objects, via property .lastIndex. That leads to several pitfalls if a RegExp object is used multiple times. It also makes it impossible to freeze instances of RegExp.
  • String.prototype.replace() accepts a callback where accessing named captures is inconvenient (via an object that is passed as the last argument).
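The statefulness pitfall can be reproduced with today's API. The following is a minimal sketch; the variable names are only illustrative:

```javascript
// A shared /g regex remembers where its last match ended.
const hasDigit = /\d/g;

const first = hasDigit.test("a1");            // true; .lastIndex is now 2
const indexAfterFirst = hasDigit.lastIndex;   // 2
const second = hasDigit.test("a1");           // false: the search starts at index 2
const indexAfterSecond = hasDigit.lastIndex;  // 0: reset after the failed match

// Freezing makes .lastIndex read-only, so the stateful update throws.
const frozen = Object.freeze(/\d/g);
let freezeError = null;
try {
  frozen.test("a1"); // the match succeeds, then writing .lastIndex fails
} catch (err) {
  freezeError = err; // a TypeError
}
```

The same input gives two different answers because the second call resumes from the position stored by the first.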

The goal

  • Improving the RegExp API without introducing a new constructor.

New methods

Common characteristics:

  • Options object:
    • .startIndex = 0
    • .returnIndices = false
  • Don’t change the RegExp object in any way.
    • Completely ignore .lastIndex.
  • Completely ignore the following flags:
    • /g (.global): not needed because each method is either global or non-global
    • /y (.sticky): replaced by the assertion \G
    • /d (.hasIndices): replaced by option .returnIndices
      • Not as important but I’d prefer an option for a method over a RegExp flag because this toggle is more about how an operation works than about how a RegExp matches.

RegExp.prototype.*

  • .execOnce(str, options?): MatchObject
  • .execMany(str, options?): Iterable<MatchObject>
  • .testOnce(str, options?): boolean
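To make the intended behavior more concrete, here is a rough approximation written as standalone helper functions. Everything in it is an assumption for illustration: cloneForSearch is a made-up helper, the functions take the RegExp as their first argument instead of living on RegExp.prototype, and the assertion \G is not emulated.

```javascript
// Hypothetical helpers approximating the proposed methods (not a spec).

// Copy `re` with /g forced on (so .lastIndex can carry the start position),
// /y dropped, and /d controlled by the option instead of the flag.
function cloneForSearch(re, returnIndices) {
  const flags = re.flags.replace(/[gyd]/g, "") + "g" + (returnIndices ? "d" : "");
  return new RegExp(re.source, flags);
}

function execOnce(re, str, { startIndex = 0, returnIndices = false } = {}) {
  const clone = cloneForSearch(re, returnIndices);
  clone.lastIndex = startIndex;
  return clone.exec(str); // match object or null; `re` itself is never modified
}

function* execMany(re, str, { startIndex = 0, returnIndices = false } = {}) {
  const clone = cloneForSearch(re, returnIndices);
  clone.lastIndex = startIndex; // .matchAll() starts at the clone's .lastIndex
  yield* str.matchAll(clone);
}

function testOnce(re, str, options) {
  return execOnce(re, str, options) !== null;
}

// Usage: flags /g and /y and the property .lastIndex are all ignored.
const word = /[a-z]+/gy;
word.lastIndex = 3;

const later = execOnce(word, "12 ab cd", { startIndex: 5 });
const allWords = [...execMany(word, "12 ab cd")].map((m) => m[0]);
const withIndices = execOnce(word, "12 ab cd", { returnIndices: true });
const noWord = testOnce(word, "12");
```

In this sketch, later[0] is "cd" (found at index 6), allWords is ["ab", "cd"], withIndices.indices[0] is [3, 5], and word.lastIndex is still 3 afterwards.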

String.prototype.*

Better callback type signature: callback(matchObject)

  • .replaceOnce(stringOrRegExp, stringOrCallback, options?): string
  • .replaceMany(stringOrRegExp, stringOrCallback, options?): string
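The difference between the two callback signatures can be sketched as follows. The function replaceMany below is a hypothetical stand-in, not the proposed method: it only handles the combination of a RegExp and a callback, and it takes the string as its first argument.

```javascript
const date = /(?<year>\d{4})-(?<month>\d{2})/g;

// Today: named captures arrive as the last argument, after a variable
// number of positional captures, the offset, and the input string.
const legacy = "2024-07".replace(date, (...args) => {
  const groups = args[args.length - 1];
  return `${groups.month}/${groups.year}`;
});

// Hypothetical stand-in for .replaceMany(): the callback receives a single
// match object, the same shape that .exec() returns.
function replaceMany(str, re, callback) {
  const flags = re.flags.replace(/[gy]/g, "") + "g";
  const clone = new RegExp(re.source, flags); // leaves `re` untouched
  let result = "";
  let pos = 0;
  for (const match of str.matchAll(clone)) {
    result += str.slice(pos, match.index) + callback(match);
    pos = match.index + match[0].length;
  }
  return result + str.slice(pos);
}

const proposed = replaceMany("2024-07", date, (match) =>
  `${match.groups.month}/${match.groups.year}`);
```

Both calls produce "07/2024", but the second callback does not need to know how many positional captures precede the groups object.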

Open questions:

  • Should these methods forward to Symbol-keyed properties of stringOrRegExp?
    • Another option: turn them into RegExp.prototype.* methods.

“Legacy” RegExp operations that are already stateless

  • String.prototype.search(patternStringOrRegExp): number
  • String.prototype.split(verbatimStringOrRegExp?, limit?): Array<string>
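A small check of that statelessness with the current API (the variable names are only illustrative):

```javascript
const comma = /,/g;
comma.lastIndex = 5; // would change the result of .exec() and .test()

const position = "a,b,c".search(comma); // 1: the search always starts at index 0
const parts = "a,b,c".split(comma);     // ["a", "b", "c"]
const lastIndexAfter = comma.lastIndex; // still 5
```

Neither method is influenced by .lastIndex, and the value is unchanged once each call returns.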

Open question:

  • Would it make sense to support an additional argument options?

“Legacy” RegExp operations that would become deprecated

  • RegExp.prototype.exec
  • RegExp.prototype.test
  • String.prototype.match
  • String.prototype.matchAll
  • String.prototype.replace
  • String.prototype.replaceAll

Assertion \G

  • Matches at the current matching position (0 or .startIndex).
  • Loosely related to the ^ assertion

How should legacy methods handle \G?

  • It clashes with flag /y because with that flag, a regular expression implicitly starts with \G.
    • Thus: throwing an exception when \G is used with /y seems the best option.
  • Other than that, we could specify a “current position” for all legacy methods (sometimes .lastIndex, sometimes 0) and use that with \G.
  • /y is ignored by .split(), so supporting \G would be an improvement.
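For reference, this is how /y behaves today, including the .split() case. \G itself cannot be demonstrated because JavaScript does not support it yet.

```javascript
const digits = /\d+/y;

digits.lastIndex = 2;
const atTwo = digits.exec("ab12cd");  // matches "12": it starts exactly at .lastIndex

digits.lastIndex = 0;
const atZero = digits.exec("ab12cd"); // null: /y never scans ahead to index 2

// .split() ignores /y: separators are still found anywhere in the string.
const pieces = "ab12cd".split(/\d+/y); // ["ab", "cd"]
```

With /y, the anchoring is implicit and depends on .lastIndex. An explicit \G would put the same constraint into the pattern, where methods such as .split() could honor it.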

Important: Is this proposal compatible with upcoming RegExp features?

Potential upcoming proposal: template tag for regular expressions

  • Currently, there is no plan to support multi-line RegExp literals in JavaScript. A template tag is a good alternative and would be very useful for the proposed flag /x.
  • Two TC39 members have expressed an interest in adding a template tag for RegExps to JavaScript (source).
  • A template tag could look like this: https://github.com/slevithan/regex

FAQ

Why the name suffix Many?

  • All is already taken.
  • It’s just a first idea – suggestions welcome!
    • Other options: Multi, OnceOrMore

Why not a single method with (e.g.) an option .many?

  • I find non-overloaded methods easier to understand (they are also easier to statically type): .execOnce and .execMany have different return types.
  • I also like to avoid single big methods that do too much.
  • Precedents for method pairs in the current API:
    • .replace() and .replaceAll()
    • .match() and .matchAll()
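The existing pair .match() and .matchAll() also shows why flag-dependent return types are hard to follow. A short illustration with the current API:

```javascript
const pair = /a(\d)/;
const pairGlobal = /a(\d)/g;

// Without /g: one match object, including captures and .index.
const one = "a1 a2".match(pair);

// With /g: an Array of matched strings; captures are dropped.
const all = "a1 a2".match(pairGlobal);

// .matchAll() yields one match object per match...
const each = [..."a1 a2".matchAll(pairGlobal)];

// ...but throws if the regex lacks /g.
let matchAllError = null;
try {
  "a1 a2".matchAll(pair);
} catch (err) {
  matchAllError = err; // a TypeError
}
```

Here one[1] is "1", all is ["a1", "a2"], and each holds two match objects whose first captures are "1" and "2". Separate Once and Many methods would give each name a single return type, independent of flags.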

Acknowledgement

@slevithan

slevithan commented Jul 11, 2024

(Edit: The following feedback was based on an earlier version with significant differences. See also the earlier related discussion here.)

This is great! Lots of good ideas that work well together and provide a cleaner, easier to use, and less surprising API with fewer footguns.

Bikeshedding about naming aside, one concern is how much this bundles into one proposal. Some things have to be bundled, but it can be split into two independent proposals without taking anything away:

  1. New template tag that provides a new regex happy path with always-on best practices and safe, context-aware interpolation.
  2. New syntax \G, plus new RegExp and String methods that improve API signatures and completely move away from lastIndex and flags /dgy (which modify how various methods apply regexes and the shape of their results), as opposed to flags that modify the meaning of regexes (/ims, etc.).

Also, to decouple even more, it can be explicitly stated that flags x and n are not added in the proposal, even though their behavior is always on in the template tag. Nothing stops separate proposals from adding x and n to regex literals and the RegExp constructor (before, after, or alongside the introduction of such a tag).

The flagship improvement for #2 is of course moving away from the statefulness of regexes, which has long been a source of bugs and developer surprise (here's one example, and it would probably be good to collect more).

\G is a great feature. It's more flexible than /y and it's broadly supported in other regex flavors (.NET, Perl, PCRE, Java, Ruby, Boost.Regex, etc.). But since the way it's most commonly used overlaps with /y, it probably wouldn't make sense to add unless coupled with a proposal like this that also essentially deprecates /y. However, there is still the issue of how exactly it should work. A few options:

  • To avoid overlap with /y, every regex/string method not introduced in the proposal could throw if the regex they're provided uses \G. This is probably easiest, but it's arguably not the best for users and there is no precedent for it, apart from matchAll and replaceAll throwing based on flags.
  • Although \G shouldn't rely on lastIndex (since that would probably break the case for introducing it), it could track the match-end position on its target strings (in a property that might or might not be user-visible). This would in any case be an improvement on lastIndex, since particular regexes can of course be applied to more than one string.
  • Existing regex and string methods could be updated to track the last-match-end position internally to their behavior in a way that could be used by \G. This would be needed when using flag /g with string methods replace, replaceAll, match, and matchAll, but wouldn't be needed by search or the RegExp methods exec and test. It would additionally be needed by string split with or without flag /g, since the \G assertion should work like any other assertion (^, etc.). This would offer another improvement on /y since /y is ignored by split.

Open question: Is there a way to switch off flag /n?

It can be disabled via a modifier in the pattern: (?-n:...), and maybe (?-n)... in the future. But apart from that, if you allowed turning it off via some other option, then the same should probably be allowed for disabling x and v. I'd favor not offering an additional option to turn off any flags that are on by default when using the tag. Always-on n might be more controversial than always-on x and v, but it has multiple benefits:

  • It encourages using named captures.
    • Makes regexes more self-documenting.
    • Makes code that uses the results more readable.
    • Avoids errors when refactoring regexes.
  • Makes non-capturing groups more readable by avoiding the syntactic clumsiness of (?:...).
  • If you fully follow regex's behavior for /n, it avoids the footgun of referring to named captures by number.
    • Avoids errors when refactoring regexes.
    • Makes regexes more portable since some flavors don't allow it (Ruby, and C++ with nosubs) and in other flavors the numbering of named captures is inconsistent (e.g. in JS it's left to right for all captures, but in .NET it's unnamed captures first followed by named captures).

@rauschma
Author

@slevithan Great feedback, thanks! I updated the Gist.
