Skip to content

Instantly share code, notes, and snippets.

@joperezr
Last active February 9, 2022 21:06
Show Gist options
  • Save joperezr/16a1f1e9b3c252016276a8c8bda4e45a to your computer and use it in GitHub Desktop.
Save joperezr/16a1f1e9b3c252016276a8c8bda4e45a to your computer and use it in GitHub Desktop.
Small Background about Regex

Brief Background on the inners of Regex

Contrary to the obvious, Regex type itself doesn't know how to match a pattern with a given input. But it does have a factory which aids in producing objects which do know how to do this.

namespace System.Text.RegularExpressions
{
  public class Regex
  {
    protected internal RegexRunnerFactory factory;
  }
}

This factory produces RegexRunners on demand which do in fact know how to perform the Matching operations. RegexRunnerFactory and RegexRunner types are both abstract, and we have concrete implementations today which peform the matching logic differently from one another. We call these concrete implementations the regex engines. There are currently 4 different engines we currently have, which are:

  • Interpreted engine (default)
  • Compiled engine (which uses refemit)
  • Source generated engine (new in .NET 7)
  • NonBacktracking engine (new in .NET 7)

All of these engines have concrete implementations of RegexRunner which will eventually perform the matching. This is how RegexRunner perform matches today:

namespace System.Text.RegularExpressions
{
  public abstract class RegexRunner
  {
     /// In charge of finding the next possible start position of a match
     protected abstract bool FindFirstChar();
     
     /// In charge of determining if the selected starting position is a match
     protected abstract void Go();
    
     /// Main entry point that gets called by Regex when trying to perform a match
     protected internal Match Scan(string input, ...)
     {
       //.. some initialization code
       
       while (true)
       {
          if (FindFirstChar()) // find the next starting position of a potential match
          {
            Go(); // check if that position was actually a match
            
            if (matchWasSuccessful)
            {
              return match; 
            }
          }
          
          if (weHaveReachedTheEndofTheInput)
          {
             return NoMatch; 
          }
         
          positionInTheInput++;
       }
     }
  }
}

Problems with the existing design

  • FindFirstChar and Go don't take in the input as a parameter and instead save it to a field and work over that field. This means that we can't easily add span APIs since spans can't be stored inside a reference type.
  • The scan loop logic of calling FindFirstChar and Go are based on a 20-year old design that has some flaws and doesn't necessarily make sense to all engines.

Proposal

namespace System.Text.RegularExpressions
{
  public class RegexRunner
  {
    // Add new Scan entry point which works over the passed in span
+    protected virtual void Scan(ReadOnlySpan<char> text);

    // Change abstract members into virtuals so that new concrete implementations only need to implement Scan
-    protected abstract bool FindFirstChar();
+    protected virtual bool FindFirstChar() => throw new NotImplementedException();
-    protected abstract void Go();
+    protected virtual void Go() => throw new NotImplementedException();
-    protected abstract void InitTrackCount();
+    protected virtual void InitTrackCount() => return 0;
  }
  
  public class Regex
  {
    // Add IsMatch span overloads
+   public bool IsMatch(ReadOnlySpan<char> input);
+   public static bool IsMatch(ReadOnlySpan<char> input, string pattern);
+   public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options);
+   public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);
  }
}

Prototype

dotnet/runtime#62509

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment