Contrary to the obvious, Regex type itself doesn't know how to match a pattern with a given input. But it does have a factory which aids in producing objects which do know how to do this.
namespace System.Text.RegularExpressions
{
public class Regex
{
protected internal RegexRunnerFactory factory;
}
}
This factory produces RegexRunner
s on demand which do in fact know how to perform the Matching operations. RegexRunnerFactory
and RegexRunner
types are both abstract, and we have concrete implementations today which peform the matching logic differently from one another. We call these concrete implementations the regex engines.
There are currently 4 different engines we currently have, which are:
- Interpreted engine (default)
- Compiled engine (which uses refemit)
- Source generated engine (new in .NET 7)
- NonBacktracking engine (new in .NET 7)
All of these engines have concrete implementations of RegexRunner
which will eventually perform the matching. This is how RegexRunner
perform matches today:
namespace System.Text.RegularExpressions
{
public abstract class RegexRunner
{
/// In charge of finding the next possible start position of a match
protected abstract bool FindFirstChar();
/// In charge of determining if the selected starting position is a match
protected abstract void Go();
/// Main entry point that gets called by Regex when trying to perform a match
protected internal Match Scan(string input, ...)
{
//.. some initialization code
while (true)
{
if (FindFirstChar()) // find the next starting position of a potential match
{
Go(); // check if that position was actually a match
if (matchWasSuccessful)
{
return match;
}
}
if (weHaveReachedTheEndofTheInput)
{
return NoMatch;
}
positionInTheInput++;
}
}
}
}
- FindFirstChar and Go don't take in the input as a parameter and instead save it to a field and work over that field. This means that we can't easily add span APIs since spans can't be stored inside a reference type.
- The scan loop logic of calling FindFirstChar and Go are based on a 20-year old design that has some flaws and doesn't necessarily make sense to all engines.
namespace System.Text.RegularExpressions
{
public class RegexRunner
{
// Add new Scan entry point which works over the passed in span
+ protected virtual void Scan(ReadOnlySpan<char> text);
// Change abstract members into virtuals so that new concrete implementations only need to implement Scan
- protected abstract bool FindFirstChar();
+ protected virtual bool FindFirstChar() => throw new NotImplementedException();
- protected abstract void Go();
+ protected virtual void Go() => throw new NotImplementedException();
- protected abstract void InitTrackCount();
+ protected virtual void InitTrackCount() => return 0;
}
public class Regex
{
// Add IsMatch span overloads
+ public bool IsMatch(ReadOnlySpan<char> input);
+ public static bool IsMatch(ReadOnlySpan<char> input, string pattern);
+ public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options);
+ public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);
}
}