This is the follow-on article about writing an expression parser in Raku. In the previous article, I explained the background looked at some basic performance comparisons relating to data structures for parsing and ways to process them: lists, parse trees, recursive descent and iteration.
In this article, we'll have a look at the performance of various ways of processing strings, and then see how it all fits together in the expression parser.
How should we parse the expression string in Raku? The traditional way to build an expression parser is using a Finite State Machine, consuming one character at a time (if needed with one or more characters look-ahead) and keeping track of the identified portion of the string. This is very fast in a language such as C but in Raku I was not too sure, because in Raku a character is actually a string of length one, so every test against a character is a string comparison. On the other hand, Raku has a sophisticated regular expression engine. Yet another way is to turn the string into an array, and parse using list operations. Many possibilities to be tested:
constant NITERS = 100_000;
my $str='This means we need a stack per type of operation and run until the end of the expression';
my @chrs = $str.comb;
if (CASE==0) { # 5.8 s
for 1 .. NITERS -> $ct {
# map on an array of characters
my @words=();
my $word='';
map(-> \c {
if (c ne ' ') {
$word ~= c;
} else {
push @words, $word;
$word='';
}
}, @chrs);
push @words, $word;
}
} elsif CASE==1 { # 2.7 s
for 1 .. NITERS -> $ct {
# while with index through a string
my @words=();
my $str='This means we need a stack per type of operation and run until the end of the expression';
while my $idx=$str.index( ' ' ) {
push @words, $str.substr(0,$idx);
$str .= substr($idx+1);
}
push @words, $str;
}
} elsif CASE==2 { # 11.7 s
for 1 .. NITERS -> $ct {
# while on an array of characters
my @words=();
my @chrs_ = @chrs;
my $word='';
while @chrs_ {
my $chr = shift @chrs_;
if ($chr ne ' ') {
$word~=$chr;
} else {
push @words, $word;
$word='';
}
}
push @words, $word;
}
} elsif CASE==3 { # 101 s
for 1 .. NITERS -> $ct {
# while on a string using a regexp
my @words=();
my $str='This means we need a stack per type of operation and run until the end of the expression';
while $str.Bool {
$str ~~ s/^$<w> = [ \w+ ]//;
if ($<w>.Bool) {
push @words, $<w>.Str;
}
else {
$str ~~ s/^\s+//;
}
}
}
} elsif CASE==4 { # 64 s
for 1 .. NITERS -> $ct {
# reduce on an array of characters
my \res = reduce(
-> \acc, \c {
if (c ne ' ') {
acc[0],acc[1] ~ c;
} else {
( |acc[0], acc[1] ),'';
}
}, ((),''), |@chrs);
my @words = |res[0],res[1];
}
For the list-based version, the overhead is 1.6 s; for the string-based versions, 0.8s.
The results are rather striking. Clearly the regexp version is by far the slowest. This was a surprise because in my Perl implementation, the regexp version was twice as fast as next best choice. From the other implementations, the string-based FSM which uses the index
and substr
methods is by far the fastest, without the overhead it takes 1.9s s, which is more that 50 times faster than the regexp version. The map
based version comes second but is nearly twice as slow. What is surprising, and actually a bit disappointing, is that the reduce
based version, which works the same as the map
based one but works on immutable data, is also very slow, 64 s.
In any case, the choice is clear. It is possible to make the fastest version marginally faster (1.6 s instead of 1.9 s) by not reducing the string but instead moving the index through the string. However, for the full parser I want to have the convenience of the trim-leading
and starts-with
methods, so I choose to consume the string.
With the choices of string parsing and data structure (nested arrays with integer identifiers, see the first article) made, let's focus on the structure of the overall algorithm. Without going into details on the theory, we use a Finite State Machine, so the basic approach is to loop through a number of states and in every state perform a specific action. In the Perl version this was simple because we use regular expressions to identify tokens, so most of the state transitions are implicit. I wanted to keep this structure so I emulate the regexp s///
operation with comparisons, indexing and substring operations.
my $prev_lev=0;
my $lev=0;
my @ast=();
my $op;
my $state=0;
while (length($str)>0) {
# Match unary prefix operations
# Match terms
# Add prefix operations if matched
# Match binary operators
# Append to the AST
}
The matching rules and operations are very simple (I use <pattern>
and <integer>
as placeholders for the actual values). Here is the Perl version for reference:
- prefix operations:
if ( $str=~s/^<pattern>// ) { $state=<integer>; }
- terms:
if ( $str=~s/^(<pattern>)// ) { $expr_ast=[<integer>,$1]; }
- operators:
$prev_lev=$lev;
if ( $str=~s/^<pattern>// ) { $lev=<integer>; $op=<integer>; }
In the Raku version I used the given
/when
construct, which is as fast as an if
statement but a bit neater.
- prefix operations:
given $str {
when .starts-with(<token>) {
.=substr(<length of token>);
$state<integer>; }
- terms:
given $str
when .starts-with(<token start>) {
$expr_ast=[<integer>,$term]; }
- operators:
given $str {
when .starts-with(<token>) {
.=substr(<length of token>);
$lev=<integer>;
$op=<integer>;
}
One of the more complex patterns to match is the case of an identifier followed by an opening parenthesis with optional whitespace. Using regular expressions this pattern would be:
if $str ~~ s:i/^ $<token> = [ [a .. z] \w*] \s* \( // {
my $var=$<token>.Str;
...
}
Without regular expressions, we first check for a character between 'a' and 'z' using 'a' le .substr(0,1).lc le 'z'
. If that matches, we remove it from $str
and add it to $var
. Then we go in a while
loop for as long as there are characters that are alphanumeric or '_'. Then we strip any whitespace and test for '('.
when 'a' le (my $var = .substr(0,1)).lc le 'z' {
my $idx=1;
my $c = .substr($idx,1);
while 'a' le $c.lc le 'z' or $c eq '_'
or '0' le $c le '9' {
$var~=$c;
$c = .substr(++$idx,1);
}
.=substr($idx);
.=trim-leading;
if .starts-with('(') {
...
}
}
Another complex pattern is that for a floating point number. In Fortran, the pattern is more complicated because the sub-pattern .e
can be part of a floating-point constant but could also be the part of the equality operator .eq.
. Furthermore, the separator between the mantissa and the exponent can be not just e
but also d
or q
. So the regular expression is rather involved:
if (
(
!($str~~rx:i/^\d+\.eq/) and
$str~~s:i/^([\d*\.\d*][[e|d|q][\-|\+]?\d+]?)//
)
or
$str~~s:i/^(\d*[e|d|q][\-|\+]?\d+)//
) {
$real_const_str=$/.Str;
}
Without regular expression, the implementation is as follows. We first detect a character between 0 and 9 or a dot. Then we try to match the mantissa, separator, sign and exponent. The latter three are optional; if they are not present and the mantissa does not contain a dot, we have matched an integer.
when '0' le .substr(0,1) le '9' or .substr(0,1) eq '.' {
my $sep='';
my $sgn='';
my $exp='';
my $real_const_str='';
# first char of mantissa
my $mant = .substr(0,1);
# try and match more chars of mantissa
my $idx=1;
$h = .substr($idx,1);
while '0' le $h le '9' or $h eq '.' {
$mant ~=$h;
$h = .substr(++$idx,1);
}
$str .= substr($idx);
# reject .eq.
if not ($mant.ends-with('.') and .starts-with('eq',:i)) {
if $h.lc eq 'e' | 'd' | 'q' {
# we found a valid separator
$sep = $h;
my $idx=1;
$h =.substr(1,1);
# now check if there is a sign
if $h eq '-' or $h eq '+' {
++$idx;
$sgn = $h;
$h =.substr($idx,1);
}
# now check if there is an exponent
while '0' le $h le '9' {
++$idx;
$exp~=$h;
$h =.substr($idx,1);
}
$str .= substr($idx);
if $exp ne '' {
$real_const_str="$mant$sep$sgn$exp";
$expr_ast=[30,$real_const_str];
} else {
# parse error
}
} elsif index($mant,'.').Bool {
# a mantissa-only real number
$real_const_str=$mant;
$expr_ast=[30,$real_const_str];
}
else { # no dot and no sep, so an integer
$expr_ast=[29,$mant];
}
} else { # .eq., backtrack and carry on
$str ="$mant$str";
proceed;
}
}
A final example of how to handle patterns is the case of whitespace in comparison and logical operators. Fortran has operators of the form <dot word dot>
, for example .lt.
and .xor.
. But annoyingly, it allows whitespace between the dot and the word, e.g. . not .
. Using regular expressions, this is of course easy to handle, for example:
if $str~~s/^\.\s*ge\s*\.//) {
$lev=6;
$op=20;
}
I check for a pattern starting with a dot and which contains a space before the next dot. Then I remove all spaces from that substring using trans
and replace this original string with this trimmed version.
when .starts-with('.') and .index( ' ' )
and (.index( ' ' ) < (my $eidx = .index('.',2 ))) {
# Find the keyword with spaces
my $match = .substr(0, $eidx+1);
# remove the spaces
$match .= trans( ' ' => '' );
# update the string
$str = $match ~ .substr( $eidx+1);
proceed;
}
Overall the optimised expression parser in Raku is still very close to the Perl version. The key difference is that the Raku version does not use regular expressions. With the above examples I wanted to illustrate how it is possible to write code with the same functionality as a regular expression s///
operation, using some of Raku's built-in string operations:
substr
: substringindex
: location a a substring in a stringtrim-leading
: strip leading whitespacestarts-with
ends-with
trans
: used to remove whitespace using the' ' => ''
patternlc
: used in range tests instead of testing against both upper and lower casele
,lt
,ge
,gt
: for very handy range comparisons, e.g.'a' le $str le 'z'
The resulting code is of course much longer but arguably more readable than regular expressions, and currently four times faster.
All code for the tests is available in my GitHub repo.