-
-
Save Davidebyzero/9090628 to your computer and use it in GitHub Desktop.
You've certainly been busy.
Congratulations on getting the factoring method for multiplication working; that was one devilish regex.
How does your engine do with this regression test?
Welcome back, teukon.
Congratulations on getting the factoring method for multiplication working; that was one devilish regex.
Thank you :)
How does your engine do with this regression test?
That test isn't all that useful to me. Virtually all the tests either test features I haven't implemented yet (which aren't part of ECMAScript), or tests bugs that Perl or PCRE had but my engine would never have.
What I really want to test is alternatives and the lazy and greedy backtracking of groups, and especially the interaction of all three.
I haven't implemented character classes yet, and \s\S\d\D\w\W
, but am about to.
Does the ECMAScript specification say whether ^
,$
,\b
,\B
can be followed by a quantifier? This would be useless, of course, but I've noticed Opera supports it, but Firefox and Chrome do not. All three support a quantifier on lookaheads, which is useless too, and I think goes against the specification (as it seems the ECMAScript specification calls lookaheads assertions).
They appear to break the specification in other ways, too. It says "An escape sequence of the form \ followed by a nonzero decimal number n matches the result of the nth set of capturing parentheses (see 15.10.2.11). It is an error if the regular expression has fewer than n capturing parentheses." But all three seem to make \n act as an octal escape in that case; for example, \141
matches a
.
That test isn't all that useful to me. Virtually all the tests either test features I haven't implemented yet (which aren't part of ECMAScript), or tests bugs that Perl or PCRE had but my engine would never have.
I see. I wasn't sure how far along you were but your statement that you'd found a bug in PCRE due to the accuracy of your engine in one particular area brought this list to mind.
Does the ECMAScript specification say whether
^
,$
,\b
,\B
can be followed by a quantifier?
Yes it does, and they can't. According to the spec., ^
, $
, \b
, and \B
are assertions. The constructor only allows quantifiers to follow atoms, not assertions.
I don't know why some browsers are allowing some quantified assertions.
With my Fibonacci regex, it takes 12.1 seconds. pcregrep takes 532 seconds (and I don't even want to time it with perl). Interesting — my regex runs much faster than yours under my engine, and your regex runs much faster than mine under PCRE.
That is interesting. Anyway, well done on getting your engine to be so fast with your Fibonacci regex. Is your engine able to handle very large Fibonacci numbers? If yes, is the engine still quick relative to Perl and PCRE?
They appear to break the specification in other ways, too. It says "An escape sequence of the form \ followed by a nonzero decimal number n matches the result of the nth set of capturing parentheses (see 15.10.2.11). It is an error if the regular expression has fewer than n capturing parentheses." But all three seem to make \n act as an octal escape in that case; for example,
\141
matchesa
.
Ugh! So ugly.
Well this is weird. In http://regexpal.com/ quantifiers are not allowed after assertions in Firefox, Chrome, or Opera. But in http://regex.alf.nu/ they're allowed after all assertions in Opera, and allowed after lookaheads in all three.
As for all the other bugs in Opera, and the octal escape ugliness in all three browsers, they happen in both sites.
There's another apparent break from the ECMAScript specification. It states that { and } are not valid PatternCharacters. However, in all three browsers, any {
or }
that is not in the precise form of a quantifier is treated as a literal. For example x{}
matches x{}
, x{,5}
matches x{,5}
, and the regexes x{1\}
, x\{1}
, and x{[1]}
all match the string x{1}
. (On the other hand, a quantifier that is in the precise form, but in the wrong context, is treated as an error.) The same happens in Perl and PCRE. I think this is a nice feature.
I think this is a nice feature.
We're just going to have to agree to disagree.
Found another bizarre PCRE bug:
$ echo 'abc def'|pcregrep -o '^.*?\b'
abc
$ echo 'abc def'|pcregrep -o '\babc'
abc
$ echo 'abc def'|pcregrep -o 'abc def\b'
abc def
$ echo 'aaa'|pcregrep -o '^.*?(?=a)'
a
$ echo 'aaa'|pcregrep -o '^.*?(?=aaa)'
$ echo 'aaa'|pcregrep -o '^.*(?=a)'
aa
$ echo 'aaa'|pcregrep -o '^.*?(^|$)'
aaa
$ echo 'aaa'|pcregrep -o '^.*?a'
a
Seems like a lazy search with a minimum count of 0 tries a count of 1 as the first possibility if the match following it is zero-length, only backtracking to a count of 0 for the match if it has to.
Perl does not have this bug:
$ echo 'abc def'|perl -E '@m = <> =~ /^.*?\b/g; print @m[0]'
$ echo 'abc def'|perl -E '@m = <> =~ /^.*\b/g; print @m[0]'
abc def
$ echo 'aaa'|perl -E '@m = <> =~ /^.*?(?=a)/g; print @m[0]'
$ echo 'aaa'|perl -E '@m = <> =~ /^.*(?=a)/g; print @m[0]'
aa
$ echo 'aaa'|perl -E '@m = <> =~ /^.*?(^|$)/g; print @m[0]'
$ echo 'aaa'|perl -E '@m = <> =~ /^.*?a/g; print @m[0]'
a
Weird. I'm surprised that there are so many bugs in common regex engines.
I implemented character classes :) and of course the first thing I tried was our robust Triples solution. It works perfectly.
I implemented character classes :) and of course the first thing I tried was our robust Triples solution. It works perfectly.
Brilliant.
Hi teukon!
Well I finally got it releasable and posted my regex engine on github. The name isn't final. There's still some polishing to be done (especially, adding parser error messages), but it is quite usable. Hopefully you'll be able to compile it without any trouble.
Great. I'll put this on my to-do list but I'm currently snowed under with work.
This gist has gotten very long, so I've started a new one to continue the discussion.
I templated the choice between comparing strings and comparing numbers, so now it's possible to choose between the two at runtime.
Now it's down to 48.5 seconds.
With my Fibonacci regex, it takes 12.1 seconds. pcregrep takes 532 seconds (and I don't even want to time it with perl). Interesting — my regex runs much faster than yours under my engine, and your regex runs much faster than mine under PCRE.
Edit: Interesting. The "square root via fourth root" algorithm is actually much slower under my engine than the plain minimal square root. It takes 7.2 seconds for the plain one to go up from sqrt(0)=0 to sqrt(14400)=120, but takes 26.5 seconds for the one that uses 4th root.