Non-greedy catch-all in Tcls NFA seems to have quadratic time complexity (O(n**2)
),
see the comparision (greedy vs. non-greedy) for 1st and 2nd sub-REs (.*)
and (.*?)
below.
Explanation: a growth of n
(length of matched part, e. g. from 500 to 5000 chars) by a factor of 10
increases the time of evaluation by a factor of 100 (452µs vs 40598µs).
Note that the whole RE is anchored from both sides, so the greedyness even doesn't really matter here (the result of match will be the same in any case).
The 3rd RE is added for the illustration, that NFA seems not even consider the next RE (\s+
which shall definitely match)
in case of slow non-greedines, unless added explicitely, like in (.*?(?=\s))
, what then immedially switches to linear complexity O(n)
.
% foreach r {{(.*)} {(.*?)} {(.*?(?=\s))}} {
set re [string map [list \$\$r $r] {^([A-Z]+)\s+$$r\s+([A-Z]+/(?:\d+(?:\.\d+)))$}];
puts -nonewline [format %-12s $r];
foreach i {1 10 100 1000} {
set v "GET /abc[string repeat "+test" $i] HTTP/1.1";
puts -nonewline \t[expr {[string length $v]-17}]\t[lrange [timerate {regexp $re $v _ m u p} 500] 0 1];
flush stdout
}; puts ""
}
(.*) 5 10.3967 µs/# 50 13.2083 µs/# 500 40.9453 µs/# 5000 304.007 µs/#
(.*?) 5 9.55624 µs/# 50 17.9964 µs/# 500 452.075 µs/# 5000 40598.8 µs/#
(.*?(?=\s)) 5 21.0441 µs/# 50 51.8737 µs/# 500 356.371 µs/# 5000 3419.50 µs/#
Time to switch to PCRE or PCRE2 either...
Here is a small comparision to above example (used sebres/tcl#5 or [1860727] with PCRE v.8.45):
% ... regexp -type pcre $re $v _ m u p ...
(.*) 5 0.611236 µs/# 50 0.698305 µs/# 500 1.124000 µs/# 5000 3.770706 µs/#
(.*?) 5 0.653597 µs/# 50 0.951130 µs/# 500 4.092930 µs/# 5000 33.3906 µs/#
(.*?(?=\s)) 5 0.614973 µs/# 50 0.985540 µs/# 500 4.192480 µs/# 5000 34.7717 µs/#
By the way, also MariaDB sucessfully switched its regex-engine to PCRE in v.10.0.5 (IIRC)