Every operator must appear as the first character on a line (in column 0).
Operator | Description | Example |
---|---|---|
! |
Rewrite-rule | !(\w+)('re) → \1 \2 |
: |
Tokenization pattern | :[ \t]+ |
< |
File inclusion | <patterns.rpp |
# |
Iterative group definition | (see Iterative Groups) |
> |
Group call | >groupname |
@ |
Meta-information | @$Date: 2018-04-25 13:39:01 -0700$ |
; |
Comment | ;;; comments |
- Outside of a match
left right
| |
...|a...b|...
...|q...r|...
---|
shift
- Inside a match, outside a capture group
left right
| |
...|a...b(c...d)e...f|...
...|q...r(c...d)s...t|...
---|
shift
- Inside a match, inside a capture group
left right
| |
...|a...b(c...d)e...f|...
...|q...r(c...d)s...t|...
---------|
shift
When captured groups are output in a different order, their span indices are unioned. E.g., consider the following rule:
!(a)(b)(c) → → \1 \3 \2
When applied to the input xyz abc xyz
, it yields the following tokens
and spans:
<0:3> "xyz"
<4:5> "a"
<5:7> "c"
<5:7> "b"
<8:11> "xyz"
Note that the swapped b
and c
tokens share the same span. If
instead the first two were swapped (\2 \1 \3
), then all three would
get the same span (<4:7>
) because the characterization engine stops
tracking at the first out-of-order back-reference (i.e., it does not
resume tracking when things get "back on track"). Furthermore, unused
groups are also considered out of order (e.g., if the replacement
pattern were \2 \3
), because the first group \1
was not the first
back-reference. Lastly, if the match contained ungrouped text before
an out-of-order back-reference (e.g., if the match pattern were
!xyz (a)(b)(c)
), then that text's span is also part of the union.
-
Rules
!wo(n't) → will \1 :[ \t]+
-
Rewriting
position: 0 1 2 3 4 5 6 7 8 9 10 11 input: I ▁ w o n ' t ▁ g o smap: 1 0 0 0 0 0 0 0 0 0 0 0 emap: 0 0 0 0 0 0 0 0 0 0 0 -1 match: * * ( * ) position: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 step 1: I ▁ smap: 0 0 0 emap: 0 0 0 step 2: w i l l ▁ smap: 0 -1 -2 -3 -4 emap: 1 0 -1 -2 -3 step 3: n ' t smap: -3 -3 -3 emap: -3 -3 -3 step 4: ▁ g o smap: -3 -3 -3 -3 emap: -3 -3 -3 -3 -
Tokenization
token lnk calculation I <0:1> <(1-0-1):(1-0)> will <2:4> <(3-0-1):(6-2)> n't <4:7> <(8-3-1):(10-3)> go <8:10> <(12-3-1):(13-3)>
Package | Language | License | Creator |
---|---|---|---|
LKB | Lisp | MIT | Stephan Oepen |
PET | C++ | LGPL 2.1 | Rebecca Dridan |
ACE | C | LGPL 2.1 | Woodley Packard |
agree | C# | ? | Glenn Slayden |
PyDelphin | Python | MIT | Michael Wayne Goodman |
Feature | LKB | PET | ACE | agree | PyDelphin |
---|---|---|---|---|---|
Perl Compatible Regular Expressions | ✔ | ✔ | ✔ | ? | - |
Maximum numbered groups | 9 | 9 | 99 | ||
Named capture groups | |||||
Default tokenization pattern | [ \t]+ |
||||
Implicit module loading | |||||
Iterative application | |||||
Available as a command | |||||
Available as a library |
Implementation | Supported Formats |
---|---|
LKB | yy |
PET | string, line, offsets, triples |
ACE | string, yy |
agree | |
PyDelphin |
Limitation | LKB | PET | ACE | agree | PyDelphin |
---|---|---|---|---|---|
Input requires trailing delimiter | ✔ | ||||
Requires configuration file | ✔ |
Other bugs:
- PET
- last input line must be terminated with a newline
- reordered groups output in wrong order with
--format offsets
Test | LKB | PET | ACE | agree | PyDelphin |
---|---|---|---|---|---|
PTB (msec) | |||||
PTB (diffs) |
Only use capture groups when the exact, full, captured string is needed in the replacement. If groups are needed for other purposes (e.g., repetition, optionality), use non-capturing groups.