Regular Expression Preprocessing (REPP)

Specification

Modules

Operators

Every operator must appear as the first character on a line (in column 0).

Operator	Description	Example
`!`	Rewrite-rule	`!(\w+)('re) → \1 \2`
`:`	Tokenization pattern	`:[ \t]+`
`<`	File inclusion	`<patterns.rpp`
`#`	Iterative group definition	(see Iterative Groups)
`>`	Group call	`>groupname`
`@`	Meta-information	`@$Date: 2018-04-25 13:39:01 -0700$`
`;`	Comment	`;;; comments`

Iterative Groups

Characterization

Outside of a match

 left     right
    |     |
 ...|a...b|...

 ...|q...r|...
 ---|
shift

Inside a match, outside a capture group

 left     right
    |     |
 ...|a...b(c...d)e...f|...

 ...|q...r(c...d)s...t|...
 ---|
shift

Inside a match, inside a capture group

       left     right
          |     |
 ...|a...b(c...d)e...f|...

 ...|q...r(c...d)s...t|...
 ---------|
      shift

Reordered Groups

When captured groups are output in a different order, their span indices are unioned. E.g., consider the following rule:

!(a)(b)(c)  →   →   \1 \3 \2

When applied to the input xyz abc xyz, it yields the following tokens and spans:

<0:3> "xyz"
<4:5> "a"
<5:7> "c"
<5:7> "b"
<8:11> "xyz"

Note that the swapped b and c tokens share the same span. If instead the first two were swapped (\2 \1 \3), then all three would get the same span (<4:7>) because the characterization engine stops tracking at the first out-of-order back-reference (i.e., it does not resume tracking when things get "back on track"). Furthermore, unused groups are also considered out of order (e.g., if the replacement pattern were \2 \3), because the first group \1 was not the first back-reference. Lastly, if the match contained ungrouped text before an out-of-order back-reference (e.g., if the match pattern were !xyz (a)(b)(c)), then that text's span is also part of the union.

Examples

Rules
```
!wo(n't)    →   will \1
:[ \t]+
```

Rewriting

position:	0	1	2	3	4	5	6	7	8	9	10	11
input:		I	▁	w	o	n	'	t	▁	g	o
smap:	1	0	0	0	0	0	0	0	0	0	0	0
emap:	0	0	0	0	0	0	0	0	0	0	0	-1
match:				*	*	(	*	)

position:	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14
step 1:		I	▁
smap:	0	0	0
emap:	0	0	0
step 2:				w	i	l	l	▁
smap:				0	-1	-2	-3	-4
emap:				1	0	-1	-2	-3
step 3:									n	'	t
smap:									-3	-3	-3
emap:									-3	-3	-3
step 4:												▁	g	o
smap:												-3	-3	-3	-3
emap:												-3	-3	-3	-3

Tokenization

token lnk calculation

I <0:1> <(1-0-1):(1-0)>

will <2:4> <(3-0-1):(6-2)>

n't <4:7> <(8-3-1):(10-3)>

go <8:10> <(12-3-1):(13-3)>

token	lnk	calculation
I	<0:1>	<(1-0-1):(1-0)>
will	<2:4>	<(3-0-1):(6-2)>
n't	<4:7>	<(8-3-1):(10-3)>
go	<8:10>	<(12-3-1):(13-3)>

Implementations

Package	Language	License	Creator
LKB	Lisp	MIT	Stephan Oepen
PET	C++	LGPL 2.1	Rebecca Dridan
ACE	C	LGPL 2.1	Woodley Packard
agree	C#	?	Glenn Slayden
PyDelphin	Python	MIT	Michael Wayne Goodman

Features

Feature	LKB	PET	ACE	agree	PyDelphin
Perl Compatible Regular Expressions	✔	✔	✔	?	-
Maximum numbered groups		9	9		99
Named capture groups
Default tokenization pattern		`[ \t]+`
Implicit module loading
Iterative application
Available as a command
Available as a library

Output Formats

Implementation	Supported Formats
LKB	yy
PET	string, line, offsets, triples
ACE	string, yy
agree
PyDelphin

Limitations and Bugs

Limitation	LKB	PET	ACE	agree	PyDelphin
Input requires trailing delimiter			✔
Requires configuration file		✔

Other bugs:

PET
- last input line must be terminated with a newline
- reordered groups output in wrong order with --format offsets

Performance

Test	LKB	PET	ACE	agree	PyDelphin
PTB (msec)
PTB (diffs)

Design Suggestions

Only use capture groups when the exact, full, captured string is needed in the replacement. If groups are needed for other purposes (e.g., repetition, optionality), use non-capturing groups.

goodmami/repp.md

Select an option

No results found