Skip to content

Instantly share code, notes, and snippets.

@goodmami
Created November 25, 2019 14:35
Show Gist options
  • Save goodmami/16d907bc2a0e4408456ff596e1e263e6 to your computer and use it in GitHub Desktop.
Save goodmami/16d907bc2a0e4408456ff596e1e263e6 to your computer and use it in GitHub Desktop.
REPP notes

Regular Expression Preprocessing (REPP)

Specification

Modules

Operators

Every operator must appear as the first character on a line (in column 0).

Operator Description Example
! Rewrite-rule !(\w+)('re) → \1 \2
: Tokenization pattern :[ \t]+
< File inclusion <patterns.rpp
# Iterative group definition (see Iterative Groups)
> Group call >groupname
@ Meta-information @$Date: 2018-04-25 13:39:01 -0700$
; Comment ;;; comments

Iterative Groups

Characterization

  • Outside of a match
 left     right
    |     |
 ...|a...b|...

 ...|q...r|...
 ---|
shift
  • Inside a match, outside a capture group
 left     right
    |     |
 ...|a...b(c...d)e...f|...

 ...|q...r(c...d)s...t|...
 ---|
shift
  • Inside a match, inside a capture group
       left     right
          |     |
 ...|a...b(c...d)e...f|...

 ...|q...r(c...d)s...t|...
 ---------|
      shift

Reordered Groups

When captured groups are output in a different order, their span indices are unioned. E.g., consider the following rule:

!(a)(b)(c)  →   →   \1 \3 \2

When applied to the input xyz abc xyz, it yields the following tokens and spans:

<0:3> "xyz"
<4:5> "a"
<5:7> "c"
<5:7> "b"
<8:11> "xyz"

Note that the swapped b and c tokens share the same span. If instead the first two were swapped (\2 \1 \3), then all three would get the same span (<4:7>) because the characterization engine stops tracking at the first out-of-order back-reference (i.e., it does not resume tracking when things get "back on track"). Furthermore, unused groups are also considered out of order (e.g., if the replacement pattern were \2 \3), because the first group \1 was not the first back-reference. Lastly, if the match contained ungrouped text before an out-of-order back-reference (e.g., if the match pattern were !xyz (a)(b)(c)), then that text's span is also part of the union.

Examples

  • Rules

    !wo(n't)    →   will \1
    :[ \t]+
    
  • Rewriting

    position: 0 1 2 3 4 5 6 7 8 9 10 11
    input: I w o n ' t g o
    smap: 1 0 0 0 0 0 0 0 0 0 0 0
    emap: 0 0 0 0 0 0 0 0 0 0 0 -1
    match: * * ( * )
    position: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
    step 1: I
    smap: 0 0 0
    emap: 0 0 0
    step 2: w i l l
    smap: 0 -1 -2 -3 -4
    emap: 1 0 -1 -2 -3
    step 3: n ' t
    smap: -3 -3 -3
    emap: -3 -3 -3
    step 4: g o
    smap: -3 -3 -3 -3
    emap: -3 -3 -3 -3
  • Tokenization

    token lnk calculation
    I <0:1> <(1-0-1):(1-0)>
    will <2:4> <(3-0-1):(6-2)>
    n't <4:7> <(8-3-1):(10-3)>
    go <8:10> <(12-3-1):(13-3)>

Implementations

Package Language License Creator
LKB Lisp MIT Stephan Oepen
PET C++ LGPL 2.1 Rebecca Dridan
ACE C LGPL 2.1 Woodley Packard
agree C# ? Glenn Slayden
PyDelphin Python MIT Michael Wayne Goodman

Features

Feature LKB PET ACE agree PyDelphin
Perl Compatible Regular Expressions ? -
Maximum numbered groups 9 9 99
Named capture groups
Default tokenization pattern [ \t]+
Implicit module loading
Iterative application
Available as a command
Available as a library

Output Formats

Implementation Supported Formats
LKB yy
PET string, line, offsets, triples
ACE string, yy
agree
PyDelphin

Limitations and Bugs

Limitation LKB PET ACE agree PyDelphin
Input requires trailing delimiter
Requires configuration file

Other bugs:

  • PET
    • last input line must be terminated with a newline
    • reordered groups output in wrong order with --format offsets

Performance

Test LKB PET ACE agree PyDelphin
PTB (msec)
PTB (diffs)

Design Suggestions

Only use capture groups when the exact, full, captured string is needed in the replacement. If groups are needed for other purposes (e.g., repetition, optionality), use non-capturing groups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment